Ghislain Bidaut
(ghislain.bidaut@inserm.fr)
IR Cibi Platform (CRCM Integrative Bioinformatics)
Web: http://cibi.marseille.inserm.fr
Forge: http://forgecrcm.marseille.inserm.fr/projects/cibi
SNP detection from NGS data with details on
This will be illustrated through an NGS data analysis in E. coli in order to extract variants.
Prerequisites
Plan
Command line is
As opposed to GUI (Graphics User Interface), which is
Bioinformatics usually start with sequence data from sequencing platform.
In the present case, we are using Fastq files generated by Illumina
We assume QC is done.
1 $ cd data/v-pages/raw-data/Clean/151
2 $ ls
3 total 1459888
4 -rwx------  1 bidaut  staff  344158454 13 oct 10:35 FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz
5 -rwx------  1 bidaut  staff  403297004 13 oct 10:34 FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz
Supported, published and documented all-in-one package
Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
Conda is a general purpose installer for Linux Mac and windows featuring a bioinformatics oriented repository (Bioconda).
It creates environnements in which you can specifies versions of programs that will live in that environment.
Conda will download needed programs from repositories and fix dependancies.
A specific bioinformatics repository exist: Bioconda.
 1 $ conda config --add channels defaults
 2 $ conda config --add channels conda-forge
 3 $ conda config --add channels bioconda
 4 
 5 $ conda create --name subreadalign
 6 Fetching package metadata ...............
 7 Solving package specifications:
 8 Package plan for installation in environment /Users/bidaut/anaconda3/envs/subreadalign:
 9 
10 Proceed ([y]/n)? yes
11 
12 #
13 # To activate this environment, use:
14 # > source activate subreadalign
15 #
16 # To deactivate an active environment, use:
17 # > source deactivate
18 #
 1 $ source activate subreadalign
 2 (subreadalign) 11:23:28-bidaut@manhattan-2:~$
 3 $ conda list
 4 # packages in environment at /Users/bidaut/anaconda3/envs/subreadalign:
 5 #
 6 $ conda search subread
 7 Fetching package metadata ...............
 8 bioconductor-rsubread        1.22.1                 r3.2.2_0  bioconda
 9 1.23.0                 r3.3.1_0  bioconda
10 1.25.2                 r3.3.1_0  bioconda
11 1.25.2                 r3.3.2_0  bioconda
12 1.25.2                 r3.4.1_0  bioconda
13 1.26.1                 r3.4.1_0  bioconda
14 1.28.0                 r3.4.1_0  bioconda
15 subread                      1.5.0p3                       0  bioconda
16 1.5.0                         0  bioconda
17 1.5.0.post3                   0  bioconda
18 1.5.2                         0  bioconda
19 1.5.3                         0  bioconda
20 1.5.3                         1  bioconda
21 1.6.0                         0  bioconda
22 1.6.0                         1  bioconda
23 1.6.0                         2  bioconda
 1 $ conda install subread
 2 Fetching package metadata ...............
 3 Solving package specifications: .
 4 
 5 Package plan for installation in environment /Users/bidaut/anaconda3/envs/subreadalign:
 6 
 7 The following NEW packages will be INSTALLED:
 8 
 9 subread: 1.6.0-2  bioconda
10 zlib:    1.2.11-0 conda-forge
11 
12 Proceed ([y]/n)?
13 zlib-1.2.11-0. 100% |##################################################################################| Time: 0:00:00 276.52 kB/s
14 subread-1.6.0- 100% |##################################################################################| Time: 0:00:02   3.94 MB/s
1 $ conda list
2 # packages in environment at /Users/bidaut/anaconda3/envs/subreadalign:
3 #
4 subread                   1.6.0                         2    bioconda
5 zlib                      1.2.11                        0    conda-forge
6 $  subread-align -version
7 
8 Subread-align v1.6.0
Subread est installé!
Ref Genome for alignment: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/
! bash
$ mkdir ref-genome
$ cd ref-genome
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa
Annotations used for IGV: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655
1 $  subread-buildindex ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa  -o Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa.index
$ cd $(MYFASTADIR)
$ ls
FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz  FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz
$ subread-align -T 20 --sv -i ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa -t 1 -r FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz  -R FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz -o 151.bam
$ ls
$ 151.bam
$
$ samtools sort 151.bam > 151.sorted.bam
$ samtools index 151.sorted.bam
$ ls
151.bam         151.sorted.bam       151.sorted.bam.bai
1 exactSNP -T 20 -f 0.3  -b -i 151.sorted.bam.bai -g ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa -o 151.vcf
 1 $ less 151.vcf
 2 ##fileformat=VCFv4.0
 3 ##comment=The QUAL values for the SNPs in this VCF file are calculated as min(40, - log_10 (p_value)), where p_value is from the Fisher's Exact Test. The QUAL values for the Indels in this VCF file are always 1.0.
 4 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
 5 ##INFO=<ID=BGMM,Number=1,Type=Integer,Description="Number of mismatched bases in the background (for SNP only)">
 6 ##INFO=<ID=BGTOTAL,Number=1,Type=Integer,Description="Total number of bases in the background (for SNP only)">
 7 ##INFO=<ID=MM,Number=1,Type=String,Description="Number of supporting reads for each alternative allele (for SNP only)">
 8 ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
 9 ##INFO=<ID=SR,Number=1,Type=Integer,Description="Number of supporting reads (for INDEL only)">
10 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
11 Chromosome      26333   .       C       T       40.0000 .       DP=93;MM=93;BGTOTAL=936;BGMM=0
12 Chromosome      363282  .       G       T       40.0000 .       DP=206;MM=87;BGTOTAL=2058;BGMM=0
13 Chromosome      363761  .       T       C       40.0000 .       DP=213;MM=99;BGTOTAL=2119;BGMM=0
14 Chromosome      366409  .       C       T       2.3222  .       DP=2;MM=2;BGTOTAL=19;BGMM=0
15 Chromosome      807318  .       G       C       40.0000 .       DP=42;MM=42;BGTOTAL=256;BGMM=0
16 Chromosome      807830  .       C       T       40.0000 .       DP=107;MM=107;BGTOTAL=1075;BGMM=0
17 Chromosome      807856  .       A       G       40.0000 .       DP=110;MM=110;BGTOTAL=1090;BGMM=0
18 Chromosome      1207789 .       C       G       40.0000 .       DP=122;MM=39;BGTOTAL=1147;BGMM=0
19 Chromosome      1310570 .       C       T       40.0000 .       DP=143;MM=53;BGTOTAL=1211;BGMM=0
20 Chromosome      1310588 .       G       T       40.0000 .       DP=141;MM=79;BGTOTAL=1179;BGMM=1
21 Chromosome      2173362 .       CCCG    CG      1.0     .       INDEL;DP=98;SR=196
22 Chromosome      3560455 .       CC      CGC     1.0     .       INDEL;DP=110;SR=221
23 Chromosome      3945914 .       G       A       2.5453  .       DP=2;MM=2;BGTOTAL=25;BGMM=0
24 Chromosome      4296381 .       CC      CGCC    1.0     .       INDEL;DP=74;SR=146
25 151.filtered.vcf (END)
 1 $ exactSNP
 2 Version 1.6.0
 3 
 4 Usage:
 5 
 6 ./exactSNP [options] -i input -g reference_genome -o output
 7 
 8 Required arguments:
 9 
10 -i <file>  Specify name of an input file including read mapping results. The
11 [-b if BAM] format of input file can be SAM or BAM (-b needs to be specified
12 if a BAM file is provided).
13 ...
The Integrated Genomics Viewer (IGV) is a genomic browser available at the Broad Institute. It is already installed on Disc servers.
For large data volume analysis, it is recommended to run it from a visualisation server.
The Disc platform Web site give instructions on how to proceed.
IGV Web site: https://software.broadinstitute.org/software/igv/UserGuide
 1 # Create directory
 2 mkdir my_dir
 3 # change dir
 4 cd my_dir
 5 # list files
 6 $ ls
 7 # explore an ascii file
 8 $ less my_file.txt
 9 # seek help on a particular command
10 $ man ls
11 # unzip file
12 $ unzip <my_file.zip>
We have detailed variant search step with Subread.
Other technologies (RNA-Seq, Chip-Seq) can be processed using the same general analysis.

| Table of Contents | t | 
|---|---|
| Exposé | ESC | 
| Full screen slides | e | 
| Presenter View | p | 
| Source Files | s | 
| Slide Numbers | n | 
| Toggle screen blanking | b | 
| Show/hide slide context | c | 
| Notes | 2 | 
| Help | h |