Ghislain Bidaut
(ghislain.bidaut@inserm.fr)
IR Cibi Platform (CRCM Integrative Bioinformatics)
SNP detection from NGS data with details on
This will be illustrated through an NGS data analysis in E. coli in order to extract variants.
Prerequisites
Plan
Command line is
As opposed to GUI (Graphics User Interface), which is
Bioinformatics usually start with sequence data from sequencing platform.
In the present case, we are using Fastq files generated by Illumina
We assume QC is done.
1 $ cd data/v-pages/raw-data/Clean/151
2 $ ls
3 total 1459888
4 -rwx------ 1 bidaut staff 344158454 13 oct 10:35 FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz
5 -rwx------ 1 bidaut staff 403297004 13 oct 10:34 FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz
Supported, published and documented all-in-one package
Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
Conda is a general purpose installer for Linux Mac and windows featuring a bioinformatics oriented repository (Bioconda).
It creates environnements in which you can specifies versions of programs that will live in that environment.
Conda will download needed programs from repositories and fix dependancies.
A specific bioinformatics repository exist: Bioconda.
1 $ conda config --add channels defaults
2 $ conda config --add channels conda-forge
3 $ conda config --add channels bioconda
4
5 $ conda create --name subreadalign
6 Fetching package metadata ...............
7 Solving package specifications:
8 Package plan for installation in environment /Users/bidaut/anaconda3/envs/subreadalign:
9
10 Proceed ([y]/n)? yes
11
12 #
13 # To activate this environment, use:
14 # > source activate subreadalign
15 #
16 # To deactivate an active environment, use:
17 # > source deactivate
18 #
1 $ source activate subreadalign
2 (subreadalign) 11:23:28-bidaut@manhattan-2:~$
3 $ conda list
4 # packages in environment at /Users/bidaut/anaconda3/envs/subreadalign:
5 #
6 $ conda search subread
7 Fetching package metadata ...............
8 bioconductor-rsubread 1.22.1 r3.2.2_0 bioconda
9 1.23.0 r3.3.1_0 bioconda
10 1.25.2 r3.3.1_0 bioconda
11 1.25.2 r3.3.2_0 bioconda
12 1.25.2 r3.4.1_0 bioconda
13 1.26.1 r3.4.1_0 bioconda
14 1.28.0 r3.4.1_0 bioconda
15 subread 1.5.0p3 0 bioconda
16 1.5.0 0 bioconda
17 1.5.0.post3 0 bioconda
18 1.5.2 0 bioconda
19 1.5.3 0 bioconda
20 1.5.3 1 bioconda
21 1.6.0 0 bioconda
22 1.6.0 1 bioconda
23 1.6.0 2 bioconda
1 $ conda install subread
2 Fetching package metadata ...............
3 Solving package specifications: .
4
5 Package plan for installation in environment /Users/bidaut/anaconda3/envs/subreadalign:
6
7 The following NEW packages will be INSTALLED:
8
9 subread: 1.6.0-2 bioconda
10 zlib: 1.2.11-0 conda-forge
11
12 Proceed ([y]/n)?
13 zlib-1.2.11-0. 100% |##################################################################################| Time: 0:00:00 276.52 kB/s
14 subread-1.6.0- 100% |##################################################################################| Time: 0:00:02 3.94 MB/s
1 $ conda list
2 # packages in environment at /Users/bidaut/anaconda3/envs/subreadalign:
3 #
4 subread 1.6.0 2 bioconda
5 zlib 1.2.11 0 conda-forge
6 $ subread-align -version
7
8 Subread-align v1.6.0
Subread est installé!
Ref Genome for alignment: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/
! bash
$ mkdir ref-genome
$ cd ref-genome
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa
Annotations used for IGV: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655
$ wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655
1 $ subread-buildindex ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa -o Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa.index
$ cd $(MYFASTADIR)
$ ls
FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz
$ subread-align -T 20 --sv -i ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa -t 1 -r FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_1.fq.gz -R FCHL5CWBBXX_L1_ECOijqRAADFAAPEI-97_2.fq.gz -o 151.bam
$ ls
$ 151.bam
$
$ samtools sort 151.bam > 151.sorted.bam
$ samtools index 151.sorted.bam
$ ls
151.bam 151.sorted.bam 151.sorted.bam.bai
1 exactSNP -T 20 -f 0.3 -b -i 151.sorted.bam.bai -g ref_genome/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa -o 151.vcf
1 $ less 151.vcf
2 ##fileformat=VCFv4.0
3 ##comment=The QUAL values for the SNPs in this VCF file are calculated as min(40, - log_10 (p_value)), where p_value is from the Fisher's Exact Test. The QUAL values for the Indels in this VCF file are always 1.0.
4 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
5 ##INFO=<ID=BGMM,Number=1,Type=Integer,Description="Number of mismatched bases in the background (for SNP only)">
6 ##INFO=<ID=BGTOTAL,Number=1,Type=Integer,Description="Total number of bases in the background (for SNP only)">
7 ##INFO=<ID=MM,Number=1,Type=String,Description="Number of supporting reads for each alternative allele (for SNP only)">
8 ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
9 ##INFO=<ID=SR,Number=1,Type=Integer,Description="Number of supporting reads (for INDEL only)">
10 #CHROM POS ID REF ALT QUAL FILTER INFO
11 Chromosome 26333 . C T 40.0000 . DP=93;MM=93;BGTOTAL=936;BGMM=0
12 Chromosome 363282 . G T 40.0000 . DP=206;MM=87;BGTOTAL=2058;BGMM=0
13 Chromosome 363761 . T C 40.0000 . DP=213;MM=99;BGTOTAL=2119;BGMM=0
14 Chromosome 366409 . C T 2.3222 . DP=2;MM=2;BGTOTAL=19;BGMM=0
15 Chromosome 807318 . G C 40.0000 . DP=42;MM=42;BGTOTAL=256;BGMM=0
16 Chromosome 807830 . C T 40.0000 . DP=107;MM=107;BGTOTAL=1075;BGMM=0
17 Chromosome 807856 . A G 40.0000 . DP=110;MM=110;BGTOTAL=1090;BGMM=0
18 Chromosome 1207789 . C G 40.0000 . DP=122;MM=39;BGTOTAL=1147;BGMM=0
19 Chromosome 1310570 . C T 40.0000 . DP=143;MM=53;BGTOTAL=1211;BGMM=0
20 Chromosome 1310588 . G T 40.0000 . DP=141;MM=79;BGTOTAL=1179;BGMM=1
21 Chromosome 2173362 . CCCG CG 1.0 . INDEL;DP=98;SR=196
22 Chromosome 3560455 . CC CGC 1.0 . INDEL;DP=110;SR=221
23 Chromosome 3945914 . G A 2.5453 . DP=2;MM=2;BGTOTAL=25;BGMM=0
24 Chromosome 4296381 . CC CGCC 1.0 . INDEL;DP=74;SR=146
25 151.filtered.vcf (END)
1 $ exactSNP
2 Version 1.6.0
3
4 Usage:
5
6 ./exactSNP [options] -i input -g reference_genome -o output
7
8 Required arguments:
9
10 -i <file> Specify name of an input file including read mapping results. The
11 [-b if BAM] format of input file can be SAM or BAM (-b needs to be specified
12 if a BAM file is provided).
13 ...
The Integrated Genomics Viewer (IGV) is a genomic browser available at the Broad Institute. It is already installed on Disc servers.
For large data volume analysis, it is recommended to run it from a visualisation server.
The Disc platform Web site give instructions on how to proceed.
IGV Web site: https://software.broadinstitute.org/software/igv/UserGuide
1 # Create directory
2 mkdir my_dir
3 # change dir
4 cd my_dir
5 # list files
6 $ ls
7 # explore an ascii file
8 $ less my_file.txt
9 # seek help on a particular command
10 $ man ls
11 # unzip file
12 $ unzip <my_file.zip>
We have detailed variant search step with Subread.
Other technologies (RNA-Seq, Chip-Seq) can be processed using the same general analysis.
Table of Contents | t |
---|---|
Exposé | ESC |
Full screen slides | e |
Presenter View | p |
Source Files | s |
Slide Numbers | n |
Toggle screen blanking | b |
Show/hide slide context | c |
Notes | 2 |
Help | h |