Now that you’ve implemented one workflow in a bash script, the challenge is to take the skills you’ve learned and implement a different script, on your own. The workflow that we’ll use is variant calling from the chloroplast data. The end product will be a collection of variant call files (VCF format) which can be displayed in a genome browser to show you the position of your variants.
You can definitely repurpose your assembly script to carry out this workflow — you don’t have to do everything over again from scratch. The result is due in 2 weeks. What you should turn in? — the script, some kind of proof that it works on the input files, and a screenshot with a plot of your VCF results from the IGV browser.
Define the problem
- Ion Torrent sequence for 12 tomato varietal chloroplasts
- One reference genome (NC_007898)
- Map the reads to the reference
- Identify variants specific to each strain
Identify the tools
Data cleaning for this kind of study is the same as it was for assembly. After data cleaning, there is a mapping step, and then the mapping is submitted a variant caller to identify SNPs.
At the command line, his sequence of program steps is frequently accomplished with a mapper like bowtie2, and using the samtools package from the 1000 Genomes Project as variant caller. Another common step in this type of pipeline is to sort the mapping file by genomic coordinates and then use a de-duplication tool to remove likely PCR duplicates.
Step 1: Clip, trim and QC — fastx-toolkit
These steps should be the same ones that you identified for your sequence assembly pipeline. Refer to the previous tutorials and your own previous scripts!
Step 2: Map reads — bowtie2
Bowtie2 is a widely-used read mapper. It should be preinstalled on your computer. As mentioned in class, there’s a hidden step in read mapping, at least the first time around through the pipeline. You need to index the reference genome before you map reads to it. In bowtie, you’d have to run:
bowtie2-build NC_007898.fasta NC_007898
before mapping each of your reads to the reference using a bowtie-align command like:
bowtie2 -x NC_007898 --very-sensitive -U BC17_BINF6350_Summer2014_13pm.fastq -S BC17_BINF6350_Summer2014_13pm.sam
In the bowtie2-build command, the options are the name of the input fasta file, and then a basename for multiple index files that bowtie-build will create. Then in the bowtie2 command, the -x flag is followed by the name of the index, by a descriptive flag which defines how sensitive you want the mapping to be, and by input unpaired reads (-U) and output sam (-S) filenames.
Bowtie’s output to stdout will include a description of how well your reads mapped.
Both of these steps run in a minute or two on the chloroplast data set.
Step 3: Sort — samtools
The samtools package contains many utilities for manipulating files and one of those is the sort option. samtools and bamtools should be installed on your computer. In order to sort a sam file, you actually need to dump it to standard output using the samtools view command, which turns it into a binary bam file if the option -bSu is used, and pipe stdout directly into the sort command to produce a sorted output file.
If you look at the help you’ll see that samtools sort only takes a *.bam input. Why? *.bam is the binary form of *.sam, and is much faster to manipulate than the human readable file. And sorting by coordinates isn’t super-fast when data gets large. So a complete two-stage command line for this step would look like:
samtools view -bSu BC17_BINF6350_Summer2014_13pm.sam | samtools sort - BC17_BINF6350_Summer2014_13pm.sorted.bam
This step may throw a warning saying there is a missing EOF but it will still produce its sorted output.
You’ll definitely want to check out the samtools documentation. You can almost always stream data from one samtools step to the next instead of creating big files at every step, by piping standard output from one samtools command into the next command. It’s designed that way to save hard drive space.
Step 4: Variants — samtools
The final thing that you need to do is actually call your variants out of the *.bam file. First you have to index the bam file, turning it into a bai (bam index). Then you use another samtools command, mpileup, to actually produce a variant call file from the aligned sequences.
samtools index BC17_BINF6350_Summer2014_13pm.dedupe.bam
samtools mpileup -uf NC_007898.fasta BC17_BINF6350_Summer2014_13pm.dedupe.bam | bcftools view -bvcg - > BC17_BINF6350_Summer2014_13pm.out.bcf
To create a non-binary, viewable variant call file you can also use bcftools to convert your *.bcf into a *.vcf:
bcftools view BC17_BINF6350_Summer2014_13pm.out.bcf > BC17_BINF6350_Summer2014_13pm.vcf
Now open your VCF file, along with the reference genome, reference annotation, and your read mapping file (sam or bam) in the IGV browser. The browser is pretty easy to use.
Get familiar with all of these commands and test them on one of your chloroplast genome files. Once you’ve made sure that they work, turn them into an automated script.