BINF 6215: command line variant calling
Define the problem
- Ion Torrent sequence for 12 tomato varietal chloroplasts
- One reference genome (NC_007898)
- Map the reads to the reference
- Identify variants specific to each strain
Identify the tools
Obviously you’ve done some of this work already in the previous example. For the steps after data cleaning, that’s where we’re going to learn something new. If you think back to the CLC tools involved in variant calling, after cleaning there is a mapping step, and then the mapping is submitted to either the quality-based variant caller or the probabilistic variant caller. This sequence of program steps is frequently accomplished with a mapper like bowtie, bwa or Novoalign, and the samtools package from the 1000 Genomes Project as variant caller. Another common step in this type of pipeline is to sort the mapping file by coordinates and then use a de-duplication tool to remove likely PCR duplicates.
Step 1-4: Trim — fastx-toolkit
These steps should be the same that you identified for your sequence assembly pipeline. Refer to the previous tutorials and your own previous scripts!
Step 5: Map reads — bowtie2 or bwa
Bowtie and BWA are widely-used read mappers. One or both of these should be preinstalled on your computer. There’s a hidden step here, at least the first time around through the pipeline. Both of them require you to run an explicit step to index the reference genome before you map reads to it. For instance, in bowtie, you’d have to run:
bowtie2-build NC_007898.fasta NC_007898
before mapping each of your reads to the reference using a bowtie-align command like
bowtie2 -x NC_007898 --very-sensitive -U BC17_BINF6350_Summer2014_13pm.fastq -S BC17_BINF6350_Summer2014_13pm.sam
Bowtie’s output to stdout includes a description of how well your reads mapped.
Both of these steps run in a minute or two on the chloroplast data set.
Note: bowtie2 may actually not be the best aligner to use with Ion Torrent data. You could investigate this later by building a different aligner into your pipeline, or changing the Bowtie parameters.
Step 6: Sort — samtools
The samtools package contains many utilities for manipulating files and one of those is the sort option. samtools and bamtools should be installed on your computer. In order to sort a sam file, you actually need to dump it to standard output using the samtools view command, and pipe stdout directly into the sort command to produce a sorted output file. If you look at the help you’ll see that samtools sort only takes a *.bam input. Why? *.bam is the binary form of *.sam, and is much faster to manipulate than the human readable file. And sorting by coordinates isn’t super-fast when data gets large. So a complete two-stage command line would look like:
samtools view -bSu BC17_BINF6350_Summer2014_13pm.sam | samtools sort - BC17_BINF6350_Summer2014_13pm.sorted.bam
This step may throw a warning saying there is a missing EOF but it will still produce its sorted output. According to the SeqAnswers forum, in older versions of samtools this was a known false-positive bug; the version we have on the lab computers may be out of date.
Take a good close look at the samtools documentation, because you can stream a lot of data from one step to the next instead of creating big files at every step, by piping standard output from one samtools command into the next command. It’s designed that way to save hard drive space, but of course that means you can’t check on your intermediate files.
Step 7: Dedupe — picard
Picard is another tool that is a product of the 1000 Genome Project Data Processing Subgroup, like samtools and bamtools. It’s a Java package, so the command line will look more like the Trimmomatic commands shown in the V. vulnificus assembly example. There are a wide variety of read-handling tricks that Picard can do and this is only the beginning. It’s great if you need to sample reads out of a *.sam/*.bam rather than out of an unmapped FastQ, for example.
Picard is a collection of programs. The one that you want is MarkDuplicates.jar. To run a java program at the command line, you need to issue a java command with both java and package options, e.g.
java -Xmx1g -jar /Users/$YOURNAME/Applications/picard-tools-1.117/MarkDuplicates.jar --help
MarkDuplicates.jar requires input, output and a metrics file at minimum, but you also need to tell it whether to assume your input is sorted (true) and that you want it not to just mark the duplicates, but to remove them, and how strict or lenient to be about removal. There are some other options that you can set that are very precise, like a distance on the chip for filtering optical duplicates.
java -Xmx2g -jar /Users/cgibas/Applications/picard-tools-1.117/MarkDuplicates.jar INPUT=BC17_BINF6350_Summer2014_13pm.sorted.bam OUTPUT=BC17_BINF6350_Summer2014_13pm.dedupe.bam METRICS_FILE=out.metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
Step 8: Variants — samtools
The final thing that you need to do is actually call your variants out of the *.bam file. First you have to index the bam file, turning it into a bai (bam index). Then you use mpileup to actually produce a variant call file from the aligned sequences.
samtools index BC17_BINF6350_Summer2014_13pm.dedupe.bam
samtools mpileup -uf NC_007898.fasta BC17_BINF6350_Summer2014_13pm.dedupe.bam | bcftools view -bvcg - > BC17_BINF6350_Summer2014_13pm.out.bcf
Note: If Picard has failed to work on the lab computers, you can take the output of step 6 and pass it into step 8 just to see what the workflow and output look like. If you were doing this to make real conclusions on this dataset, that dedupe step would be nice to have, because duplicate reads will make some “variants” look more important than they are.
If you want a non-binary, viewable variant call file you can use bcftools to convert your *.bcf into a *.vcf:
bcftools view BC17_BINF6350_Summer2014_13pm.out.bcf > BC17_BINF6350_Summer2014_13pm.vcf
Pipeline recap
Happens once (outside do loop)
bowtie2-build NC_007898.fasta NC_007898
Happens to each sample (inside do loop) — trim steps followed by:
bowtie2 -x NC_007898 --very-sensitive -U BC17_BINF6350_Summer2014_13pm.fastq -S BC17_BINF6350_Summer2014_13pm.sam
samtools view -bSu BC17_BINF6350_Summer2014_13pm.sam | samtools sort - BC17_BINF6350_Summer2014_13pm.sorted.bam
java -Xmx2g -jar /Users/cgibas/Applications/picard-tools-1.117/MarkDuplicates.jar INPUT=BC17_BINF6350_Summer2014_13pm.sorted.bam OUTPUT=BC17_BINF6350_Summer2014_13pm.dedupe.bam METRICS_FILE=out.metrics REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
samtools index BC17_BINF6350_Summer2014_13pm.dedupe.bam
samtools mpileup -uf NC_007898.fasta BC17_BINF6350_Summer2014_13pm.dedupe.bam | bcftools view -bvcg - > BC17_BINF6350_Summer2014_13pm.out.bcf
bcftools view BC17_BINF6350_Summer2014_13pm.out.bcf > BC17_BINF6350_Summer2014_13pm.vcf
Pipeline conversion
Again, it’s not too hard to convert our series of steps to a reusable script that will run on every file in the directory, with only the reference genome basename as input. We have not added any variables in this script that weren’t used in the last one.
The bpipe group also has a tutorial on how to convert a similar pipeline to a bpipe script.
Practice this pipeline on one of your chloroplast samples, then convert it either to a shell script or a bpipe script, your choice. Run it on all the data. Add pipeline steps to make the pipeline clean up after itself. For instance, you could delete certain intermediate files, or move them into a subdirectory that gets gzipped for archiving at the end of the script.
Note: when converting these commands to a script on MacOSX, some (especially the streamed samtools commands) had encoding problems. Try inserting
LC_ALL=C
In front of commands that are giving you trouble, if that trouble is “looking like the command is interpreting the filename character by character as a series of potential flags”.
Yes I did spend Saturday night figuring this out. Why do you ask?
Step 9: Comparison of variants
If you’ve got a Linux or Windows machine, VARB is a single purpose tool for visualization of sequence variants in the form of VCF files. You can’t really make it part of your pipeline, but you can take your outputs along with your GFF file and reference genome and open them in VARB afterwards. You can also visualize VCF tracks with the IGV tools.
Or, you can do genome math on your intervals with bedtools. Check out this little tutorial on bedtools subtract to see how bedtools math works.
Example: filtering for common SNPs
bedtools intersect -a BC17_BINF6350_Summer2014_13pm.vcf -b BC19_BINF6350_Summer2014_13pm.vcf > common_snps_BC17_BC19.vcf
Example: filtering for unique SNPs
bedtools subtract -a BC17_BINF6350_Summer2014_13pm.vcf -b common_snps_BC17_BC19.vcf > unique_BC17.vcf
Note: when converting these commands to a script on MacOSX, some (especially the streamed samtools commands) had encoding problems. Try inserting
LC_ALL=C
In front of commands that are giving you trouble, if that trouble is “looking like the command is interpreting the filename character by character as a series of potential flags”.