BINF 6215: Basic variant calling in Galaxy
Remember the chloroplast variant calling tutorial? Turns out, you can implement the same thing in Galaxy. (Sort of).
Since the chloroplast files are very small, I’m recommending you do this on Galaxy Main, because some of the tools in this sequence mysteriously did not work in my own Galaxy, and those are not always issues that can be fixed in real time.
Identify the tools
If you think back to the tools involved in our simple variant calling pipeline, the major steps are data cleaning, then a mapping step, then removing duplicates in the mapping, and then the mapping is submitted to the pileup tool for variant calling. However, we also had to do some “glue” steps that were pure data manipulation, to get files into the right format for the subsequent step.
- some kind of sequence quality trimmer
- a mapper (bowtie2, bwa, novoalign)
- sam to bam
- picard mark duplicates
- samtools index
- samtools mpileup
- bedtools to compare variant lists
Now you need to check your Galaxy to see which of these tools are installed and, if a step is not available exactly as you carried it out on the command line, whether something else can be substituted, or whether (as is true in some cases, like producing sorted SAM/BAM files) the Galaxy interface is silently taking care of a step for you.
This is a pretty bare-bones set of instructions and is not the only possible variant calling workflow in Galaxy. I encourage you to explore your options and make your own. The workflow below is what I captured after testing out various steps and deleting unsatisfactory outputs.
Input and filtering steps
You’re going to have three inputs (as you’re testing this out on just one of your chloroplast data sets to start with). The fastq file, your reference fasta file, and your reference GFF with gene regions. The GFF file is just going to get converted to BED format so that you can use the BED tools to make intersections later. The FASTQ file needs to get trimmed and filtered as usual.
Mapping step and sam manipulation
Then you can use Bowtie2 to align. You should play around with the advanced settings before settling on a workflow to run on all of your chloroplast Fastq files. See what happens when you use the very sensitive instead of sensitive mapping (something I’ve seen recommended for Ion Torrent data) and how you can change the threshholds for accepting a mapping. See what impact that has on your Bowtie2 output in terms of reads mapped and unmapped. This should run pretty fast on the server or on your own Galaxy.
Pileup and variant calling
There are actually two tools you can use here — Mpileup and the Naive Variant Caller. On the data set that I was testing this on, I was actually able to see variant sites with the NVC that I did not detect with MPileup (at least with Galaxy’s default settings) and the MPileup results were also apparently different from what I produced running this on the command line. So again, you’ll have to tinker. When you’re building the sequence of events manually, it will look like you only get a pileup output from the MPileup step, but in the workflow builder, you see that there are both MPileup and bcf (the binary version of a variant call file) outputs available. You can connect bcftools to that output and convert it into a vcf file to contrast with your Naive Variant Caller vcf file.
With the Naive Variant Caller you can choose to output only the sites that actually have a variant, which makes it a lot easier to look at your vcf file.
I can (theoretically) visualize VCF tracks in Trackster if they’re converted to the right format. What I did was use a couple of different versions of the VCF/BED intersection tool with the vcf files and the *.bed file derived from my genomic *.gff annotation file. This should result in a list of only the variant calls that fall within annotated genomic regions. During my little Saturday night Galaxy session, however, Trackster kept erroring out when trying to display files that I’ve never had any trouble displaying before. Try opening your tracks in Trackster to see what happens. If not, we’ve got other tools to visualize them outside of Galaxy.
VCF files are gigantic and no fun to view in their text format. If you want to see your Galaxy VCFs visualized, you can download them out of Galaxy and open them in a genome browser. You guys used IGB in BINF 6203, and here’s what VCF files look like opened in IGV (the Broad Institute’s genome viewer). These viewers have slightly different features and interfaces and it’s mostly down to personal preference for look and feel which ones you end up using. As you can see, you can also load your bam files into IGV to see what coverage looks like. You can use the bedtools in IGV to make intersections between tracks as well.