BINF 6203: Simple variant calling
This lab takes us back to the tomato chloroplast genome data again. You’ve QCed the chloroplast data, assembled it, mapped it, used it to learn genome track math with bedtools, and produced a consensus sequence from mapped reads.
From a biological perspective this data is interesting because it represents organelle genomes from a closely related set of tomato cultivars. While all of these cultivars are the same species, tomatoes as different as the German Cherry and the Cherokee Purple have obviously undergone some divergence. Here, you are trying to see where that divergence is represented in the chloroplast sequence.
Your goal in this exercise is to find out:
- How does the genomic divergence between these cultivars show up in terms of individual sequence variations in the chloroplast genome?
- Are variations between the genomes primarily SNPs, or small insertions or deletions? We’ll go over the VCF file format in class so you can see how to tell.
- How are the individual SNPs and/or indels dispersed in the chloroplast genome? Do they disproportionately affect particular genes or regions?
- Can you see any evidence of heterogeneity in the chloroplast genome sequence in these samples (i.e. more than one variant at the same site, with respect to the reference genome?)
Use the sequencing data for all 12 samples to answer this question.
In your report
Describe the workflow of tools (most of which you have already learned) that you will need to carry out this project, starting from the beginning (assuming you have raw data and need to start from square one). For these data, it may be worth exploring the results of being more strict about quality score cutoffs than you were when the goal was mapping and creating a consensus. But there will also be a trade off with throwing away too much of a data set that is already rather small.
You may need to look at manual pages for some tools that we have already learned in order to tweak how you use them to get the right answers. Be thorough in your description of methods, so that another person could reproduce your work exactly.
Use genome browser plots, tables, or other figures to summarize your findings as appropriate.
One new thing to learn before you start
You’ve already learned most of the tools that you will need to use to complete this project. However, there’s one thing that we haven’t done yet. You need to run samtools mpileup on a bam file, the same thing we did as an intermediate step in the read mapping lab, and use bcftools to produce a variant call file, which you will use directly instead of proceeding to a consensus. This means following the approach in the read mapping lab, but stopping before you produce a consensus fasta with vcfutils.pl.
samtools mpileup -uf reference.genome.fa infile.bam | bcftools call -c filename.out
The bcftools call parameter settings are where you should look if you want to change how many alternate alleles you think are possible (which might be helpful in detecting multiple forms of the chloroplast genome). The current version of bcftools has two calling methods — the multiallelic caller and the standard caller (bedtools call -c). We are looking, of course, at data from a haploid genome and comparing it to a close reference. Thus you might expect that only one variant would exist in the sample and you do not need to use the multiallelic caller. There is some evidence that chloroplast sequences from a single plant may show chimerism (i.e. chloroplasts in different cells will have slightly different sequences, just like human cells at different places in the body can show chimerism due to different somatic mutations). If you used the multiallelic caller you might be able to see such phenomena if they occurred. You can set ploidy and choose which caller to use following the options in the current version of the documentation.
HINTS: If you remember from the bedtools lab a couple of weeks ago, a variant call file in VCF format can be overlapped with a GFF feature file for the same reference genome, and you can make the overlaps show which genes have variants. BCF or VCF files can also be loaded into the genome viewers (IGB, IGV) that you have tried out before.