BINF 6203: Lab 8 — Simple variant calling
This lab returns to the tomato chloroplast genome data. You QCed the data in Lab 2, assembled it in Lab 3, produced a consensus sequence from mapped reads in Lab 4, and calculated coverage for each feature in the genome in Lab 6, so you should know this data set well and be able to do a lot with it.
From a biological perspective, this data is interesting because it represents organelle genomes from a set of closely related tomato cultivars. Although they all belong to the same species, cultivars as different as the German Cherry and the Cherokee Purple have obviously undergone some divergence.
Your goal in this exercise is to find out:
- How does the genomic divergence between these cultivars show up in terms of individual sequence variations in the chloroplast genome?
- Are variations between the genomes primarily SNPs, or small insertions or deletions? We’ll go over the VCF file format in class so you can see how to tell.
- How are the individual SNPs and/or indels dispersed in the chloroplast genome? Do they disproportionately affect particular genes or regions?
- Can you see any evidence of heterogeneity in the chloroplast genome sequence in these samples (i.e., more than one variant at the same site with respect to the reference genome)?
Use the sequencing data for all 12 samples to answer these questions.
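As a starting point for the SNP-versus-indel and multiple-variant questions above, here is a minimal Python sketch, not a required part of the workflow. It assumes an uncompressed, tab-delimited VCF whose REF and ALT values sit in columns 4 and 5. The function names, the `cp` sequence name, and the example records are all hypothetical and only illustrate the idea: an allele is a SNP when REF and ALT are both one base long, an indel when their lengths differ, and a site with a comma-separated ALT field carries more than one alternate allele.

```python
def classify_allele(ref, alt):
    """Label one REF/ALT pair as 'SNP', 'insertion', 'deletion', or 'other'."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length but longer than 1 base, e.g. a multi-base substitution


def summarize_vcf_lines(lines):
    """Count variant types and list sites with more than one alternate allele."""
    counts = {}
    multiallelic = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt_field = fields[0], int(fields[1]), fields[3], fields[4]
        # Ignore missing ('.') and symbolic ('<...>') alternate alleles.
        alts = [a for a in alt_field.split(",") if a != "." and not a.startswith("<")]
        if len(alts) > 1:
            multiallelic.append((chrom, pos))
        for alt in alts:
            kind = classify_allele(ref, alt)
            counts[kind] = counts.get(kind, 0) + 1
    return counts, multiallelic


# Hypothetical records, made up for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "cp\t120\t.\tA\tG\t50\t.\t.",
    "cp\t300\t.\tAT\tA\t40\t.\t.",
    "cp\t450\t.\tC\tCTT\t35\t.\t.",
    "cp\t900\t.\tG\tA,T\t60\t.\t.",
]

counts, multiallelic = summarize_vcf_lines(EXAMPLE_VCF)
print(counts)
print(multiallelic)
```

To run this on real output, pass an open file handle for a text VCF in place of `EXAMPLE_VCF`; a binary BCF would first need to be converted to text.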
In your report
Describe the workflow of tools (most of which you have already learned) that you will need to carry out this project. You may need to look at manual pages for some tools that we have already learned in order to tweak how you use them to get the right answers. Be thorough in your description of methods, so that another person could reproduce your work exactly.
Use genome browser plots, tables, or other figures to summarize your findings as appropriate.
One new thing to learn before you start
You’ve already learned most of the tools that you will need to complete this project. However, there’s one thing that we haven’t done yet. You can run samtools mpileup on a BAM file, as we did in an intermediate step in Lab 4, and then use bcftools to produce a binary variant call file (BCF) instead of a consensus. This process is outlined in a previous tutorial HERE. The bcftools view parameter settings are where you should look if you want to change how many alternate alleles you think are possible (which might be helpful in detecting multiple forms of the chloroplast genome).
HINT: Recall from the bedtools lab a couple of weeks ago that a variant call file in VCF format can be overlapped with a GFF feature file for the same reference genome, and that the overlaps can be reported so they show which genes contain variants.
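The two-step pipeline described above can be sketched in Python as follows. This is a hedged illustration, not the tutorial's exact commands: which form applies depends on the samtools/bcftools version installed on your machine, so check the manual pages first. The "legacy" form follows the samtools 0.1.x documentation, where bcftools view does the calling. The "modern" form follows the bcftools 1.x documentation, where calling moved to bcftools call and `-m` selects the multiallelic caller. The function names and file names are hypothetical.

```python
import subprocess


def legacy_commands(reference, bam):
    """samtools 0.1.x style: mpileup piped into 'bcftools view' for calling."""
    pileup = ["samtools", "mpileup", "-u", "-f", reference, bam]
    # -b: write BCF, -v: variant sites only, -c: call variants, -g: call genotypes
    call = ["bcftools", "view", "-b", "-v", "-c", "-g", "-"]
    return pileup, call


def modern_commands(reference, bam):
    """bcftools 1.x style: 'bcftools mpileup' piped into 'bcftools call'."""
    pileup = ["bcftools", "mpileup", "-O", "u", "-f", reference, bam]
    # -m: multiallelic caller, -v: variant sites only, -O b: write compressed BCF
    call = ["bcftools", "call", "-m", "-v", "-O", "b"]
    return pileup, call


def run_pipeline(producer_cmd, consumer_cmd, out_path):
    """Run 'producer | consumer > out_path' and raise if either step fails."""
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout, stdout=out)
        producer.stdout.close()  # let the producer see a closed pipe if the consumer exits
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            "pipeline failed: producer exit %d, consumer exit %d" % (producer_rc, consumer_rc)
        )


# Example use (hypothetical file names; requires the tools to be installed):
# pileup, call = modern_commands("chloroplast_ref.fasta", "sample01.sorted.bam")
# run_pipeline(pileup, call, "sample01.raw.bcf")
```

Building the commands as lists and looping over the 12 BAM files keeps the settings identical across samples, which makes the methods easier to report and reproduce.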
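To make the logic of that overlap concrete, here is a small pure-Python sketch of what an intersection between a VCF and a GFF computes. It is an illustration under stated assumptions, not a replacement for bedtools: it assumes tab-delimited text files, uses only the VCF POS column (so a deletion that starts just before a gene is not counted for that gene), and relies on both formats using 1-based coordinates. The function names, gene names, and coordinates are hypothetical.

```python
def parse_gff_features(lines, feature_type="gene"):
    """Return (seqid, start, end, attributes) for each GFF feature of one type."""
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        features.append((fields[0], int(fields[3]), int(fields[4]), fields[8]))
    return features


def parse_vcf_sites(lines):
    """Return (chrom, pos) for each VCF data line."""
    sites = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sites.append((fields[0], int(fields[1])))
    return sites


def variants_per_feature(features, sites):
    """Count variant positions inside each feature (GFF ends are inclusive)."""
    hits = {}
    for seqid, start, end, attributes in features:
        n = sum(1 for chrom, pos in sites if chrom == seqid and start <= pos <= end)
        if n:
            hits[attributes] = n
    return hits


# Hypothetical features and variant sites, made up for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3",
    "cp\texample\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "cp\texample\tgene\t800\t1000\t.\t-\t.\tID=geneB",
    "cp\texample\tgene\t2000\t2500\t.\t+\t.\tID=geneC",
]
EXAMPLE_SITES_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "cp\t120\t.\tA\tG\t50\t.\t.",
    "cp\t300\t.\tAT\tA\t40\t.\t.",
    "cp\t700\t.\tC\tT\t45\t.\t.",
    "cp\t900\t.\tG\tA,T\t60\t.\t.",
]

features = parse_gff_features(EXAMPLE_GFF)
sites = parse_vcf_sites(EXAMPLE_SITES_VCF)
print(variants_per_feature(features, sites))
```

Dividing each count by the feature length (end - start + 1) gives a rough variants-per-base figure, which is one way to judge whether particular genes are disproportionately affected.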