BINF 6203: Variant Detection with CLC Genomics
This last action question is not difficult or time-consuming to run. You will use the chloroplast data files that we used for the earlier exercises.
The goal is to identify variant regions in the different kinds of tomato chloroplasts relative to the reference genome. Remember that the sample ending in *29 was sequenced from the Heinz 1706-BG tomato, which is the reference cultivar. Compare the sequencing results for this sample to the published reference genome to see whether there are any variants between the chloroplasts isolated at UNCC and the original reference. Then pick three other tomato cultivars to analyze and compare. I suggest using the data sets with the most sequence, because depth of coverage is one criterion for whether the program will call a variant or not.
The steps in the process are:
1. Map a set of reads (NGS Core Tools/Map Reads to Reference)
Create a “stand alone read mapping” — that will save your mapping with all the read positions as a new CLC object. Also generate reports so you can see how many reads map, etc.
2. Detect variants (using Resequencing Tools/Quality-Based Variant Detection)
This method focuses on the quality of the sequences surrounding each site when attempting to call variants. Variants in low-quality regions won’t be included in counts for variant calling, so pay attention to the thresholds for calling: the neighborhood is 5 bases on either side, the maximum number of mismatches is 2, and quality at the site must be at least 20 while the average neighborhood quality should be at least 15. Given the overall quality of your Ion Torrent reads it is OK to use these thresholds.
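To make the thresholds concrete, here is a minimal Python sketch of a neighborhood quality check for one read at one site. It is not CLC's implementation: the function name, the ungapped-alignment assumption, and the choices to include the central base in the average and to exclude it from the mismatch count are all assumptions made for illustration.

```python
def passes_neighborhood_quality(read_bases, ref_bases, quals, site,
                                radius=5, max_mismatches=2,
                                min_site_quality=20, min_avg_quality=15):
    """Return True if the read base at `site` may be counted toward a call.

    Assumes read_bases, ref_bases and quals are aligned position by
    position with no gaps (a simplification for illustration only).
    """
    start = site - radius
    end = site + radius + 1
    # Assumption: sites without a full neighborhood are not counted.
    if start < 0 or end > len(read_bases):
        return False
    # Quality at the central site must reach the minimum.
    if quals[site] < min_site_quality:
        return False
    # Average quality over the neighborhood (central base included here).
    window = quals[start:end]
    if sum(window) / len(window) < min_avg_quality:
        return False
    # Count mismatches in the neighborhood, excluding the site itself.
    mismatches = sum(
        1 for i in range(start, end)
        if i != site and read_bases[i] != ref_bases[i]
    )
    return mismatches <= max_mismatches
```

For example, a read with a single high-quality mismatch at the site and a clean neighborhood passes, while the same read with a site quality of 10 does not.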
You can avoid calling variants in regions with low coverage, and you can set a threshold for how many times the variant has to occur in the sequences to be included (by default, at least 35% of the reads mapped at that position must show the variant relative to the reference). Remember in this case that the chloroplast genome is haploid so your ploidy number should be 1.
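The coverage and frequency filters can be sketched in the same way. This is an illustrative Python sketch, not the CLC algorithm: the function name is hypothetical, and the minimum coverage of 10 is an assumed value rather than a setting taken from this exercise. Only the 35% fraction and the haploid assumption come from the text above.

```python
from collections import Counter

def call_haploid_variant(ref_base, observed_bases,
                         min_coverage=10, min_fraction=0.35):
    """Return (alt_base, fraction) for a haploid call, or None.

    observed_bases holds the read bases that passed quality filtering
    at one reference position. min_coverage=10 is an assumed value.
    """
    coverage = len(observed_bases)
    # Do not call variants where too few reads cover the position.
    if coverage < min_coverage:
        return None
    # Tally only the bases that differ from the reference.
    alt_counts = Counter(b for b in observed_bases if b != ref_base)
    if not alt_counts:
        return None
    # Ploidy 1: report at most the single most frequent alternative.
    alt_base, alt_count = alt_counts.most_common(1)[0]
    fraction = alt_count / coverage
    if fraction < min_fraction:
        return None
    return alt_base, fraction
```

With six reads showing the reference A and four showing G, the G fraction is 0.4 and the site would be called; with seven A and three G it falls below 35% and would not.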
3. Save your mapping
4. Detect variants (using Resequencing Tools/Probabilistic Variant Detection)
You can also try out the probabilistic variant detection method, which uses the observed data in the absence of a reference. The method is described here. For this data, and because you are mostly focused on finding differences between the reads and the reference, it will probably not produce results that differ much from the quality-based variant detection method. However, if you suspected that variant sites in the chloroplast might have more than one form in the same sample (i.e. there is some chimerism involved, with some of the cells in the plant having slightly variant chloroplasts) you could set the ploidy number higher than 1 and see what changes.
5. Save this alternate mapping and compare
You can compare your mapping results, as well as comparing them to tracks from the published chloroplast reference genome, by creating a track list that includes multiple mapping results tracks as well as annotation tracks. Use the annotation tracks to help figure out where the SNPs fall relative to genes and other features. You can also look at the SNP table that is generated for detailed information on individual SNPs and InDels.
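If you export the SNP table and the annotation coordinates, the same lookup that the track list gives you visually can be sketched in Python. This is only an illustration: the function name is hypothetical, and the feature names and coordinates in the usage note are made up, not taken from the tomato chloroplast annotation.

```python
def annotate_positions(positions, features):
    """Map each SNP position to the names of the features it falls in.

    features is a list of (start, end, name) tuples with 1-based,
    inclusive coordinates. Positions outside every feature are
    labeled "intergenic".
    """
    annotated = {}
    for pos in positions:
        hits = [name for start, end, name in features
                if start <= pos <= end]
        annotated[pos] = hits if hits else ["intergenic"]
    return annotated
```

For instance, with placeholder features `[(100, 500, "geneA"), (450, 900, "geneB")]`, position 475 would be reported in both features and position 950 as intergenic.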
6. Repeat for other genomes
There are more tools that we could use to look at differences among these genomes, but because they are so recently diverged we are less likely to find chromosomal breakpoints and large-scale rearrangements. For your report, focus on the number and location of SNPs detected, their relationship to coding or noncoding features, and whether variable regions are consistent across genomes.
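One simple way to check whether variable sites are consistent across genomes is to count how many samples share each variant position. The Python sketch below assumes you have exported the variant positions for each sample; the function name and the sample labels in the usage note are hypothetical.

```python
from collections import Counter

def shared_variant_sites(calls_by_sample):
    """Group variant positions by how many samples share them.

    calls_by_sample maps a sample label to an iterable of variant
    positions on the same reference coordinate system.
    """
    counts = Counter()
    for sites in calls_by_sample.values():
        # Use a set so a position counts once per sample.
        counts.update(set(sites))
    n_samples = len(calls_by_sample)
    return {
        "all": sorted(p for p, c in counts.items() if c == n_samples),
        "some": sorted(p for p, c in counts.items() if 1 < c < n_samples),
        "unique": sorted(p for p, c in counts.items() if c == 1),
    }
```

Called on three samples such as `{"s1": [10, 20, 30], "s2": [10, 20], "s3": [10, 40]}`, this would report position 10 as shared by all, 20 as shared by some, and 30 and 40 as unique to one sample.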