We’re not quite ready to launch into full-on python scripting just yet — so here’s a little add-on to your final bash project, to help make use of the output files you generate.
Why it’s important? Information about genomes is organized in “tracks”. The genome itself is kind of like a ruler, a straight continuous piece of sequence with linear coordinates. The position of different kinds of features on that “ruler” is contained in files like GFF files (annotations of genes) and VCF files (locations of genetic variants). You can look for overlaps between different kinds of feature sets to find information to questions that may be valuable, like which genes contain the most variant sites in the chloroplast genome? You can evaluate this visually using a genome browser like IGV, but when you want precise answers you’ll need to use a tool like bedtools.
What you’ll turn in:
- A script that uses bedtools to extract a list of the overlaps of SNPs with genes for a chloroplast VCF file. If you’re already done with your VCF calling script, you can use your results from that as input to this next step.
- A short report (half page, single space) describing your results — that’s right, you need to decipher the output a little. Once you filter for those overlaps that have a gbkey=Gene, take a look at the following field — the “name” field. That is a gene name that should lead you to some information about the function of a gene if you search a reputable source. Try WikiGene to find some basic information about the most frequently occurring genes in your results.
Last week you learned how to make variant call files (VCF files) from your chloroplast genome sequence data. This week, we’ll add on a way to get a little more information about the meaning of these differences. The package we’re going to use is bedtools. Bedtools implements set math for sequences. It has a whole lot of functions, but the ones that you’ll start working with are subtract and intersect.
Bedtools subtract subtracts one list of segments from the other. If you’re looking at two VCF files, you’ll get a list of SNPs that are in A and not in B. In the example below, “A” is the variants from chloroplast data set 17, while “B” is a common snp set (produced using bedtools intersect) between BC17 and BC19.
bedtools subtract -a BC17_BINF6350_Summer2014_13pm.vcf -b common_snps_BC17_BC19.vcf > unique_BC17.vcf
Bedtools intersect gets the parts that are common between two lists of features. So the first command here gets the list of common SNPs between two VCF files:
bedtools intersect -a BC17_BINF6350_Summer2014_13pm.vcf -b BC19_BINF6350_Summer2014_13pm.vcf > common_snps_BC17_BC19.vcf
The second example gets the list of features in a GFF (A) that overlap with SNPs in the VCF file (B). If you reversed the -a and -b flags the output would be quite different.
bedtools intersect -b BC17_BINF6350_Summer2014_13pm.vcf -a NC_007898.gff > snpgenes.gff
Running this list through a UNIX grep filter gets only the overlaps of SNPs and features labeled “gene” in the resulting file. Recollect how many genes are in your chloroplast reference genome and see how much this reduces the list.
bedtools intersect -b BC17_BINF6350_Summer2014_13pm.vcf -a NC_007898.gff | grep gbkey=Gene > snpgenes.gff
Multiple bedtools operations can be piped together, and there are a wide variety of other bedtools commands for multi-way comparisons, getting flanking sequences, etc. You can finish the assigned script with just “intersect” though. We’ll learn to do more with bedtools later as we start working with personal genomics data.