BINF 6203: Genome track math
You can think of genomic data formats in a few main categories:
- Sequence data (FASTA, FASTQ)
- Alignments (BLAST results, SAM and BAM files)
- Track data (genomic intervals — coordinate range tracks, WIG files — continuous value tracks)
- Tabular data (many kinds — interval data is a subset)
Genome browsers are designed to get different types of data together using a common reference system of genomic coordinates and display them together. Genome browsers are somewhat less good at integrating data so that you can query it, but there are command line tools and other types of systems that can help with that.
The purpose of today’s lab is to get you familiar with track types and experienced with executing a few common track math operations in the UCSC browser and with bedtools.
Basics of bedtools
Use the UCSC Table Browser to download a BED file containing gene coordinates from human chromosome 1. Most of you said you’d used the UCSC browser before but if you don’t know how to do this you can run through the Table Browser User Guide.
Here’s a real thing that you might have reason to do — find all the intersections between a variant call format file and some range of the organism’s reference genome. As an example, I have uploaded a file in variant call format to Moodle. That file contains all of the SNPs from my personal 23andme data. The file has been converted from its original 23andMe format to a VCF (variant call format) file. Most browsers can handle genes in BED format and variation data in VCF format so these are typical types of files that you might want to work with.
The VCF file looks like this — columns are chromosome, position, SNP identifier, base in reference genome, alternate base (i.e. in my genome), and the rest is details that you can look up in the VCF specification:
The BED file looks like this — chromosome, start coordinate, end coordinate are the three required fields, and then 9 optional fields of description. A BED contains some of the same information as the GFF3 files that you have seen before. The BED format is described here:
Bedtools is a little software utility that can do all sorts of different math between genome tracks. The basic syntax that you need to accomplish finding the list of VCFs that overlap a gene in the BED file you just downloaded is:
bedtools intersect -a vcffile -b bedfile > overlaps.vcf
What happens if you switch things up so that the -a file is the BED file and the -b file is the VCF?
Use your newly acquired bedtools skills along with basic UNIX commands like wc (word count) and grep to answer the following question: which human autosome has the highest per-base-pair representation of SNPs in the 23andMe assay?
bedtools intersect -b vcffile -a bedfile | grep chr1 | wc
would give you a count of lines/words/characters in the output of the bedtools intersect command, filtered so that only lines containing chr1 are counted, which is part of the information you’d need to answer the question above.
Part 3: Manipulating alignment data with bedtools
Last week you created SAM and BAM alignment files for the chloroplast genome data, viewed them in the IGV browser, and saved them as consensus files. There are other things that you can do with SAMs and BAMs to integrate them with track and genome data in bedtools.
For instance, you can calculate the coverage depth at each position in the reference genome:
bedtools genomecov -d -ibam BC30.bam -g NC_007898.fasta
This produces an output which could be graphed, but which is not, unfortunately, a recognized continuous value data format (like .wig)
And you can also calculate coverage for each feature in your reference genome:
bedtools coverage -s -abam BC30.bam -b NC_007898.gff > BC30.defcov
Which gives you an output that contains all the information for features in your GFF file that have any reads from your BAM file overlapping them. Bedtools adds four fields to the GFF feature at the end of each record:
- The number of features in A that overlapped (by at least one base pair) the B interval.
- The number of bases in B that had non-zero coverage from features in A.
- The length of the entry in B.
- The fraction of bases in B that had non-zero coverage from features in A.
With read mapping and overlaps, the positioning of the read is very important and that’s why we don’t default to simple tools like bedtools to get accurate counts of mapped reads, but it can give you a general idea of where reads are mapping.
Part 4: Problem solving with bedtools
Now that you have the hang of some basic bedtools operations, I want you to figure out how to do a couple of things on your own with the chloroplast data. You should have the sorted BAM file you created for one of your chloroplast samples in the last lab. You’ll also need NC_007898.gff, and you will need to make an NC_007898.genome file which contains a single line:
(two values separated by a tab instead of regular spaces)
I will give you hints about which bedtools commands you need to use for each problem. Figure out how to do each thing, and in your report show the command line(s) and the final output you produce. Explain the contents of the output.
Problem 1: get the coordinate intervals for 100 bases flanking each feature in the GFF.
Why learn how to do this? It is a super-common thing to want to get flanking DNA for features in a genome file. For instance, you could be looking for sequence upstream of a gene to create a primer, or you could be collecting all those sequence regions and clustering them to see if you can recognize a promoter sequence.
Hint: look at the bedtools flank command
Problem 2: for each feature in a file, find its nearest physical neighbors on the same strand (say, within 200 bp) in the chloroplast
Why would you want to do this? There are lots of reasons for wanting to characterize the gene neighborhood of each gene and establish neighbor relationships. For instance, in bacteria you might be interested in identifying potential operon neighbors.
Hint: look at the bedtools window command. Another hint: for clarity, you might want to filter your GFF a little bit, for instance to only look at features labeled “CDS” or “exon” — try:
cat *.gff | grep CDS > output.cds.gff
Problem 3: for each feature in the GFF, report only the features that have NO reads overlapping them in the BAM file
Why would you want to do this? To identify content of regions in the reference genome that aren’t covered by reads in the BAM. If working with RNA-seq, to quickly identify genes with no apparent expression. MANY more possibilities.
Hint: also check out bedtools window to solve this problem.
These are just a few of the things that bedtools can do. Keep it in mind for when you have queries that involve different tracks of data belonging to a common genomic reference.
Featured figure credit: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html