Access and use genome track data
Genome browsers are designed to get different types of genome track data together using the common reference system of genomic coordinates. Often you’re more interested in manipulating whole sets of data, rather than just looking at one region at a time manually, so there are tools available both online and at the command line that will help you with genome track math.
The purpose of today’s lab is to get you familiar with genome browsers and track types, and experienced with executing a few common track math operations on the web or at the command line. There are questions below highlighted in red; rather than writing this assignment up like a lab report, you can simply execute the operations and answer the questions.
Part 1: Basics of the UCSC Table Browser
Use the UCSC Table Browser to download a BED file containing gene coordinates from human chromosome 1.
We can get the BED file from the Table Browser main page. The selections in the screenshot below get you human chromosome 1 from the most recent build, using the curated RefSeq version of the gene content. If you want the results to go to a file instead of the screen, put a filename in the last field on the page. Examine the menus to see what other choices you could make.
Here’s a real thing that you might have reason to do — find all the intersections between a variant call format file and some range of the organism’s reference genome. As an example, I have uploaded a file in variant call format to Canvas. That file contains all of the SNPs from my personal 23andme data. The file has been converted from its original 23andMe format to a VCF (variant call format) file.
The VCF file looks like this — columns are chromosome, position, SNP identifier, base in reference genome, alternate base (i.e. in my genome), and the rest is details that you can look up in the VCF specification:
The BED file looks like this — chromosome, start coordinate, end coordinate are the three required fields, and then 9 optional fields of description. A BED contains some of the same information as the GFF3 files that you have seen before. The BED format is described here:
Part 2: Load files into the IGV browser
IGV is a simple desktop browser that can be used to display many types of data jointly in one view. If you’re already familiar with the IGB browser from Dr. Loraine’s lab, feel free to use that one instead. Right now you will get to see a little bit of how it works with human data, but it can be made to visualize anything if you load the correct reference genome and appropriately formatted files. In a couple of weeks we will use the browser and bedtools to examine variant calls and gene expression data.
By default, IGV opens with the human genome reference loaded. Select chr1, and then try to load various datasets. Some suggestions? Go to the Table Browser and find:
- The most up to date common SNP (single nucleotide polymorphism) data from the Variation section
- The GWAS catalog — a catalog of SNPs that have been found through Genome Wide Association Studies — download it as a BED file
- A file that might give you information about transcription/expression — e.g. the Human mRNAs track
Experiment with loading in multiple tracks that map to Chromosome 1. Once you do the filtering operations below, you can also load and view the custom files that you’ve made with intersections between tracks. If you try to download some large tracks (like the “data points” track in WIG format for GC content) you may find that the number of rows you can download through the web interface is capped and you have to get the file a different way. Look in the Downloads section on the UCSC site to see what’s available to download there.
Part 3: Manipulating track data with bedtools
Bedtools is a little software utility that can do all sorts of different math between genome tracks. Bedtools is not computationally intensive, so you should be able to run it on your desktop rather than on the cluster. You can install it from homebrew with “brew install bedtools”; the documentation states that it’s also available through apt-get and other package managers, so use what works for your system to install it.
Bedtools is well documented on readthedocs and on the Quinlan lab site, so I am not going to spell every step out for you exactly. The basic syntax that you need to accomplish finding the list of VCFs that overlap a gene in the BED file you just downloaded is:
bedtools intersect -a vcffile -b bedfile > overlaps.output
What format is the output that you get?
What happens if you switch things up so that the -a file is the BED file and the -b file is the VCF?
Use your newly acquired Table Browser and bedtools skills along with basic UNIX commands like wc (word count) and grep to gather more data and answer the following questions:
Which human autosome has the highest per-base-pair representation of SNPs in the 23andMe assay?
Which SNPs in the 23andMe assay intersect with SNPs in the GWAS (Genome Wide Association Study) Catalog (in the Phenotype and Literature track category)? Do you think the number that you get from a simple intersection is correct? What happens when you include flanking sequence, either via bedtools window or by selecting different options at track download time?
bedtools intersect -b vcffile -a bedfile | grep chr1 | wc
would give you a count of lines/words/characters in the output of the bedtools intersect command, filtered so that only lines containing chr1 are counted, which is part of the information you’d need to answer the first question above.
wc is a simple tool that counts the lines, words and characters in a file. wc -l counts the lines, which will give you the number of snps in the vcf file (one entry per line), more or less.
grep, when combined with the UNIX regular expression language, is a very powerful data filtering and search tool that can pull just about every pattern of interest out of your files.
head and tail show you the beginning and end of UNIX files.
The pipe operator connects the output of one command to the input of the next.
So, if I wanted to search the first 20 lines of my file for the pattern ‘chr’ at the beginning of the line, I would type:
head -20 genome_Cynthia_Gibas_Full_20140719094157.vcf | grep '^chr'
Part 4: Manipulating alignment data with bedtools
Soon we’re going to learn to map read data to a reference genome and produce SAM and BAM alignment files. For today, I’ve made one for you for the chloroplast sequencing data that you’ve already used a couple of times for assembly and quality checking. There are many things that you can do with SAMs and BAMs to integrate them with track and genome data in bedtools.
For instance, you can calculate the coverage depth at each position in the reference genome:
bedtools genomecov -d -ibam BC30.bam -g NC_007898.fasta
This produces an output which could be graphed, but which is not, unfortunately, a recognized continuous value data format (like .wig)
And you can also calculate coverage for each feature in your reference genome:
bedtools coverage -s -abam BC30.bam -b NC_007898.gff > BC30.defcov
Which gives you an output that contains all the information for features in your GFF file that have any reads from your BAM file overlapping them. Bedtools adds four fields to the GFF feature at the end of each record:
- The number of features in A that overlapped (by at least one base pair) the B interval.
- The number of bases in B that had non-zero coverage from features in A.
- The length of the entry in B.
- The fraction of bases in B that had non-zero coverage from features in A.
With read mapping and overlaps, the positioning of the read is very important and that’s why we don’t default to simple tools like bedtools to get accurate counts of mapped reads, but it can give you a general idea of where reads are mapping.
Part 5: Problem solving with bedtools
Now that you have the hang of some basic bedtools operations, I want you to figure out how to do a couple of things on your own. Return to the questions about SNPs above, and then look at the next three questions about the chloroplast data. There is a BAM file for the BC30 sample in the Dropbox. You’ll also need NC_007898.gff, and you will need to make an NC_007898.genome file which contains a single line:
(two values separated by a tab instead of regular spaces). Do this in a text editor like vi or BBEdit.
I will give you hints about which bedtools commands you need to use for each problem. Figure out how to do each thing, and in your report show the command line(s) and the final output you produce. Identify the differences that the command produces between the input and the output by examining the files side by side.
Problem 1: get the coordinate intervals for 100 bases flanking each feature in the GFF.
Why learn how to do this? It is a super-common thing to want to get flanking DNA for features in a genome file. For instance, you could be looking for sequence upstream of a gene to create a primer, or you could be collecting all those sequence regions and clustering them to see if you can recognize a promoter sequence.
Hint: look at the bedtools flank command
Problem 2: for each feature in a file, find its nearest physical neighbors on the same strand (say, within 200 bp) in the chloroplast
Why would you want to do this? There are lots of reasons for wanting to characterize the gene neighborhood of each gene and establish neighbor relationships. For instance, in bacteria you might be interested in identifying potential operon neighbors.
Hint: look at the bedtools window command. Another hint: for clarity, you might want to filter your GFF a little bit, for instance to only look at features labeled “CDS” or “exon” — try:
cat *.gff | grep CDS > output.cds.gff
Problem 3: for each feature in the GFF, report only the features that have NO reads overlapping them in the BAM file
Why would you want to do this? To identify content of regions in the reference genome that aren’t covered by reads in the BAM. If working with RNA-seq, to quickly identify genes with no apparent expression. MANY more possibilities.
Hint: also check out bedtools window to solve this problem.
These are just a few of the things that bedtools can do. Keep it in mind for when you have queries that involve different tracks of data belonging to a common genomic reference.
Featured figure credit: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html