BINF 6203: Expression quantitation with featurecounts
In this lab and the next, we are going to use two different methods to calculate differential expression for the same RNASeq dataset. Note: there is no standalone lab writeup due for the featurecounts part. Instead you’ll do one writeup for this part plus the differential expression analysis with DeSeq.
In a nutshell, we have measured gene expression under two conditions (two replicates each) and we want to find out which genes are the most significantly differentially expressed between the two conditions. We are concerned with two things — magnitude of differential expression, and significance of differential expression. We want to see only genes with both high significance (p value definitely less than 0.05) and large effect size (2-fold change is a common threshold).
Transcript data is handled differently depending on whether or not you have a reference genome to which to map the samples. If you do, you use a pipeline like:
bowtie2 (mapping) –> featurecounts (what it says) –> edgeR (normalization and differential analysis)
If you have no reference genome, you use a pipeline like:
Trinity (transcript assembly) — corset (clustering and counting) –> edgeR (normalization and differential analysis).
The data set that we have is one that was generated by our group in collaboration with Dr. Oliver’s group in Biology. The first paper about this data set was published here. The pipeline you will use here is based on this work.
We will contrast the gene expression of Vibrio vulnificus CMCP6 in two conditions — artificial seawater and human serum. I have pre-trimmed the data for you and for each sample there is an interleaved PE (both reads in the same file) and an SE file. The files can be found in Dropbox. The full data set is also available in the NCBI short read archive.
The first step that you would need to do is to use Trimmomatic to clean up your paired reads. We saw how to do this back in Lab 2. I’ve done this for you, though. I used a similar set of thresholds as we did for the Illumina genome data for assembly.
The next step is to use bowtie2 to map each of the four samples to the reference genome of V. vulnificus CMCP6. We learned how to do this in Lab 4. I have included the reference genome fasta (and the matching GTF annotation file from EMBL, which featurecounts will need to create per-gene read counts) in the Dropbox.
If you’re curious and have the time, you could also try doing this with the subread aligner in the subread package, and even check to see if using subread vs. bowtie makes a difference to your results. Or work with a friend and one of you do it one way and one do it the other way and compare.
Once you have your mapped reads, you’ll need to combine your output SAM or BAM file with the genome sequence and the GTF file information about feature locations, to produce a file containing counts per gene. To do this, we’re going to use the featurecounts program from the subread package. This package can be installed either as an R package or as a command-line program. I have had the best luck using the command line version. You should be able to install this on your machine as you have for some of the other tools this semester.
Once subread is installed, you will only need to use the featurecounts program (unless you’re trying out the subread aligner, see above). You absolutely have to have a matching genome file and GTF file for this or featurecounts is going to give you empty results. While NCBI provides bacterial genomes as separate chromosome FASTA files, EMBL has the option of downloading them as concatenated FASTAs with the corresponding annotation contained in a single GTF. The files you need from the Dropbox are:
Featurecounts will automatically detect whether you have a SAM or a BAM file. A basic featurecounts command to summarize the content of a single BAM is:
featurecounts -a Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf -o CMCP6_1ASW.counts CMCP6_1ASW.bam
If you wanted to summarize BAM files for all four samples at once, which will give you one file that is easier to convert into an R data frame than four separate files, you would use the command:
featurecounts -a Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf -o CMCP6.counts CMCP6_1ASW.bam CMCP6_2ASW.bam CMCP6_1HS.bam CMCP6_2HS.bam
As featurecounts is running it will produce output that looks like this:
When bowtie runs, it tells you how many reads map to the genome. When featurecounts runs, it tells you how many OF THOSE MAPPED READS map to features, based on some parameters you specify. Even when a read maps to the genome it may not necessarily fall inside an annotated feature. This can be useful information.
There are many ways to refine the output of featurecounts. For instance, you could count the results in terms of fragments rather than just in terms of reads. That means you’re counting every PAIR of reads as representing one template i.e. one piece of captured transcript (which would also make sense for this data set). You could also decide to count reads only if both reads of a pair map to the same chromosome (sensible). Take a look at the featurecounts options and choose those that you think would make sense based on what you know of this data from the published paper. Featurecounts runs fast, so you can take different approaches and compare the results fairly easily.