BINF 6203: Gene expression read quantitation
In this lab and the next, we are going to use two different methods to calculate differential expression for the same RNA-Seq dataset. In a nutshell, we have measured gene expression under two conditions (two replicates each) and we want to find out which genes are the most significantly differentially expressed between the two conditions. We are concerned with two things: the magnitude of differential expression, and its statistical significance. We want to see only genes with both high significance (a p-value below 0.05, ideally after multiple-testing correction) and a large effect size (a 2-fold change is a common threshold).
Transcript data is handled differently depending on whether or not you have a reference genome to which to map the samples. If you do, you use a pipeline like:
bowtie2 (mapping) -> featurecounts (what it says) -> edgeR (normalization and differential analysis)
If you have no reference genome, you use a pipeline like:
Trinity (transcript assembly) -> Corset (clustering and counting) -> edgeR (normalization and differential analysis).
The data set that we have is one that was generated by our group in collaboration with Dr. Oliver’s group in Biology. The first paper about this data set was published here. The pipeline you will use here is based on this work.
We will contrast the gene expression of Vibrio vulnificus CMCP6 in two conditions: artificial seawater (ASW) and human serum (HS). I have pre-trimmed the data for you, and for each sample there is an interleaved PE file and an SE file. The files can be found in Dropbox. The full data set is also available in the NCBI Sequence Read Archive.
QC
Ordinarily, your first step would be to use Trimmomatic to clean up your raw reads; we saw how to do this back in Lab 2. I've already done this for you, though, using a similar set of thresholds to the ones we used for the Illumina genome data for assembly.
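For reference, a Trimmomatic PE command along these lines would do the job. The thresholds shown here are the stock examples from the Trimmomatic manual, not necessarily the exact values I used, and the filenames are placeholders:
java -jar trimmomatic-0.32.jar PE -phred33 \
  sample_R1.fastq sample_R2.fastq \
  sample_1P.fastq sample_1U.fastq sample_2P.fastq sample_2U.fastq \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36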
Read mapping
The next step is to use bowtie2 to map each of the four samples to the reference genome of V. vulnificus CMCP6. We learned how to do this in Lab 4. I have included the reference genome fasta (and the matching GTF annotation file from EMBL, which featurecounts will need to create per-gene read counts) in the Dropbox.
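If you want a concrete starting point, something like the following would work. The trimmed-read filenames are my placeholders, so adjust them to match the Dropbox files; note also that --interleaved requires a reasonably recent bowtie2, and with an older version you would split the interleaved file into separate _1/_2 files first:
bowtie2-build Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.dna.genome.fa CMCP6
bowtie2 -x CMCP6 --interleaved CMCP6_1ASW_PE.fastq -S CMCP6_1ASW.sam
Repeat the mapping command for the other three samples.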
If you’re curious and have the time, you could also try doing this with the subread aligner in the subread package, and even check to see if using subread vs. bowtie makes a difference to your results. Or work with a friend and one of you do it one way and one do it the other way and compare.
Read counting
Once you have your mapped reads, you'll need to combine your output SAM or BAM file with the feature locations in the GTF annotation to produce a file containing counts per gene. To do this, we're going to use the featurecounts program from the subread package. This package can be installed either as an R package or as a command-line program; I have had the best luck with the command-line version. You should be able to install it on your machine as you have some of the other tools this semester.
Once subread is installed, you will only need to use the featurecounts program (unless you're testing the subread aligner, see above). You absolutely have to have a genome file and a GTF file with matching sequence names, or featurecounts is going to give you empty results; a quick way to check this is shown after the file list below. While NCBI provides bacterial genomes as separate chromosome FASTA files, EMBL has the option of downloading them as a concatenated FASTA with the corresponding annotation contained in a single GTF. The two files you need from the Dropbox are:
Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.dna.genome.fa
Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf
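Since mismatched sequence names are the usual cause of empty counts, it's worth a quick sanity check that the FASTA headers and the first column of the GTF use the same names:
grep '^>' Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.dna.genome.fa
grep -v '^#' Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf | cut -f1 | sort -u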
Featurecounts will automatically detect whether you have a SAM or a BAM file. A basic command to summarize the content of a single BAM (note that the binary itself is spelled featureCounts, with a capital C) is:
featureCounts -a Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf -o CMCP6_1ASW.counts CMCP6_1ASW.bam
If you want to summarize BAM files for all four samples at once, which gives you one file that is easier to convert into an R data frame than four separate files, use:
featureCounts -a Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf -o CMCP6.counts CMCP6_1ASW.bam CMCP6_2ASW.bam CMCP6_1HS.bam CMCP6_2HS.bam
As featurecounts runs, it prints a progress report and finishes with a summary of how many alignments were assigned to features.
When bowtie runs, it tells you how many reads map to the genome. When featurecounts runs, it tells you how many OF THOSE MAPPED READS map to features, based on some parameters you specify. Even when a read maps to the genome it may not necessarily fall inside an annotated feature. This can be useful information.
There are many ways to refine the output of featurecounts. For instance, you could count the results in terms of fragments rather than reads: every PAIR of reads then counts as one template, i.e. one piece of captured transcript, which also makes sense for this data set. You could also decide to count a pair only if both of its reads map to the same chromosome (sensible). Take a look at the featurecounts options and choose the ones that make sense based on what you know of this data from the published paper. Featurecounts runs fast, so you can take different approaches and compare the results fairly easily.
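As a concrete starting point, here is a sketch of a fragment-level run using three of featurecounts' documented paired-end options: -p counts fragments instead of reads, -B counts a fragment only if both ends are mapped, and -C drops chimeric fragments whose ends land on different chromosomes.
featureCounts -p -B -C -a Vibrio_vulnificus_cmcp6.GCA_000039765.1.23.gtf -o CMCP6_fragments.counts CMCP6_1ASW.bam CMCP6_2ASW.bam CMCP6_1HS.bam CMCP6_2HS.bam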
Counting when there’s no reference
What if we don't have a reference genome? Let's just pretend that this data set lacks reference data. How will we determine where the reads belong, and provide some kind of frame of reference to map them to, in order to get counts and make comparisons between the conditions? There are several programs that provide reference-free assembly of transcriptomes for this purpose.
We'll assemble the data in Trinity, and once the transcriptome is assembled, map the full data sets back to it to get read counts. This pretty closely follows the Corset authors' tutorial. Both of these programs should have been pre-installed on the lab machines by tech support, but as usual you may need to install them yourselves.
Install Trinity
- Download Trinity
- Move the archive into your ~/Applications directory and expand
- Type “make”
That should do it. I’ve installed this on a workstation in 104 without problems.
Install Corset
- If you have not already done so, get samtools and expand the archive in your ~/Applications directory. You don’t need to install this but you need to give this directory as input to the Corset install so it can use the sam.h header file.
- Download Corset
- Move the archive into your ~/Applications directory and expand
- Type “./configure --with-bam_inc=/Users/$YOURNAME/Applications/samtools-0.1.19”
- Type “make”
- Type “make install”. If your permissions do not allow you to install Corset in /usr/local, just run it by giving the path to the application in the Corset directory itself.
Assemble pooled sequences with Trinity
What is Trinity doing? In a nutshell: unlike the previous pipeline, Trinity is an assembly-first (de novo) approach to building the transcriptome. It has three components: Inchworm, which uses a greedy algorithm to rapidly assemble contigs; Chrysalis, which builds de Bruijn graphs for the contigs; and Butterfly, which resolves alternatively spliced and paralogous sequences that may have been collapsed into greedily-assembled contigs. At the end of the run, you'll be left with, among other things, a file of FASTA contig sequences that can be treated as a reference.
I made an alias for Trinity in my ~/.bash_profile, so I don’t have to put the full path of the command in my script. The alias is:
alias trinity='/usr/local/Cellar/trinity/r20131110/Trinity.pl'
The command that needs to go in my script with this alias active is just:
trinity --seqType fq --JM 10G --single mergedsequences.fastq
What should your merged input be? Well, you want a de novo assembled transcriptome to which all of your samples could theoretically map, so in theory you should combine the reads from all four of your samples and assemble the combined data set. In practice you might want to use a program like seqtk to produce a sub-sample of each sample, and combine those instead. Trinity does offer an in silico normalization step to reduce redundancy in the dataset, but it will still run very slowly on an overly large set of reads.
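A sketch of the sub-sampling approach with seqtk follows; the sample size is an arbitrary example, and you should use the same seed (-s) for every file so that paired reads stay in sync across files:
seqtk sample -s100 CMCP6_1ASW.fastq 200000 > sub_CMCP6_1ASW.fastq
# repeat for the other three samples, then pool the sub-samples:
cat sub_*.fastq > mergedsequences.fastq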
Build a Bowtie2 index from the assembled contigs
This should be a familiar command by now. Trinity leaves you with a standard set of files in a directory called trinity_out_dir unless you specify otherwise or rename it. Your contigs will be in a file called Trinity.fasta, and you could just call your index Trinity if you don't have more than one run to keep track of.
bowtie2-build Trinity.fasta Trinity
Map reads to assembled transcriptome reference
bowtie2 -x Trinity -U CMCP6_1ASW.fastq -S CMCP6_1ASW.sam
bowtie2 -x Trinity -U CMCP6_2ASW.fastq -S CMCP6_2ASW.sam
etc. You should have one mapping for each of the four samples.
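One detail worth checking against the Corset documentation: Corset clusters contigs partly on the strength of reads that map to more than one contig, so the authors' tutorial runs the mapping with all alignments reported rather than just the best one. If you want to follow that advice, the variant looks like:
bowtie2 --all -x Trinity -U CMCP6_1ASW.fastq -S CMCP6_1ASW.sam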
Cluster mapped reads and contigs with Corset
So, if Trinity is producing the reference transcriptome assembly and Bowtie2 is doing the read mapping, what is Corset doing? In a nutshell, again — transcriptome assembly produces contigs, but each contig may not uniquely represent a single transcript. Corset considers multiple samples and conditions, and clusters contigs and their associated reads to ensure that when mapped reads are counted up, apples are being compared to apples. It is somewhat analogous to the idea of clustering different sequence forms of genes and their associated ESTs to make UniGenes.
Corset runs on *.bam files, so before starting your run you need to convert your bowtie2 SAM output using samtools.
samtools view -S -b CMCP6_1ASW.sam > CMCP6_1ASW.bam
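You need one BAM per sample, so a small loop saves some typing:
for s in CMCP6_1ASW CMCP6_2ASW CMCP6_1HS CMCP6_2HS; do
  samtools view -S -b ${s}.sam > ${s}.bam
done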
The Corset command line takes a list of groups (e.g. ASW vs. HS) and a list of filenames belonging to each group. So if you had just one replicate in each of two groups, the command would be:
corset -g 1,2 -n ASW,HS CMCP6_1ASW.bam...etc.
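For the actual four-sample design in this lab, a sketch would look like the following (the -n sample names are my own labels, so substitute whatever you prefer):
corset -g 1,1,2,2 -n ASW1,ASW2,HS1,HS2 CMCP6_1ASW.bam CMCP6_2ASW.bam CMCP6_1HS.bam CMCP6_2HS.bam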
Differential expression analysis
Once you have obtained your feature counts by each of these two methods, you'll need to calculate differential expression between the two conditions. We'd like to do this using the R package edgeR. We'll get into how to do that next week.
Alternate approach: Tuxedo pipeline in Galaxy
If you don’t want to fool around with command line programs and RStudio (or you just want to try something different) I have an older tutorial about how to run the Tuxedo pipeline to get differential expression values in the Galaxy online workflow system. If you are hunting for some extra credit, you can try that pipeline out alongside these methods, and compare your top differentially expressed genes side by side.