BINF 6203: Expression Statistics

BINF 6203: Expression Statistics

Last week you used transcriptome and read mapping tools to get read counts for your Vibrio vulnificus expression data, with two different pipelines. Now let’s analyze the data and get the top differentially expressed genes.

Part A:  Assembly-first with Trinity and corset

This first part assumes that you were able to assemble the transcriptome using Trinity:

trinity --seqType fq --JM 10G --left CMCP6_all.pe.keep.fastq.1 --right CMCP6_all.pe.keep.fastq.2 --single CMCP6_all.se.keep.fastq --CPU 4

which should give you a trinity_out_dir and inside that directory a Trinity.fasta file containing transcript contigs. Then you’d have needed to build a bowtie2 index from the Trinity.fasta to map to:

bowtie2-build trinity_out_dir/Trinity.fasta CMCP6Trin

And then do your read mapping. I used a 20% subsample of my reads for purposes of this exercise, and each mapping took only about 15 minutes. For example:

bowtie2 -x CMCP6Trin -1 CMCP6reads-qc/CMCP6_1ASWsample.1.fq -2 CMCP6reads-qc/CMCP6_1ASWsample.2.fq > CMCP6_1ASWTrin.sam

And then I converted my SAMs to BAMs, because that’s what corset wants as input. Finally, I ran corset on my BAM files. There are two samples each in 2 groups:

corset -g 1,1,2,2 -n A1,A1,B1,B2 CMCP6_1ASWTrin.bam CMCP6_2ASWTrin.bam CMCP6_1HSTrin.bam CMCP6_2HSTrin.bam

Note: I am running the versions of these programs that I installed in early summer of 2014 so I’m running corset v1.01.

The counts file produced looks like:

corset1

Using edgeR to get top DE genes

Now in R we’ll use the edgeR package to get the top differentially expressed clusters. If you are uncertain about using R, start up R in the working directory where your counts.txt file is sitting. Just type “R” at the UNIX command line.  Then make sure the edgeR package runs by typing “library(edgeR)” at the R prompt.  Then type the exact sequence of R commands shown below to get your top differentially expressed genes.

corset2

You can see in the output the log fold change values and p-values which tell you if a gene is significantly differentially expressed.

There’s only one problem with this output. It’s all in terms of “Clusters”, not in terms of gene names or sequences we might know. So to figure out what these genes actually are, you’ll have to go to back to the source of the clusters. First, look in the corset Clusters.txt file — search for Cluster-209.366, for example.

In that file, Cluster-209.366 refers back to comp940_c3_seq1. So then you go back to your Trinity.fasta file, search for comp940_c3_seq1, and get its sequence:

TACATTGATGAACACAGCTTTGATTTGCAGCGCTACTTTGCCACGCTCTGTTTGGTCTAT
GGCAGCAATCCTGAGCGGCATCCCAACCTACTTGATGAAATAGAAAAAGAGTACCGCCAA
GAGCGCACTGAGTTTTGCCAAGCGAACTATCAAACCCTCAGTGACAACTGGCATCGCTAT
CTCACTAGCGCGCCTGAATAAACCTAACTAACCATAATCAATCGGATGCACCATTCATAG
TGTCAAAAAATGATGAGAAACAAAATAATAACAATCTCATCATTTTTATTTCATAAAAC

Run a nucleotide BLAST of that sequence against Vibrio vulnificus reference sequences at NCBI:

corset3

Find the sequence match:

corset4

And look at what gene is at that position in V. vulnificus CMCP6:

corset5

(Note: is this a process you can write a script to do programmatically? Absolutely. Are we going to learn to do that here today? No.) You’ll notice when you get your differentials from the mapping-first pipeline, they will be labeled with identifiers that start with VV*, so that you can match up and see if they correspond with the Top DE genes you got from Corset.

 Use edgeR to get Top DE genes from the featurecounts output

Your featurecounts output file looks a little different than the counts.txt file from corset. It has a few more columns. If your R-fu is mighty, you should be able to figure out how to use the columns that you want and not use the others. If your R-fu is not strong, you may wish to edit your featurecounts output so that it has one column with the gene IDs, four columns with the counts, and no other columns or extra header lines. You can even do this in Excel and write it back out as a tab-delimited file if you don’t know how to strip out extra rows and columns in UNIX. Once you get the file reshaped so it looks like the Corset output (or get your data.frame manipulated appropriately inside R) you can get Top DE genes from your mapping-first pipeline by following exactly the same sequence of commands you just did for the Corset output.

The big question is, are your Top DE genes the same from each pipeline? Do the top 10 lists bear any resemblance to each other at all?

Comments are closed.