To get around the problems with the software installs in the lab (and the trouble some of you are having installing Corset on your own), here is another way to get a full DE gene list for your Vibrio vulnificus data.
I added the sampled (20%) fastqs to the Dropbox so you can use those.
You can analyze the data with Galaxy. You’ll need to create a user account there. The size of the sampled dataset is within the disk limits that you’ll have on the public Galaxy system. Upload the data by clicking the “Get Data” link in the left menu and then choosing “Upload File from your computer”. Drag and drop your sequence files into the resulting menu like so. Make sure that you identify them as fastq files and attach them to the Vibrio vulnificus CMCP6 reference genome.
Next, under “NGS: QC and Manipulation” you need to choose the FASTQ Groomer to get your quality scores into the right format for Galaxy. You can choose the “multiple files” option to run all of your datasets at once.
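What the Groomer is fixing is the quality-score encoding: older Illumina runs wrote Phred+64 qualities while Galaxy’s downstream tools expect Sanger Phred+33. As a rough sketch of the idea (this is a heuristic illustration, not the Groomer’s actual code), you can usually guess the offset from the ASCII range of the quality characters:

```python
# Heuristic sketch (not the FASTQ Groomer itself): guess a file's quality
# encoding from the ASCII range of its quality strings, assuming the two
# common offsets, Sanger (Phred+33) and old Illumina (Phred+64).

def guess_phred_offset(quality_lines):
    """Return 33 or 64 based on the character range of the quality strings."""
    lo = min(min(ord(c) for c in q) for q in quality_lines)
    hi = max(max(ord(c) for c in q) for q in quality_lines)
    if lo < 59:   # characters below ';' only occur in Phred+33 data
        return 33
    if hi > 74:   # characters above 'J' only occur in Phred+64 data
        return 64
    return 33     # ambiguous range; Sanger is the safer default

print(guess_phred_offset(["IIIIHHHG"]))  # -> 33 (Sanger-range qualities)
print(guess_phred_offset(["hhhggfee"]))  # -> 64 (old-Illumina range)
```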
Now that your fastqs are groomed, you can pull them right into TopHat. As I mentioned in class, TopHat is totally overkill for this dataset: it is built to interpret multiple transcript isoforms (a eukaryotic problem) rather than single-exon genes and possibly polycistronic transcripts (prokaryotic problems). But the literature searching I’ve done so far suggests that there aren’t nearly as many transcriptomic tools oriented to microbial transcripts as there are for eukaryotes (unless you’re working with metatranscriptomic data sets), so for now we’ll assume that TopHat gives us a reasonable interpretation of our single-exon transcripts even though it is not ideal. Over the summer I plan to do a more serious lit search and a tutorial on what’s new in prokaryotic transcriptomics, so check back.
Anyway, on to TopHat:
First you need a reference genome. Load the fasta, GFF3 and GTF files for V.v. CMCP6 into your Galaxy.
Then make a custom build. In the User menu in the top black bar, select custom build and use your FASTA file as the input. Name it something memorable, like vvcmcp6.
You’re going to run TopHat on the datasets that are the output of the FASTQ Groomer. TopHat and the other pipeline tools are in the NGS: RNA-Seq submenu within the big left-side tool menu. Each of these datasets is the “child” of one of your original paired-end files, and so you are going to have to maintain a mental “traceback” of which dataset is which. This is my least favorite aspect of the Galaxy interface (sorry Galaxy guys): there is no straightforward way to label the child datasets to help you keep track of anything except the numbers that Galaxy assigns. At the beginning of an analysis, this isn’t too hard to keep track of, but as you get further downstream from your original data it’s a bit of a PITA. In the example below, “Data 11” traces back to the read-1 (forward) file from the 1ASW sample, and “Data 12” traces back to the read-2 (reverse) file. You’ll want to run TopHat with Data 11 as the forward and Data 12 as the reverse.
You’ll need to run TopHat four times, making sure you pick the correctly paired datasets each time. The mean inner distance for this data set happens to be 400 and the standard deviation is 35.
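In case the “mean inner distance” setting is confusing: it is the gap between the two reads of a pair, i.e. the fragment length minus both read lengths. The 400 comes from this tutorial; the fragment and read lengths below are invented just to show the arithmetic:

```python
# TopHat's "mean inner distance" (-r) is fragment length minus both read
# lengths. The 400 is the value used in this tutorial; the 600 bp fragment
# and 100 bp reads below are illustrative assumptions, not this dataset's.

def mean_inner_distance(mean_fragment_len, read_len):
    """Inner distance for paired-end reads of equal length."""
    return mean_fragment_len - 2 * read_len

# e.g. a 600 bp mean fragment size with 100 bp paired-end reads:
print(mean_inner_distance(600, 100))  # -> 400
```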
The way I loaded my data, I had to pair up:
- Data 11 and Data 12 (1ASW) –> becomes Data 26: TopHat accepted hits (also produces four other files)
- Data 17 and Data 15 (2ASW) –> becomes Data 31
- Data 13 and Data 14 (1HS) –> becomes Data 36
- Data 18 and Data 16 (2HS) –> becomes Data 41
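If you’d rather not keep the traceback in your head, a little lookup table works. The dataset numbers here are the ones from my run above; yours will almost certainly differ:

```python
# Sample -> (forward dataset, reverse dataset, TopHat output), using the
# dataset numbers from my Galaxy history above (yours will differ).

pairs = {
    "1ASW": {"forward": "Data 11", "reverse": "Data 12", "tophat": "Data 26"},
    "2ASW": {"forward": "Data 17", "reverse": "Data 15", "tophat": "Data 31"},
    "1HS":  {"forward": "Data 13", "reverse": "Data 14", "tophat": "Data 36"},
    "2HS":  {"forward": "Data 18", "reverse": "Data 16", "tophat": "Data 41"},
}

for sample, d in pairs.items():
    print(f'{sample}: {d["forward"]} + {d["reverse"]} -> {d["tophat"]}')
```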
I did it exactly in that order so that my next-stage child outputs would be in a sensible order. If you click on the datasets produced, you can see which one is the BAM file and view what is in the other outputs.
If you inspect the other files, you’ll see that despite there not being the potential for splicing in this data set, TopHat is defining some splice junctions in the collected data. It would be interesting to dig into those and find out what scenarios TopHat is perceiving as splices in bacterial data. But we’re going to roll with it for now.
The next thing to do is to take your TopHat BAM files into Cuffdiff. That’s the differential expression analysis part of the Cufflinks package. You’ll use all four samples for this run. In this case I’m pairing the ASW samples (Data 26 and Data 31) and the HS samples (Data 36 and Data 41).
You can use classic FPKM (RPKM), geometric, or quartile normalization in Cuffdiff. Remember that quartile normalization has been shown to help compensate for GC bias in one of the studies I overviewed in lecture. There are several other options relating to how you handle multi-reads and whether you correct for other biases (bias correction requires an additional input that we don’t have, so say “no”).
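To make the quartile option a bit less mysterious, here is a minimal sketch of the idea behind upper-quartile normalization: each sample is scaled by the 75th percentile of its nonzero gene counts rather than by total library size, so a handful of very highly expressed genes can’t dominate the scaling. The toy counts are invented, and this is a simplification of what Cuffdiff actually does:

```python
# Minimal sketch of upper-quartile normalization (a simplified version of
# the quartile option in Cuffdiff). Toy counts are invented for illustration.

def upper_quartile_factor(counts):
    """75th percentile of the nonzero counts in one sample."""
    expressed = sorted(c for c in counts if c > 0)
    idx = int(0.75 * (len(expressed) - 1))  # nearest-rank percentile
    return expressed[idx]

sample = [0, 3, 10, 25, 100, 7, 0, 12]
factor = upper_quartile_factor(sample)        # 75th percentile of nonzeros
normalized = [c / factor for c in sample]     # scale the whole sample by it
print(factor)  # -> 12
```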
CuffDiff will give you several different outputs of DE summarized at the CDS, gene and transcript levels. You may want to use the CDS or gene DE results in this case since we should not have multiple isoforms from the same coding gene in this dataset. There is also a minimum number of reads that have to be mapped for a gene to show up in the analysis. (The default is 10). This eliminates very low-copy-number cases. The differential expression output tables look like this:
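The minimum-reads cutoff just means that low-coverage loci never enter the DE test at all. The effect, with invented counts, looks like:

```python
# Sketch of the effect of Cuffdiff's minimum-alignment threshold (default
# 10): loci with fewer mapped reads are excluded from the DE test entirely.
# Gene names and counts below are toy values.

MIN_READS = 10
counts = {"geneA": 3, "geneB": 150, "geneC": 9, "geneD": 42}

testable = {g: c for g, c in counts.items() if c >= MIN_READS}
print(sorted(testable))  # -> ['geneB', 'geneD']
```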
As you can see, fold change and p-value, the two major descriptors that will help you sort out what is significant, are present in the table. You could download the data and pull those values into R or even Excel and create a volcano plot of significance vs. effect. You could also sort the DE gene list first by fold change and within that subset by p-value, and pull the top DE gene ids into Ontologizer, as I have done in these older tutorials. (Note: YMMV, I have not revisited those lately to make sure they work with the newest versions.)
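The sort-and-pick step above can be sketched in a few lines. The gene IDs and values here are invented stand-ins for rows of a Cuffdiff gene_exp table; the -log10(p) column is what you’d put on the y-axis of a volcano plot:

```python
# Sketch of the sort described above: rank a toy DE table by absolute
# log2 fold change, break ties by p-value, and take the top gene IDs for
# a tool like Ontologizer. All values below are invented.

import math

rows = [  # (gene_id, log2_fold_change, p_value)
    ("VV1_0001", 3.2, 0.0004),
    ("VV1_0002", -2.8, 0.0010),
    ("VV1_0003", 0.4, 0.2000),
    ("VV1_0004", 3.2, 0.0001),
]

ranked = sorted(rows, key=lambda r: (-abs(r[1]), r[2]))
top_ids = [r[0] for r in ranked[:2]]

# y-axis values for a volcano plot: -log10(p) per gene
volcano_y = [-math.log10(p) for _, _, p in rows]

print(top_ids)  # -> ['VV1_0004', 'VV1_0001']
```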