BINF 6215: Trinity and Corset at the command line
This tutorial is my version of the workflow for analysis of the Synechocystis PCC6803 gene expression data using Trinity and Corset. Disclaimer: walking through the workflow shows that there is plenty to be skeptical about in this dataset or at least in the stat and exp samples. So we should definitely talk about what to look for.
Define the problem
To start off with, let’s just pretend that this Synechocystis data set has no reference genome. How will we determine where the reads belong and provide some kind of frame of reference to map them to, in order to get counts and make comparison between the conditions? There are several programs that provide reference-free assembly of transcriptomes for this purpose.
We are going to use digital normalization in our example to prepare the data for assembly in Trinity, and then once the transcriptome is assembled, we’ll map the full data sets back to the transcriptome to get read counts. Digital normalization flattens out the data in a way that is helpful for assembly, but not so helpful if you’re looking to count abundance of reads at different transcriptome locations.
Functionally, fastq-dump converts the reads to the correct format, Trimmomatic performs standard quality trims, khmer and seqtk are different methods for reducing the size of your data if needed, Trinity assembles the reads into a transcriptome FASTA file, bowtie creates an index out of the transcriptome and maps the reads against it, and Corset clusters contigs that should be categorized together, and converts mapped reads into read counts for differential analysis.
I’ve precomputed the khmer normalizations on your data just to save time. They’re in the Dropbox along with the full-sized fastq files. fastq-dump, Trimmomatic and Bowtie2 are already installed on your computer. Trinity may have been installed, via homebrew, in /usr/local/Cellar, but I was also easily able to compile an up-to-date version in my own Applications folder. Likewise with Corset, which just needed a bit of help finding the samtools headers that it needs to compile. There is a differential analysis step at the end that you can do outside your workflow and in EdgeR; the Bioconductor tools can be installed in your copy of RStudio. Again this pretty closely follows the Corset authors’ tutorial and they also show you how to do something fairly similar in bpipe (although without the data normalization step).
- Download Trinity
- Move the archive into your ~/Applications directory and expand
- Type “make”
That should do it. I’ve installed this on my workstation in 104 without problems.
- Download samtools and expand the archive in your ~/Applications directory. You don’t need to install this but you need to give this directory as input to the Corset install so it can use the sam.h header file.
- Download Corset
- Move the archive into your ~/Applications directory and expand
- Type “./configure –with-bam_inc=/Users/$YOURNAME/Applications/samtools-0.1.19”
- Type “make”
- Type “make install”. If your permissions do not allow you to install Corset in /usr/local, just run it by giving the path to the application in the Corset directory itself.
Step 1 — convert SRA files into fastq files
You may have already done this conversion for a previous project, and converted fastq files are already in the dropbox, but let’s see how it would be done if it were part of the script. Since this data is single end reads, you will take each *.sra input and produce a single *.fastq output. This is assuming you’re starting with the data arranged as it is when you download the whole experiment, with each *.sra file in a similarly-named directory. If not you’ll need to get your filename variables differently, as we did in previous examples.
for f in *; do
echo "File -> $f"
Step 2 — trim fastq files
You’re already familiar with Trimmomatic’s command line from the V. vulnificus assembly walk-through. Since this data is single end reads, you will take a single *.fastq input and produce a single *.fastq output.
java -jar /Users/cgibas/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar SE -phred33 $f $b.trimmo.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Step 3 — normalize reads with khmer
The recommended strategy for digitally normalizing RNA-Seq data prior to transcriptome assembly is to do a single pass of normalization to a kmer coverage of 20:
normalize-by-median.py -k 20 -C 20 --n_tables 4 --min-tablesize 2e9 -o $b.khmer.fastq $f
For class purposes, the first three steps of the pipeline have been run for you, and you can simply pick up running and testing your pipeline from the step below. *.khmer.fastq files for each original fastq are in the dropbox.
Step 4 — pool khmer fastq files for assembly with Trinity
When assembling a transcriptome, it makes sense to pool your reads, because there will be transcripts that have fewer fragments to assemble from one condition than from others. Then the consensus assembly can be used as a reference for the individual samples. Pooling your files can simply be done at the UNIX command line,
Step 5 — assemble pooled sequences with Trinity
What is Trinity doing? In a nutshell: unlike CuffLinks, Trinity is an assembly-first (de novo) approach to assembling the transcriptome. It has three components — Inchworm, which uses a greedy algorithm to rapidly assemble contigs, Chrysalis, which builds deBruijn graphs for contigs, and Butterfly, which resolves alternative splicings and paralogous sequences which may contribute to greedily-assembled contigs. At the end of the run, you’ll be left with, among other things, a file of FASTA sequences for contigs which can be treated as a reference genome.
I made an alias for Trinity in my ~/.bash_profile, so I don’t have to put the full path of the command in my script. The alias is:
The command that needs to go in my script with this alias active is just:
trinity --seqType fq --JM 10G --single merged.khmer.fastq
For the purposes of preparing this exercise, I tested two merged inputs. One including just the stationary and exponential phase fastqs, and one contained normalized reads from all the conditions. The first took just about 50 minutes on my 16GB RAM MacBook Pro. The second is probably unfeasible.
Tip: to find out how long something takes to run, run it with the UNIX time utility, like so:
time trinity --seqType fq --JM 10G --single merged.khmer.fastq
When your run completes, you’ll get a message like:
The big data set took
Step 6 — Build a Bowtie2 index from the assembled contigs
This should be a familiar command by now. Trinity is going to leave you with a standard set of files in a Trinity-out-dir unless you specify otherwise or rename it. So your contigs will just be called Trinity.fasta and you could just call your index Trinity if you don’t have more than one run to keep track of.
bowtie2-build Trinity.fasta Trinity
Step 7 — Map reads to assembled transcriptome reference
This is where you want to switch to using the complete read sets in the analysis instead of just the digitally normalized data. So you built your assembly from, for example, stat_phase.khmer.fastq and exp_phase.khmer.fastq, but you’d map the reads in stat_phase.fastq and exp_phase.fastq to the assembly.
bowtie2 -x Trinity -U stat_phase.fastq -S stat_phase.sam
bowtie2 -x Trinity -U exp_phase.fastq -S exp_phase.sam
Step 8 — Cluster mapped reads and contigs with Corset
So, if Trinity is producing the reference transcriptome assembly and Bowtie2 is doing the read mapping, what is Corset doing? In a nutshell, again — transcriptome assembly produces contigs, but each contig may not uniquely represent a single transcript. Corset considers multiple samples and conditions, and clusters contigs and their associated reads to ensure that when mapped reads are counted up, apples are being compared to apples. It is somewhat analogous to the idea of clustering different sequence forms of genes and their associated ESTs to make UniGenes.
Corset runs on *.bam files, so prior to starting your run you need to convert them using samtools.
samtools view -S -b stat_phase.sam > stat_phase.bam
The Corset command line takes a list of groups (e.g. stationary vs. exponential) and a list of filenames belonging to each group. So if you had just one replicate of each of two groups,
corset -g 1,2 -n stat_phase,exp_phase *.bam
Of course that doesn’t really properly take advantage of the method. For learning purposes, you can create simulated replicates out of your data. Just use seqtk as described in the taming Illumina datasets tutorial. Produce three samples out of your data, using 20% (or even less, while you’re testing your script) for each sample. Be sure you use a different random seed on the seqtk command line each time so that your three samples are not identical.
~/Applications/seqtk/seqtk sample -s100 stat_phase.fastq 0.2 > stat_phaseS1.fastq
Step 9 — Complete analysis with EdgeR
On the Corset Wiki example page, the authors show you a code snippet which you can adapt to take your data into the EdgeR package for differential analysis. Once you make the Trinity/Corset workflow into a script, take your outputs one step further to get a DE gene list.
What’s next/what’s missing?