BINF 6203: Analysis of gene expression by RNA-seq (Part 1)
The next exercise for this year is pretty straightforward. We’re going to download RNA-Seq data for a bacterial genome, map it, quantitate it, and compare gene expression between two conditions. As usual, there’s a paper to read. And you have to go back to something that was introduced earlier in the course and remember how to get a study from the Short Read Archive.
In the paper, you’ll see that the organism of interest is Porphyromonas gingivalis. Several sequenced and finished strains are available of which one is the strain used in the study. The basis of the gene expression comparison is between a wild type strain and several mutant strains. You should compare expression between the wild type and the LuxS knockout strain, and see how closely you can replicate the results presented in the paper. There are three replicates of each strain, but if disk space is limiting, you can just compare one from each strain of interest.
Step 1. Import your reference genome sequence from NCBI.
Step 2. Download your sequence data from the Short Read Archive.
Step 3. Convert your sequence data into a form that CLC Genomics can import.
Warning: this is going to take a while, and because it’s ABI SOLID data the files are huge. So, I did it for you if you want to jump ahead to Step 4. If you have the space on your machine and want the experience of using the SRA toolkit you can do this part. The data set is large. You can convert these files to *.fastq format using the fastq-dump program in the SRA Toolkit. The simple syntax is just:
Note: if you are working on a Mac where you have control of things, you can get the Homebrew package manager installed and then just type
brew tap homebrew/science brew install sratoolkit
To get a working installation of the sra toolkit on your machine.
Step 4. Import your sequence to CLC.
OK, it took so long to prep this because I decided to do the tedious part for you. I did everything above this part for you for two replicates each from wild type and LuxS-, and wrote the files out in gzipped CLC format. They are available in the Dropbox. You can load those files rather than waiting for fastq-dump and CLC to do all the pre-processing. I’m sorry they’re still fairly large files, but such is the nature of the data.
Step 5. Trim your reads.
These are SOLID reads; the expectation is that they will be around 50 nt. You can be rather brutal in trimming them for the purpose of the exercise — I suggest discarding reads below 30 nt in length (because in the sample I examined closely, they were significantly overrepresented in the unmapped reads) and above 60. You could justify different, less conservative trim parameters on the bottom end if you really wanted to try to map the short scraps.
Step 6. Map your reads.
Toolbox > Transcriptomics Analysis > RNASeq Analysis
This is going to take a while. And you should definitely do it on the computers in the front two rows where you have access to more memory. Before you start the RNASeq analysis, you need to convert your reference genome to tracks. You will use the Gene track information during the mapping process.
First select the files that you want to map:
BE SURE YOU SELECT THE BATCH MODE BUTTON. Otherwise you will be sad when CLC combines all your samples. Then choose your reference track. Because this is a prokaryote and doesn’t have a separate mRNA track you can just use the Gene track:
Then choose your alignment policy. If you don’t have any specific ideas about why this data is weird, you can use the defaults:
Then decide how you want your expression values to be formulated. You could use raw counts, or RPKM. Discuss what you choose to do and why:
Then select where you want the results to go. Save them so you don’t have to do a separate “save” operation:
More in the next post, this one’s getting huge.