Galaxy NGS 101: Synechocystis remix
I made a previous version of this tutorial with some actual expression data from our lab. This version uses an already-normalized set of single-end transcriptome data from Synechocystis PCC 6803. (Normalized data here). Trimming and normalization was done using Trimmomatic and khmer (single-pass), and has been discussed in other posts. The dataset size is more manageable on Galaxy Main and on our lab computers than the V. vulnificus data.
Based on conversation with Titus Brown, the normalization procedure used is likely to make this RNASeq set better for transcript assembly, but possibly unreliable for differential expression analysis, so take resulting values with a large grain of salt.
This tutorial can be done either in your own Galaxy, if you have sufficient system resources, or on Galaxy Main. In your own Galaxy, you’ll have to load some of the tools from the Tool Shed. They all work.
Galaxy data formats
You can think of Galaxy’s data formats in a few main categories:
- Sequence data (FASTA, FASTQ)
- Alignments (BLAST results, SAM and BAM files)
- Track data (genomic intervals, WIG files — continuous value tracks)
- Tabular data (many kinds — interval data is a subset)
The key thing is, though, if your data is not in one of the accepted formats you will not be able to proceed. There’s no winging it with data that’s slightly off. If you import a FASTQ file, before you can do anything else, you have to “groom” it into one particular FASTQ format, etc.
Basics of handling NGS data (Illumina example)
First, click on the cog icon at the top of the Galaxy history panel and create a new history. I named mine Synechocystis 6803 expression. This history will contain only the files relevant to this exercise so you don’t have to scroll through your old history.
Getting reference genome data
The Galaxy video tutorial on this goes through the mechanics of loading a FASTA file, but it doesn’t make entirely clear what you have to do to have complete information for your reference genome.
You’ll need a FASTA nucleotide file as well as a GFF format file containing the genomic annotations. Galaxy won’t allow you to load GenBank sequence files, which carry the annotation information along with the sequence. It will only allow bare FASTA files and you have to make a special tweak to the standard header line to get them to work properly. However, some of the specific pipelines (like the Tuxedo pipeline) that you can use in Galaxy will allow you to add a GFF format annotation file, and you can also visualize GFFs in Trackster.
To get both a FASTA nucleotide file and a GFF file for your genome, go to:
Files are labeled with their RefSeq identifiers. The Synechocystis PCC 6803 chromosome is RefSeq NC_000911. Get the files for this chromosome with file extensions *.fna (FASTA nucleotide) and*.gff (Gene feature format). The remaining sequences in the directory are four plasmid sequences that Synechocystis 6803 also carries, and being bacterial plasmids, they are likely to have some expressed genes. If you want to be completist, you can download the plasmid sequences and concatenate them onto the main chromosome sequence in a multi-fasta as described in the V. vulnificus version of this tutorial, but it’s not strictly necessary for learning purposes.
Create a “custom build”
You need to make a small edit in your FASTA file before uploading it and this is much easier done with vi than in Galaxy. The current header line in your fresh-from-Genbank genome file will look something like this:
>gi|16329170|ref|NC_000911.1| Synechocystis sp. PCC 6803 chromosome, complete genome
Galaxy needs it to look like this. With exactly no leading or trailing spaces.
Load the modified fasta file into Galaxy.
Next, go up to the User tab at the top of your Galaxy and select “Custom Builds”. Create a custom build for the Synechocystis chromosome.
Then load the *.gff file into your Galaxy. Be sure to select your newly created Custom Build as the genome to associate this data with.
Select the custom build as your genome for everything you upload for the rest of this exercise, in fact. Now you can visualize your data attached to the genome in a linear browser view using Galaxy’s Trackster, or treat it as a table in the Galaxy analysis interface.
Visualize your data
Once you have your custom build set up, you can use the Trackster application to visualize your results. Trackster is not in the Galaxy tool menu at the left. Instead, it’s accessed in the Visualization tab at the top. It’s a zoomable linear visualization that allows you to see datasets from your history mapped onto your genome, as long as they are in the correct format to be visualized, and as long as you set them up correctly to be attached to your custom build. Right now all you’ve got to visualize is the GFF file that goes along with your genome.
Click ‘Visualization’ in the top bar to get a New Visualization. Then choose the data you want to load along with your genome:
The result will look something like this. I zoomed in quite a bit and changed the track visualization style to “Pack”:
Load the transcriptome data
There are 10 Synechocystis primary transcriptome samples from different conditions. These datasets are small enough to be uploaded to Galaxy manually, but you can also use FileZilla to FTP files to the Galaxy server (as described on the Galaxy Get Data/Upload File page).
Load all of the datasets to your Galaxy at once by clicking the checkboxes on the upload page, which will automatically detect the files you have recently uploaded. Make sure you connect them to your CMCP6 custom genome build.
Groom the data
Once you load the reads, the first thing that you have to do to every file is run FastQ Groomer. All data must be converted from whatever FASTQ format it’s in to Galaxy’s favorite FASTQ format (Sanger) before any of the other tools can be used. You’re eventually going to have to run it on each one of the fastq files in your history. Try out the sequence of events for for just one or two files first. When you’re running Galaxy manually, switch between the single sheet, multiple sheet, or folder icons to select one, multiple, or a pre-defined collection of files as input. Below I’ve picked the multiple sheet icon and selected my stationary and exponential phase files only.
Check the quality
You can use FastQC from within Galaxy to see a report for each of your datasets. The FastQC output feeds into graphics, and doesn’t feed into downstream processes, so this is sort of a branch off your implied workflow.
When you open the HTML file created at the FastQC step you’ll see the familiar FastQC report. These files are pretty high quality — they’re recent Illumina reads, and I’ve already filtered them with Trimmomatic and digitally normalized them.
Once you’re satisfied with your FastQC reports, you can hide them from your Galaxy history, since you won’t be using them again. Click the checkbox in the history panel. Checkboxes will appear at the side of all your history items. Select the FastQC reports, and then select “for all selected” and choose “hide datasets”.
Normally you’d insert a trim step here, but you’re working on data that I’ve previously trimmed with Trimmomatic, so you can skip it this time around.
The next step is to map the data. The standard pipeline in Galaxy for expression analysis is the Tuxedo pipeline — mapping with TopHat, transcript construction with CuffLinks, transcript analysis with CuffCompare, and differential expression with CuffDiff. This pipeline is a little overpowered for our bacterial gene expression data, because it’s geared to understanding complex transcripts (we generally use Bowtie, featurecounts, and EdgeR for similar tasks in bacteria) but since we know for sure the Tuxedo pipeline works on main and on local, we will make do and see what kind of information we can get from Tuxedo. It just won’t produce any splice junction information.
You have 10 samples, informatively labeled by me based on the conditions. Stationary phase and exponential phase are normal stages of bacterial growth with no additional stressors. Exponential phase is where the bacteria are busily growing and increasing their population exponentially; stationary phase is when they have grown enough that the population becomes resource limited and cells have started to die off to balance growth. Cold, heat, dark, high light (HL) and the other stressors are self-explanatory. The Galaxy server is a busy place and this may take a while to run, and you certainly won’t have time to look at all comparisons across all conditions. So, if you want to reduce your problem space a little bit, here’s what you should minimally try to accomplish:
Compare stationary vs. exponential; compare a condition and its opposite (e.g. heat vs. cold stress), compare a stress condition to stationary and to exponential:
- map reads (TopHat)
- construct transcripts (CuffLinks)
- analyze transcripts (CuffCompare)
- compute differential expression (CuffDiff)
TopHat runs on your reads and produces, among other things, a BAM file (binary alignment). There are a TON of input parameters, and you are not likely to get them “right” or to get the best results without reading up on them and learning from experience. For demonstration purposes you can run with the default parameters.
Your BAM file can also be converted into BAI and BigWig formats, and it can be displayed in Trackster along with the GFF annotations, so as a result you see where individual reads map.
CuffLinks is going to take that BAM file and try to create assembled transcripts out of it as well as calculating FPKM (fragments per kilobase per million reads) which is a normalized measurement of expression level.
CuffLinks will have several outputs, including a tabular gene expression file:
and assembled transcripts that you can compare to the existing annotations on your genome, in Trackster. You could also use the table manipulation tools (which we briefly saw in Galaxy 101) to make comparisons between these files.
CuffCompare takes your assembled transcriptome and compares it to the reference annotation in a statistically rigorous way.
There are several output files from CuffCompare. Key files that you will be interested in are the transcript accuracy summary, the tmap file, and the refmap file. The tmap file contains computed FPKMs. The other files are more useful if you have a eukaryotic genome and you have to track relationships between genes and multiple transcripts that map to them.
CuffDiff works on your expression values from CuffLinks and produces an analysis of differential expression. In our case the input GFF file can just be the genome GFF, because we didn’t make any novel discoveries with CuffCompare. Notice that you need to set up two (or more) conditions and connect your samples to the correct condition. Among the outputs, since we’re working with bacteria that don’t make multiple transcripts per gene, the gene differential expression results are going to be most useful.
Make a simple workflow
To make this procedure into a workflow before you try to repeat it on all your files, you can capture it. Go to the cog at the top of the history panel and select Extract Workflow. You’ll get a list of all the things you did in that history. You can select all the components that you want to re-use.
Next, go into the workflow viewer and see what your workflow looks like. Remember, I ran all the steps on just my stationary and my exponential phase, choosing two files at each step. Except for CuffDiff, which brings together the results from two separate threads in my workflow, you’ll see that each step is replicated twice even though I only actually went through them once. And there are also several inputs that don’t need to be there, because this particular history has all my fastq files loaded even though I only really used two of them.
There’s possibly Galaxy-fu that would let us avoid that and simplify things. Otherwise the CuffDiff comparisons, the only non-linear element, could be set up separately so that one linear workflow could be used on all the samples in a batch. But, let’s say we’re going to keep the workflow this way and set up pairwise comparisons between conditions, at this point. You can get rid of the unused inputs that are not connected to anything else with noodles to clean things up a bit.
Then try running the workflow on another pair of files.
Processing files in a batch
If you do want to try processing a bunch of files in a batch, here’s how to do it. Before switching to the workflow panel, create a new file list. Click the checkbox at the top of your history to do an operation on multiple files. Checkboxes will appear at the side of all the items in your history. Check only the checkboxes on the batch of fastq files that you just imported.
Then click “For All Selected” and choose the option to create a file list. For the operation we’re going to do (running Groomer and FastQC) each of the files can be created separately. We don’t have to pair them. You’ll get a new object in your history — a list. Annoyingly, you can’t rename it. This feature is still in development though.
Then go into the workflow panel and create a workflow. For example if you put the grooming and QC stages of your process into a new workflow, you’d need to collect an input list, groom your data, and send it to FastQC. So you’d have three items in your basic workflow. When you start with a collection of data as input, your workflow connectors (noodles) will be multiples automatically, and in the workflow boxes you’ll see “run in parallel over all items”.
Next run the workflow. On your input options page you’ll be asked to select a file list rather than individual files.
And you will get back multiple outputs when the run completes. This can rapidly get messy, because as you’ll see, your output files just refer back to the numbered items in your Galaxy history. They don’t refer back to the informative filenames that were originally uploaded. You can configure to avoid this problem by using the renaming tools at each workflow step.
To do this in an informative way, you need to rename your files using combinations of internal Galaxy variable names. This is a feature that is apparently still under development at the moment — Dannon Baker gave me some help with this over on the Galaxy Biostars forum but I can’t guarantee this formulation will work for all tools, and in fact I can tell you that the variable name he gave me seems not to work for the Tuxedo tools. If you don’t know the right internal variable names you’ll have to go back and manually rename items in your history after the workflow completes and before too many files proliferate. This is actually incredibly irritating (and it totally violates the 10 simple rules for reproducible computational research) if you’re trying to do anything but the most trivial analysis, but the Galaxy team is actively working on a solution that will let you have easier access to the internal variables that describe your data sets.