BINF 2111: Assembly at the command line
The most common things that you’ll want to do with shell scripts in bioinformatics are 1) data manipulation and 2) driving programs to run automatically, collecting their output, and feeding it into other programs.
To successfully build a script that drives a bioinformatics pipeline, you need to:
- understand the process you’re looking to execute
- identify appropriate programs for the process and for the data you have
- understand the inputs, outputs, and tuneable parameters of each program
Then you need to test the sequence of commands once separately, and once as a pipeline, before you start doing things like setting up repeating loops or passing whole lists of filenames to your script.
You have a large number of chloroplast genome sequences to process. They need to be cleaned and trimmed using the FASTX Toolkit, which is installed on your computer already.
The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options that you must research (yes, you’re going to have to RTFM a little bit).
Once you have assemblies for each chloroplast genome data set, you can analyze them in QUAST, which will give you a report on the quality of the assembly.
The first step in making this an automated software pipeline that can be run over and over again on many files, is to figure out the series of command lines that you will need, and test that they work by running them in sequence on one of the chloroplast data sets. Then next week we’ll turn them into a script and introduce variables into our scripting process so you can easily run the script on multiple files.
What you should turn in today is
- a text file with a command line for each of the steps that you need to run to clean, assemble, and analyze the chloroplast genomes. You can definitely do this in six command lines, but there might be a way to do it in fewer.
- a QUAST report describing the quality of your assembly
- your numbers for the group comparison of results — largest contig, N50, genome fraction, #genes.
Developing a pipeline for assembly
You need to do the following steps — in order.
- clip residual sequencing adapters
- trim off parts of sequences that are clearly too long and possibly a few bases at the beginning of the sequence if they look anomalous
- filter the sequence that’s left to make sure none of it is too low-quality
Step 1: Clip adapters
You figured out how to do this on your quiz this morning. At this step (which is actually two steps) make sure that you
- get rid of the A1 and TrP1 adapters in your adapters.tab file
- retain sequences with occasional Ns (if you have any — do you even really need to use this flag?)
- keep ALL of the sequences, whether there was adapter found in them or not
- but throw out sequences that get clipped down to less than 50nt at this step
Step 2: Trim for length
At this step you’ll get rid of any sequences that are longer than they should be. Anything that has more than 350 nucleotides definitely needs to get its tail end trimmed off. You may want to trim a little off the front end of the sequence as well, and if it looks like the tail end of the sequences is always bad, you may want to trim to 300 instead of 350, for instance.
Step 3: Filter for quality
As we discussed a little bit in class, the Ion Torrent quality scores are calibrated a little bit low. You don’t want to be too strict with your trimming or you won’t have any sequences left. The combination of factors here is
- what is your quality score threshhold
- how much of the sequence has to meet that threshhold
For instance, you could say that your quality score threshhold is 20, but only 50% of the sequence has to be that good. Or you could set a lower q-score threshhold but demand that more of the sequence match it. If you have time to test different combinations and see what they do to the size of your data set, do it!
Step 4: Assemble
To do this we’ve got to install SPAdes. On my Mac, I was able to do this following exactly the installation instructions in the manual. I compiled rather than downloading binaries, and I tested the compile using the test data set distributed with the software, as instructed. However, if you have trouble with the compile you can try using the binaries directly.
SPAdes itself is a python pipeline that hides the execution of several steps. You can optionally turn its error correction step on and off, you can run only assembly, you can restart the pipeline if it stops for some reason.
SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used below were to specify that I had one library (pe1) with two types of reads — a paired-end interleaved set, –pe1-12 — and a paired-end single set –pe1-s. If I wanted to add my old Trimmomatic orphans in, I could concatenate them all into one single reads file at this point.
Here’s what a SPAdes command line would look like if I were assembling paired end Illumina data:
spades.py --pe1-12 JY1305-thirdpass.pe.fastq --pe1-s JY1305-thirdpass.se.fastq -o JY1305spades
To make sure your pipeline is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are very different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:
- a special flag indicating the data is Ion Torrent sequence
- a flag indicating that the data is single-end
- a set of k-mer values that are specifically designed for Ion Torrent data
- the name of an output directory to write results to
Step 5: Assembly analysis with QUAST
You can use the relatively new program QUAST to analyze the quality of your assembly and to compare it to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report.
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, you’ll need to run a slightly different command line. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
- an output directory to write results to
Valuable UNIX user tip — aliases
On p. 308 and 309 in your book, Haddock and Dunn cover the concept of aliases. An alias is a short way of referring to a command you use a lot so you don’t have to type it over and over. Try out the examples in that section and think about how you might use them as you go through the exercise. You can set an alias at the command line, which will apply only in the window you are working in and only until you close it, or you can make your alias permanent by putting it in your ~/.bash_profile.
Say you’ve got Quast installed in a directory that’s not in your path and you don’t want to add a whole bunch of nonstandard locations to your path but you also don’t want to move all Quast’s stuff to /usr/local:
…which is a command fragment that you can follow with Quast’s regular command line options.