BINF 6215: Figuring out UNIX command line software
The most common things that you’ll want to do with shell scripts in bioinformatics are 1) data manipulation (which is what we practiced this morning) and 2) driving programs to run automatically, collecting their output, and feeding it into other programs. The little UNIX commands that are part of your operating system are powerful, but usually you’ll end up wanting to run programs that other people have developed.
In order to successfully build a script that drives a bioinformatics pipeline, you first have to understand the process you’re looking to execute, identify appropriate programs for that process and for the data you have, and then understand the inputs, outputs, and tunable parameters of each program. Then you need to test the sequence of commands, first individually and then as a pipeline, before you start doing things like setting up repeating loops or passing whole lists of filenames to your script.
Developing a pipeline — walk-through example
What I want to do is take the sequence data that I have for Vibrio vulnificus JY1305, and see if I can improve the genome assembly by using a new process called digital normalization. I’m also going to switch to using the SPAdes assembler, rather than Velvet, which was used when we originally put this genome together back in 2012.
I have one Illumina lane of paired-end sequence data available. The reads are on average about 100 nt in length. I don’t have any mate pair sequence or long reads to help boost the assembly, and although I have a pretty closely related reference genome, one of my objectives in the study is to assemble regions that don’t map anywhere in that reference genome, because bacteria tend to move genetic material around a lot by horizontal transfer.
If my pipeline is successful, I can use it to improve assembly of a few dozen other Vibrio vulnificus genomes that we have data available for, so I definitely want to automate this process.
Valuable UNIX user tip — aliases
On p. 308 and 309 in your book, Haddock and Dunn cover the concept of aliases. An alias is a short way of referring to a command you use a lot so you don’t have to type it over and over. Try out the examples in that section and think about how you might use them as you go through the exercise. You can set an alias at the command line, which will apply only in the window you are working in and only until you close it, or you can make your alias permanent by putting it in your ~/.bash_profile.
Say you are doing your work in a directory called ~/Documents/DATA/NGS/chloroplast and you don’t want to type it out every time you want to change to that directory because even with filename completion, it’s a pain. You could make an alias!
alias gocp="cd ~/Documents/DATA/NGS/chloroplast"
Say you’ve got Quast installed in a directory that’s not in your path and you don’t want to add a whole bunch of nonstandard locations to your path but you also don’t want to move all Quast’s stuff to /usr/local:
alias quast='~/Applications/quast-2.3/quast.py'
…which is a command fragment that you can follow with Quast’s regular command line options.
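If you decide an alias is worth keeping, you can make it permanent by appending it to your ~/.bash_profile and re-reading that file. For example, using the gocp alias from above:

echo 'alias gocp="cd ~/Documents/DATA/NGS/chloroplast"' >> ~/.bash_profile
source ~/.bash_profile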
Identifying the software
I know that I want to digitally normalize my data with khmer and assemble it with SPAdes. But I also have to get the data ready to input to those programs, and I probably want to have some way of filtering and examining my output, even if it’s rudimentary. The authors of khmer recommend trimming for sequence quality prior to running that program, and I know that if nothing else I can use a long sequence alignment program like MUMmer to compare my contigs to other contigs, or to a reference genome. So my basic process is:
- Trim (Trimmomatic)
- Normalize (khmer tools)
- Assemble (SPAdes)
- Filter (bioawk)
- Align (QUAST)
- Display (QUAST)
If I look more closely at each of these steps, I may find that there are multiple elements to each step, and that there are glue steps in between that just involve converting files from one format to another to be acceptable to the next program.
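To keep the overall shape in mind as we go, here is a rough skeleton of the script we’re working toward; the step names come from the list above, and the actual commands get filled in as we work through the steps below:

#!/bin/bash
# Step 1: trim reads for quality and adapters (Trimmomatic)
# Step 2: interleave, digitally normalize, and abundance-filter the reads (khmer)
# Step 3: assemble the normalized reads (SPAdes)
# Step 4: filter short contigs out of the assembly (bioawk)
# Steps 5-6: align to a reference and display assembly statistics (QUAST)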
Step 1: Trim
There are a lot of different trimming programs out there. Since we have paired end Illumina data in this case, it’s appropriate to use Trimmomatic, a Java program that should already be installed on your computer, most likely in the system Applications folder. Trimmomatic can perform several needed operations in a single command line. It is the subject of a published, peer-reviewed paper in the journal Bioinformatics, which means that it’s been vetted and most likely will give you trustworthy results.
Let’s say we want to get rid of adapter sequences, trim really bad quality sequences from the beginning and end of the read, use a sliding window to find where lower-quality sequences start in the read and trim the rest of the read after quality falls below a given threshold, and then drop the read if all that trimming has made it too short.
The relevant Trimmomatic options are:
- ILLUMINACLIP: Cuts adapter and other illumina-specific sequences from the read. The Trimmomatic authors distribute adapter reference sequences along with the software, and these should be found in a subdirectory in the Trimmomatic directory. Because the sequence we have was generated using an up-to-date Illumina pipeline, we can use the “TruSeq3-PE.fa” adapter file. This should always be done first.
- SLIDINGWINDOW: Performs a sliding window trimming, cutting once the average quality within the window falls below a threshold.
- LEADING: Cuts bases off the start of a read, if below a threshold quality
- TRAILING: Cuts bases off the end of a read, if below a threshold quality
- MINLEN: Drops the read if it is below a specified length
The application needs to know if your sequence is paired end (so it can properly handle orphaned reads after trimming) and what the quality score encoding is. Remember, some older Illumina data may be Phred64 encoded, but with this dataset we can use Phred33.
The application needs two un-interleaved fastq input files, one for your forward and one for your reverse reads. It needs to know where you want to place the output. It will produce four output files — forward and reverse paired read files (that can later be interleaved if needed) and forward single and reverse single files containing your orphans.
So, the resulting command line (all one continuous line) will be:
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE -phred33 100_1.fastq 100_2.fastq JY1305_forward_paired.fastq JY1305_forward_unpaired.fastq JY1305_reverse_paired.fastq JY1305_reverse_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Step 2: Normalize
I wrote up the digital normalization install process HERE. You do have to install this one in your own Applications directory. It should go smoothly with the user permissions we’ve been given. Install it and take a look at what is part of the package, and start the khmerEnv environment.
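If khmerEnv was set up as a Python virtual environment, as in that write-up, starting it just means sourcing its activate script; adjust the path if you created the environment somewhere else:

source ~/khmerEnv/bin/activate

Your prompt should change to show that the environment is active, and typing deactivate turns it back off.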
The key function that we really want to be able to do is to “normalize by median”. That script needs an interleaved fastq file to run, so we first have to interleave the paired fastq files with another khmer script, interleave-reads.py. That script takes two input files and writes them to a single output file. The -o flag marks the output file, and comes directly before the output filename.
interleave-reads.py JY1305_forward_paired.fastq JY1305_reverse_paired.fastq -o JY1305-trimmed-interleaved.fastq
Next we start the real normalization. This requires several inputs. You first need the -p flag to tell the program that your data is interleaved paired end. khmer gets rid of some of your data based on how many times particular k-mers are represented in the data set; the idea is that excessive depth makes your de Bruijn graph more complex (thereby slowing down assembly) without really providing any novel information. So you need to set a k-mer size (-k) and a k-mer coverage threshold (-C). The authors suggest that for the first pass to prepare data for assembly, both of these can be set to 20. Then you need to specify the number and size of k-mer tables that the program uses. Basically, here you are specifying how much of your system’s RAM the program is allowed to use while running. The bigger the better; too small and the run will fail. On my laptop, with 16 GB of RAM, I ran these with 4 tables of 3e9 (3 GB) each, and it was slow, but doable. And then of course you need to specify an output (-o flag) and give the name of the input file.
normalize-by-median.py -p -k 20 -C 20 --n_tables 4 --min-tablesize 3e9 -o JY1305-firstpass.pe.fastq JY1305-trimmed-interleaved.fastq
You should also do this for each of the single-end read files from the Trimmomatic trim process if you want to include them in your assembly. The command lines will be slightly different — they will not have the -p flag that indicates paired-end data.
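For example, a first-pass command for one of the Trimmomatic orphan files might look like the following; the output filename here is just my own naming choice:

normalize-by-median.py -k 20 -C 20 --n_tables 4 --min-tablesize 3e9 -o JY1305-firstpass.forward.se.fastq JY1305_forward_unpaired.fastq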
Then for each input, you need to do abundance filtering. But as we can find out by typing “filter-abund.py -h” at the command line, to do abundance filtering, you first need to make a counting table by running load-into-counting.py.
Both of these programs need a kmer size (-k, recommended value 20) and a table size, just like normalize-by-median.py.
load-into-counting.py -k 20 -x 2e9 JY1305table.kh JY1305-firstpass.pe.fastq
filter-abund.py -k 20 --n_tables 4 --min-tablesize 1e9 -C 1 -o JY1305-secondpass.fastq JY1305table.kh JY1305-firstpass.pe.fastq
The filtering process is going to end up dropping some halves of read pairs out of your data. In order to get back to a properly paired input data set for the next steps, you actually need to re-extract just the pairs using another khmer script called extract-paired-reads.py:
extract-paired-reads.py JY1305-secondpass.fastq
This will produce a paired-end and a single-end output file, both of which need to be normalized again. If you’ve been paying attention to file proliferation here, you might realize that in addition to your paired-end data there are now quite a few single-end files produced at different points in the process — two from Trimmomatic and one from khmer — and that each of those is going to have its own downstream workflow that needs to be scripted. You could make the executive decision to drop these single-end reads and only work with the reads that stay paired until the bitter end, but if you decide to keep them, your next workflow step is a re-normalization (down to a k-mer coverage threshold of 5) for each set.
normalize-by-median.py -p -k 20 -C 5 --n_tables 4 --min-tablesize 1e9 -o JY1305-thirdpass.pe.fastq JY1305-secondpass.fastq.pe
normalize-by-median.py -k 20 -C 5 --n_tables 4 --min-tablesize 1e9 -o JY1305-thirdpass.se.fastq JY1305-secondpass.fastq.se
Plus, the forward and reverse unpaired read files from Trimmomatic need to go through this same renormalization if you kept them alive throughout.
Fortunately, assembly with SPAdes is going to bring these read files back together at the next step in your workflow.
Step 3: Assemble
Now we’ve got to install SPAdes. On my Mac, I was able to do this following exactly the installation instructions in the manual. I compiled rather than downloading binaries, and I tested the compile using the test data set distributed with the software, as instructed. However, if you have trouble with the compile you can try using the binaries directly.
There are other assemblers installed on your computer, including Velvet, which runs with the help of a script called VelvetOptimiser, and you could certainly use Velvet for these reads, but as I was testing out options leading up to this class, I got the best assembly out of SPAdes. SPAdes is gaining a reputation as a good assembler and performed well on the GAGE-B assembly benchmarks. Here’s what the command line would look like if I didn’t keep anything but the paired end file from my Step 1 trim alive to this point in the analysis:
spades.py --pe1-12 JY1305-thirdpass.pe.fastq --pe1-s JY1305-thirdpass.se.fastq -o JY1305spades
SPAdes itself is a Python pipeline that hides the execution of several steps. You can optionally turn its error correction step on and off, run only the assembly stage, or restart the pipeline partway through. SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used here specify that I have one library (pe1) with two types of reads: a paired-end interleaved set (--pe1-12) and a paired-end single set (--pe1-s). If I wanted to add my old Trimmomatic orphans in, I could concatenate them all into one single reads file at this point.
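As a sketch of that last option: assuming I had carried the two Trimmomatic orphan files through the same normalization steps and ended up with files named like the placeholders below, I could combine everything single-ended into one file and hand it to SPAdes with the same --pe1-s flag:

cat JY1305-thirdpass.se.fastq JY1305-forward-orphans-final.fastq JY1305-reverse-orphans-final.fastq > JY1305-all-single.fastq
spades.py --pe1-12 JY1305-thirdpass.pe.fastq --pe1-s JY1305-all-single.fastq -o JY1305spades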
For future reference, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are different than the default advice for shorter Illumina reads. This will be relevant to your interests.
Step 4: Filter
If you look at your SPAdes assembly, you’ll find that (with parameters set as default) you have a lot of short little contigs. But maybe those aren’t so interesting, actually. At least at first. Fortunately, the other day we learned how to filter FASTA sequences by length using a one-line command in bioawk. Say I want to filter the FASTA file containing my output scaffolds and keep only those that are longer than 1000 nt. I could use a variation on that bioawk command:
bioawk -c fastx '{ if(length($seq) > 1000) { print ">"$name; print $seq }}' scaffolds.fasta > long.scaffolds.fasta
This will probably strip out some sequence that I’d be interested in keeping, but it will simplify the results you have to look at in the next step, so we’ll keep the threshold at 1000. For perspective, CLC’s default threshold for discarding contigs is 200 nt, the suggested threshold in QUAST (see below) is 500, and you can also discard on the basis of things like coverage or how many reads make up a scaffold.
Steps 5 and 6: Assembly analysis with QUAST
You could, if you want to, use the relatively new program QUAST to analyze the quality of your assembly and to compare it to a reference genome or to other assemblies, rather than doing manual filtering and alignment in MUMmer. Like SPAdes, QUAST has a lot of options, and it is actually a Python pipeline that hides a lot of action. It’s easy to install and comes with built-in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report, as well as providing a command line option to run the same contig length filter we just applied. I used it to analyze the SPAdes assembly of JY1305, with the command line:
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 --est-ref-size 5100000 --gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser, shown partially below.
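QUAST writes the interactive report as report.html inside the output directory, so on a Mac you can open it straight from the command line:

open JY1305quast/report.html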
Your mission
You have a large number of chloroplast genome sequences to process. They do need to be cleaned and trimmed, but Trimmomatic doesn’t work on Ion Torrent data and khmer is not really necessary for data sets that are this small. You need to identify an appropriate program that will do your trimming for you. You might want to take a look at some of the other commands in seqtk for this, or at the FASTX Toolkit, which is (supposed to be) installed on your computer already.
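As one possible starting point (and only that), seqtk has a trimfq command that does quality trimming based on a Phred-scaled error threshold; the filenames below are placeholders, and choosing sensible parameters for Ion Torrent data is part of your job:

seqtk trimfq -q 0.05 chloroplast_reads.fastq > chloroplast_reads.trimmed.fastq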
The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options that you must research, and once you have assemblies for each chloroplast genome data set, you can analyze them in QUAST — except that you’ll have a reference genome to compare to, so you’ll have to modify your command line accordingly. QUAST will even do some basic annotation for you using GeneMark, and you can input annotations associated with your reference FASTA file to check them against.
Work out the series of command lines that you will need and run them in sequence on one of the chloroplast data sets. Then tomorrow we’ll turn them into a script and introduce variables into our scripting process so you can easily run the script on multiple files.