BINF 2111: Assembly at the Command Line (Lab)
Today’s lab task is preparation for developing a script that will assemble a batch of small genome sequence data sets. In our script development process there are four main steps:
- Figure out the pipeline that we want to execute, and why (covered in lecture on Tuesday)
- Run the series of steps in the pipeline by hand to make sure that they work, and make sure we understand the outputs
- Write a script that will do exactly what we just did manually, and make sure it works the same way
- Add variables and iterators to the script to allow us to automate the process and apply it to any input file
As we improve the script in subsequent weeks, you’ll learn about several important concepts that will recur in every programming language you learn — variables, test statements, conditionals and iterators.
What you should turn in today is
- a text file with a command line for each of the steps that you need to run to fetch, clean, assemble, and analyze the chloroplast genomes. You can definitely do this in six command lines, but there might be a way to do it in fewer.
- a QUAST report describing the quality of your assembly
- your numbers for the group comparison of results — largest contig, N50, genome fraction, #genes.
Step 1: Get your data
The NCBI Short Read Archive (SRA) contains next-gen sequencing data from thousands of projects and experiments. Students from the BINF 6350 Genomic laboratory course submitted their chloroplast genome sequencing data last year. The project included chloroplast sequence data from 12 different tomato cultivars. The genomes are very similar but they have a few variants that distinguish them.
In the first exercise we will use just one of the datasets. In the table linked above, get the Run number for the dataset labeled BC_30. Then we’ll use the fastq-dump command to get the file onto our own computers (note: if you have a problem getting the file with the version of fastq-dump that’s on the lab computers I will share it with you another way — but trust me this does work if you have the current version of the SRA Toolkit).
First check that you can get the file by downloading the first few lines. fastq-dump sends its output to standard output:
fastq-dump -X 5 -Z SRR1763780
Then get the whole thing and redirect it to a file with a fastq extension:
fastq-dump -Z SRR1763780 > BC_30.fastq
This chloroplast sample was from the German Cherry tomato.
You should look at the overall quality of the read set using the FastQC application that was shown in class. You should also look at the output from FastQC after each step to see how the quality profile of the reads changes when you clip and trim them.
Step 2: Clip adapters
With this whole pipeline of clipping and trimming options, you are balancing two concerns. On the one hand, you want to get rid of bad sequences. On the other hand, you don’t want to be too aggressive and end up with no data left. As you test out your pipeline, look at how the read counts change. Did you accidentally discard all your reads? Try again!
Adapters are little pieces of generic sequence that you definitely want to get rid of from your data. The command line application fastx-clipper will do this for you.
There are two common adapters that might be contaminating your data:
A1 CCATATCATCCCTGCGTGTCTCCCACTCAG TrP1 CCTCTCTATGGGCAGTCGGTGAT
So you will need to run the program twice, once for each adapter. Using the built in help for the fastx-clipper command, design an appropriate command line for clipping each adapter.
At minimum, you need to give fastx-clipper the following input options:
You might also want to use the verbose mode (-v), and try out requiring a minimum adapter alignment length (-M N) and a minimum output sequence length (-l N). For instance adding -M 6 to the command line would mean that adapters wouldn’t get clipped unless at least 6 nucleotides of the sequence align to them (what are the odds of this happening at random?). Adding -l 50 to the command line would mean that sequences would get discarded at this stage if they were already shorter than 50 nucleotides. You can try different values for these options and see how many of your sequences go away when you use them.
Step 3: Trim low-quality a) regions and b) sequences
fastx_trimmer gets rid of particular regions of sequences, while fastq_quality_filter gets rid of whole sequences based on whether they match a quality threshold.
For fastx_trimmer, you can make a judgement of where to start and end the sequence based on what you see in FastQC, as discussed in class. But at least make sure that you trim the reads at this stage so they are no longer than 300 nt. Anything longer than that represents an error in the recording of the sequencing process. So for example in this command line ‘end’ would be 300.
fastx_trimmer -v -f start -l end -i input.fastq -o output.fastq
For fastq_quality_filter the cutoffs are a little different. There’s a minimum quality score that must be reached and a certain percentage of the sequence must reach it. For instance, you could say that your quality score threshhold is 20, but only 50% of the sequence has to be that good. Or you could set a lower q-score threshhold but demand that more of the sequence match it. In a real study you would test different combinations and see what they do to the size of your data set! I’ll go over what we expect in quality scores for IonTorrent data in class.
fastq_quality_filter -v -q minimum -p percentage -i input.fastq -o output.fastq
Step 4: Assemble the genome
Once you finally have the data cleaned, you can assemble it using SPAdes. In past classes this has taken 15-20 minutes on our lab computers.
To make sure your command line is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are very different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:
- a special flag indicating the data is Ion Torrent sequence
- a flag indicating that the data is single-end
- a set of k-mer values that are specifically designed for Ion Torrent data
- the name of an output directory to write results to
spades.py --options -s infile -o outdir -k seed,sizes,comma,separated
Step 5: Check the quality of the assembly
You can use the program QUAST to analyze the quality of your assembly and to compare it to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report. An example command line for a data analysis with no reference genome might look like this:
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, you’ll need to run a slightly different command line. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
- an output directory to write results to
I will show you where to get a reference genome and GFF file during class. Your ending command line will look something like this:
quast -o outdir -R reference.fasta -G reference.gff spadesout/contigs.fasta
Try out each of those commands in order. Once you are satisfied with your results, turn in both a plain text file with your sequence of commands, and your ending QUAST file. Talk to the other students in the class and see how picking different options in the workflow changes the ending results in terms of how many contigs are produced, etc.