BINF 6203: NGS Genome Assembly
Part 1: assemble a chloroplast genome
Today’s first challenge is to assemble the chloroplast genome data that was generated by the BINF 6350 class. You do not have to assemble all of the chloroplasts — pick one file to work with for this exercise. If I were choosing, I’d probably look for one of the larger files (as it will contain more reads) and also verify the quality with FastQC to make sure that the data quality isn’t particularly bad.
The chloroplast genome data wasn’t generated in a truly random manner — it was generated by sequencing overlapping PCR amplicons that are expected to cover the entire genome. So you may notice some unusual behavior (completely unsequenced regions, or uneven coverage) in these data sets.
This data assembles quickly enough that you can easily try out different methods and parameters while we’re there to help you in the lab.
To assemble this data, you can use the SPAdes assembler and the QUAST assembly analysis program to perform the assembly and check the quality.
The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options. I would like you to look at the help for SPAdes and see if you can figure out the appropriate parameters for Ion Torrent. SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used below were to specify that I had one library (pe1) with two types of reads — a paired-end interleaved set, –pe1-12 — and a paired-end single set –pe1-s.
Here’s what a SPAdes command line might look like if I were assembling paired end Illumina data:
spades.py --pe1-12 JY1305.pe.fastq --pe1-s JY1305.se.fastq -o JY1305spades
To make sure your pipeline is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are very different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:
- a special flag indicating the data is Ion Torrent sequence
- a flag indicating that the data is single-end
- a set of k-mer values that are specifically designed for Ion Torrent data
- the name of an output directory to write results to
SPAdes will run very quickly on the Ion Torrent data, so you can test out at least two sets of parameters and compare the results. For instance, you can test to see what the difference in your outcome is if you use the –careful option or not. Or you can choose to run –only-assembler and not the built-in read error correction, or see what happens if you use the raw data without QC and with.
You can use the QUAST to analyze the quality of your assembly and to compare it to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report. This is an example of what a QUAST command line will look like.
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, you’ll need to run a slightly different command line. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
- an output directory to write results to
Part 2: on your own — assemble an E. coli genome, with and without Pac Bio long reads
In the Dropbox (folder E.coli+ PacBio) there are several files, including two SRA files from paired-end Illumina sequencing of E. coli K-12 strain MG1655 (the common lab strain) and several files of Pacific Biosciences long reads. There are two different types of Pac Bio reads. The *CLR.fastq includes very long reads, which tend to be highly error prone. The *CCS.fastq file contains reads derived from Pac Bio’s continuous circular amplification process, which samples a sequence multiple times in the same sequencing run and is less error-prone. Both kinds of reads could be added to an assembly of the Illumina data to help reduce the number of contigs and close the genome.
Use SPAdes to assemble the E. coli data with and without Pac Bio sequence. A complete and closed reference sequence for this genome is available in the NCBI Genomes database, so you can evaluate your results against a very well-known reference.
- In SPAdes: there is a special command line flag to add pacbio sequences to your assembly.
- In CLC Genomics: Pac Bio reads can be imported as if they were Illumina 1.8 reads.
Either of these assemblers will run relatively quickly on a cleaned Illumina paired-end data set of the size we have to work with. If you find that your assembly is running very slowly, you could extract a sub-sample of the reads using seqtk, as described in the first section of the blog post on taming Illumina datasets. Sampling correctly from a paired-end data set should give you a reasonable representation of how the data will behave, especially if the original sample is extremely deep.
Your lab report should include a description of your starting data, the methods you used to approach it (including any parameters that deviate from default or anywhere a choice needed to be made), and your results for each assembly, with discussion of what worked and did not work. Present a comparison of the different methods that you tried, in terms of the common metrics of assembly quality described in class. The assigned GAGE-B paper is excellent for inspiration if you are wondering how to present results.