BINF 6203: Lab 3 — Genome assembly
Part 1: assemble a chloroplast genome
Today’s first challenge is to assemble the chloroplast genome data that was generated by the BINF 6350 class. You do not have to assemble all of the chloroplasts — pick one file to work with for this exercise. If I were choosing, I’d probably look for one of the larger files (as it will contain more reads) and also verify the quality with FastQC to make sure that the data quality isn’t particularly bad.
The chloroplast genome data wasn’t generated in a truly random manner — it was generated by sequencing overlapping PCR amplicons that are expected to cover the entire genome — but assuming it was, which equation in the Roach paper could you use to figure out the probability that your genome would close, with the amount of data that you have? You might need to make one or two assumptions but can you estimate this probability for your data set, assuming that the coverage from each of the overlapping amplicons is pretty equal?
Based on what you find out about your dataset below, is this a good assumption to make?
There are two different ways you can assemble — pick one
This data assembles quickly enough that you can easily try out different methods and parameters while we’re there to help you in the lab.
1) If you are comfortable at the UNIX command line, you can use the SPAdes assembler and the QUAST assembly analysis program to perform the assembly and check the quality.
The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options. SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used below were to specify that I had one library (pe1) with two types of reads — a paired-end interleaved set, –pe1-12 — and a paired-end single set –pe1-s.
Here’s what a SPAdes command line might look like if I were assembling paired end Illumina data:
spades.py --pe1-12 JY1305.pe.fastq --pe1-s JY1305.se.fastq -o JY1305spades
To make sure your pipeline is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are very different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:
- a special flag indicating the data is Ion Torrent sequence
- a flag indicating that the data is single-end
- a set of k-mer values that are specifically designed for Ion Torrent data
- the name of an output directory to write results to
If you solve the problem with SPAdes, test out at least two sets of parameters and compare the results. For instance, you can test to see what the difference in your outcome is if you use the –careful option or not. You can choose to run –only-assembler and not the built-in read error correction.
2) If you want to use CLC Genomics, you can open your chloroplast *.fastq files there instead. You have the option in CLC to either assemble the genome de novo, or to align it against a reference and save the consensus sequence from the alignment. Try both of these methods in CLC. In your analysis, you should comment on the observed genomic coverage of the mapped reads and how/whether there seems to be any correspondence between your highly covered regions and your large contigs.
In CLC, assembly is found in the Tools/de novo Sequencing menu pulldown. You will need to import your data first (remember to use the correct data type when you import — Ion Torrent for the chloroplast *.fastq, but standard import for the reference genome file). Once you have imported your fastq file you’ll be able to run de novo assembly on it.
CLC also has a “Trim Sequences” tool that will perform some kinds of trimming. Trim Sequences is under NGS Core Tools as well. If the chloroplast data file that you load is not one that you’ve already trimmed, you may want to consider using Trim Sequences before you assemble.
Read mapping is found in the Tools/NGS Core Tools menu, as is “Extract Consensus”.
How do you know if your assembly is good?
- Smaller number of contigs
- Larger N50
- More assembled sequence (closer to the anticipated size of your genome)
- Fewer Ns
- High percentage of genes recovered, if you have a reference assembly to compare to as we do for the chloroplast
You can use the QUAST to analyze the quality of your assembly and to compare it to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report.
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, you’ll need to run a slightly different command line. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
- an output directory to write results to
Part 2: on your own — assemble an E. coli genome, with and without Pac Bio long reads
In the Dropbox (folder E.coli+ PacBio) there are several files, including two SRA files from paired-end Illumina sequencing of E. coli K-12 strain MG1655 (the common lab strain) and several files of Pacific Biosciences long reads. There are two different types of Pac Bio reads. The *CLR.fastq includes very long reads, which tend to be highly error prone. The *CCS.fastq file contains reads derived from Pac Bio’s continuous circular amplification process, which samples a sequence multiple times in the same sequencing run and is less error-prone. Both kinds of reads could be added to an assembly of the Illumina data to help reduce the number of contigs and close the genome.
Use either SPAdes or CLC Genomics to assemble the E. coli data with and without Pac Bio sequence. A complete and closed reference sequence for this genome is available in the NCBI Genomes database, so you can evaluate your results against a very well-known reference.
- In SPAdes: there is a special command line flag to add pacbio sequences to your assembly.
- In CLC Genomics: Pac Bio reads can be imported as if they were Illumina 1.8 reads.
Either of these assemblers will run relatively quickly on a cleaned Illumina paired-end data set of the size we have to work with. If you find that your assembly is running very slowly, you could extract a sub-sample of the reads using seqtk, as described in the first section of the blog post on taming Illumina datasets. Sampling correctly from a paired-end data set should give you a reasonable representation of how the data will behave, especially if the original sample is extremely deep.
Your lab report should include a description of your starting data, the methods you used to approach it (including any parameters that deviate from default or anywhere a choice needed to be made), and your results for each assembly, with discussion of what worked and did not work. Present a comparison of the different methods that you tried, in terms of the common metrics of assembly quality described in class. The assigned GAGE-B paper is excellent for inspiration if you are wondering how to present results.