UPDATED: Sequence Assembly
I’ve asked Juan to upload data to the class directory on the Centaurus cluster (hpc-student.uncc.edu). So instead of transferring from Dropbox, you can go directly to the cluster and copy the data you need into your working directory. You will find:
- sequence reads from one of the better chloroplast genome samples generated with our Ion Torrent instrument
- ERR008613 (a set of paired end Illumina sequence reads from ends of 200bp E. coli fragments)
- ERR022075 (a set of paired end Illumina sequence reads from ends of 600bp E. coli fragments)
- sets of PacBio CCS and CLR reads for E. coli
In the first part of the exercise, you will assemble the Ion Torrent reads alone to see what kind of assembly you get for the chloroplast genome, and to learn the basics of SPAdes.
In the second part, you will progressively introduce long-fragment and long-read data to your E. coli assembly, to see how the addition of these data types changes the assembly results.
Assemble the chloroplast genome
There is just one file to work with for this exercise, BC30_BINF6350_Summer2014_13pm.fastq. Feel free to rename it to something shorter like BC30.fastq.
The chloroplast genome data wasn’t generated in a truly random manner — it was generated by sequencing overlapping PCR amplicons that are expected to cover the entire genome. So you may notice some unusual behavior (completely unsequenced regions, or uneven coverage) in these data sets. This data assembles quickly enough that you can easily try out different methods and parameters.
To run the SPAdes assembler on mamba, you’ll need to type:
module load spades
Available software on the clusters is organized in modules that you can load as needed so you always need to load them before running. Once you load the module, you can type:
To see all of the command line help available with the program. Type:
This verifies that the assembler is working correctly
The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options. SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used below were to specify that I had one library (pe1) with two types of reads — a paired-end interleaved set, –pe1-12 — and a paired-end single set –pe1-s.
Here’s what a SPAdes command line might look like if I were assembling paired end Illumina data:
spades.py --pe1-12 JY1305.pe.fastq --pe1-s JY1305.se.fastq -o JY1305spades
To make sure your command line is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:
- a special flag indicating the data is Ion Torrent sequence
- a flag indicating that the data is single-end
- a set of k-mer values that are specifically designed for Ion Torrent data
- the name of an output directory to write results to
spades.py -s BC30.trim.fastq –iontorrent -k 21,33,55,77,99,127 -o BC30spades
WARNING: You can’t just run the jobs on Centaurus with a direct command line command. You need to submit your job with a simple slurm script following the instructions here.
SPAdes will run very quickly on the Ion Torrent data, so you can test out at least two sets of parameters and compare the results. For instance, you can test to see what the difference in your outcome is if you use the –careful option or not. Or you can choose to run –only-assembler and not the built-in read error correction, or see what happens if you use the raw data with and without QC.
Assemble the e coli genome
In the class directory there are several files, including two pairs of SRA files from paired-end Illumina sequencing of E. coli K-12 strain MG1655 (the common lab strain) and two files of Pacific Biosciences long reads. There are two different types of Pac Bio reads. The *CLR.fastq includes very long reads, which tend to be highly error prone. The *CCS.fastq file contains reads derived from Pac Bio’s continuous circular amplification process, which samples a sequence multiple times in the same sequencing run. Both kinds of reads could be added to an assembly of the Illumina data to help reduce the number of contigs and close the genome.
Use SPAdes to assemble the E. coli data with and without Pac Bio sequence. A complete and closed reference sequence for this genome is available in the NCBI Genomes database, so you can evaluate your results against a very well-known reference. Files describing the reference (NC_000913 files) are found in the dropbox.
- In SPAdes: there is a special command line flag to add pacbio sequences to your assembly. The manual describes more than one way to handle them. The manual is concise and easy to follow, so you should take a look at it and design appropriate command lines.
Move results from the cluster to your own computer
To create a tar file (so you can move all your files at once) type:
tar -cvf archivename.tar filenames
To transfer files from mamba to your own computer, use sftp to connect, and the command:
sftp> get archivename.tar
Analyze assembly quality with QUAST
You can use the QUAST applicaton to analyze the quality of your assemblies and to compare to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install on your own computer or a lab Mac using conda, and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report. This is an example of what a QUAST command line will look like.
~/Applications/quast-2.3/quast.py -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, you’ll need to run a slightly different command line. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
- an output directory to write results to
Your lab report should include a description of your starting data, the methods you used to approach it (including any parameters that deviate from default or anywhere a choice needed to be made), and your results for each assembly, with discussion of what worked and did not work. Present a comparison of the different combinations of input data that you tried. The assigned GAGE-B paper is excellent for inspiration if you are wondering how to present results.