Sequence Assembly

Sequence Assembly

For today’s class, I have prepared a shared dropbox folder that contains the following items:

  • sequence reads from one of the better chloroplast genome samples generated with our Ion Torrent instrument
  • ERR008613 (a set of paired end Illumina sequence reads from ends of 200bp E. coli fragments)
  • ERR022075 (a set of paired end Illumina sequence reads from ends of 600bp E. coli fragments)
  • sets of PacBio CCS and CLR reads for E. coli

In the first part of the exercise, you will assemble the Ion Torrent reads alone to see what kind of assembly you get for the chloroplast genome, and to learn the basics of SPAdes.

In the second part, you will progressively introduce long-fragment and long-read data to your E. coli assembly, to see how the addition of these data types changes the assembly results.

Log in and upload files to mamba

The files you will need can be found in Dropbox at this link. You can take them one by one or use the compressed archive files. sequences.tar.gz contains all of the sequence files you will need for the lab, and references.tar.gz contains reference genome sequence and annotation information for assembly quality checking.

To move files to mamba, you can use the sftp or scp protocols. To use sftp, type:


— this connects you to the host

Then, at the sftp> prompt, type “put filename” or “mput filenames” to move files. If you transfer just the sequences.tar.gz file, you would type “put sequences.tar.gz”.

Once the transfer is done, type quit at the sftp prompt, and reaccess mamba using the command “ssh mamba”. This will give you a regular UNIX command line interface to mamba. You should be able to type “ls” in your home directory and see the sequences.tar.gz archive file. Extract it by using the commands:

gunzip sequences.tar.gz

and then

tar -xvf sequences.tar

The sequence files are about 2.6 GB total. For reference, your quota on mamba is 150GB.

Downloading these files can take some time, as can transferring the sequences to mamba from your home computer. It’s probably best to do it from a fast connection and not over public wifi. If you want to you can extract the archive on your own computer or get the BC30 file on its own and transfer that first so you can do the assembly of the chloroplast while waiting for the other files to transfer.

Assemble the chloroplast genome

There is just one file to work with for this exercise, BC30_BINF6350_Summer2014_13pm.fastq. Feel free to rename it to something shorter like BC30.fastq.

The chloroplast genome data wasn’t generated in a truly random manner — it was generated by sequencing overlapping PCR amplicons that are expected to cover the entire genome. So you may notice some unusual behavior (completely unsequenced regions, or uneven coverage) in these data sets.  This data assembles quickly enough that you can easily try out different methods and parameters.

To run the SPAdes assembler on mamba, you’ll need to type:

module load spades

Available software on the clusters is organized in modules that you can load as needed so you always need to load them before running. Once you load the module, you can type:

To see all of the command line help available with the program. Type: --test

This verifies that the assembler is working correctly

The SPAdes assembler will work on Ion Torrent data, but with a specialized set of command line options. SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). The options I used below were to specify that I had one library (pe1) with two types of reads — a paired-end interleaved set, –pe1-12 — and a paired-end single set –pe1-s.

Here’s what a SPAdes command line might look like if I were assembling paired end Illumina data: --pe1-12 --pe1-s -o JY1305spades

To make sure your command line is correct for your Ion Torrent data, note that the SPAdes authors provide special instructions specifically for Ion Torrent Reads, which are different than the default advice for shorter Illumina reads. Make sure your own command line is correct. Your command line needs:

  • a special flag indicating the data is Ion Torrent sequence
  • a flag indicating that the data is single-end
  • a set of k-mer values that are specifically designed for Ion Torrent data
  • the name of an output directory to write results to -s BC30.trim.fastq –iontorrent -k 21,33,55,77,99,127  -o BC30spades

WARNING:  You can’t just run the jobs on mamba with a direct command line command. You need to submit your job with a simple qsub script following the instructions here.

SPAdes will run very quickly on the Ion Torrent data, so you can test out at least two sets of parameters and compare the results. For instance, you can test to see what the difference in your outcome is if you use the –careful option or not. Or you can choose to run –only-assembler and not the built-in read error correction, or see what happens if you use the raw data with and without QC.

Assemble the e coli genome

In the Dropbox there are several files, including two pairs of SRA files from paired-end Illumina sequencing of E. coli K-12 strain MG1655 (the common lab strain) and two files of Pacific Biosciences long reads. There are two different types of Pac Bio reads. The *CLR.fastq includes very long reads, which tend to be highly error prone. The *CCS.fastq file contains reads derived from Pac Bio’s continuous circular amplification process, which samples a sequence multiple times in the same sequencing run. Both kinds of reads could be added to an assembly of the Illumina data to help reduce the number of contigs and close the genome.

Use SPAdes to assemble the E. coli data with and without Pac Bio sequence. A complete and closed reference sequence for this genome is available in the NCBI Genomes database, so you can evaluate your results against a very well-known reference. Files describing the reference (NC_000913 files) are found in the dropbox.

  • In SPAdes: there is a special command line flag to add pacbio sequences to your assembly. The manual describes more than one way to handle them. The manual is concise and easy to follow, so you should take a look at it and design appropriate command lines.

Move results from the cluster to your own computer

To create a tar file (so you can move all your files at once) type:

tar -cvf archivename.tar filenames

To transfer files from mamba to your own computer, use sftp to connect, and the command:

sftp> get archivename.tar

Analyze assembly quality with QUAST

You can use the QUAST to analyze the quality of your assemblies and to compare to a reference genome or to other assemblies. Like SPAdes, QUAST has a lot of options, and it is actually a python pipeline that hides a lot of action. It’s easy to install on your own computer or a lab Mac, and comes with built in validation data. QUAST actually runs MUMmer and other analyses and produces a nice graphical report. This is an example of what a QUAST command line will look like.

~/Applications/quast-2.3/ -o JY1305quast -t 500 –est-ref-size 5100000 –gage JY1305spades/contigs.fasta

This produces a number of files, including an interactive HTML report that you can open in your browser.

For your data set, you’ll need to run a slightly different command line. You will need to specify:

  • a SPAdes contigs file as input
  • a reference genome (FASTA) file
  • a reference annotation (GFF) file
  • a contig length filter — remember we’ve got long-ish reads of 300 nt, what size should you use?
  • an output directory to write results to

Your lab report should include a description of your starting data, the methods you used to approach it (including any parameters that deviate from default or anywhere a choice needed to be made), and your results for each assembly, with discussion of what worked and did not work. Present a comparison of the different combinations of input data that you tried. The assigned GAGE-B paper is excellent for inspiration if you are wondering how to present results.

Comments are closed.