Lab 1: Sequence QC And Assembly
The sequence data files for the exercise are preloaded on the hpc-student cluster in directories /projects/class/yoursection. Students registered for 8203 have a different directory. It contains several files of sequencing reads that you can attempt to use together or separately to create an assembly.
- ERR008613 (a set of paired end Illumina sequence reads from ends of 200bp E. coli fragments)
- ERR022075 (a set of paired end Illumina sequence reads from ends of 600bp E. coli fragments)
- sets of PacBio CCS and CLR reads for E. coli
In the first part of the exercise, you will trim and assemble the 200bp fragment Illumina reads alone to see what kind of assembly you get for the E. coli genome, and to learn the basics of SPAdes.
In the second part, you will review the SPAdes instructions and figure out how to progressively introduce long-fragment and long-read data to your E. coli assembly, to see how the addition of these data types changes the assembly results.
You will use QUAST to evaluate parameters of the assembly, such as contig N50, number of contigs, and comparisons to the reference genome.
Short show-and-tell videos for each program are available on the Youtube playlist.
View the reads with FastQC
Last week you should have downloaded FastQC and used it to view the Ion Torrent sequence dataset in the small file I posted on Canvas. Now use FastQC to examine the sequence reads for this week’s exercise. For your lab report, make note of any potential problems you see with the reads. You can use figures from FastQC as illustrations.
Take a look at your ERR008613 reads using FastQC. What problems can you identify with these reads?
Trim reads with trimmomatic
Trimmomatic is available as a module on the hpc-student cluster. It is a commonly used Illumina read cleaning program from the Usadel Lab. It is a java executable (*.jar) file. The full command line options are listed in detail on the github site. I went through some of the most important ones in the video.
REMINDER: You can’t just run the jobs on Centaurus with a direct command line command. You need to submit your job with a simple slurm script following the instructions here.
As an example, here is how you would format a command line to trim paired-end reads only to remove adapters as I showed in the video. You should look at the options and figure out how to use at least a sliding window quality trim, and try some of the other options to see what their effect on your data is.
java -jar $TRIM/trimmomatic-0.39.jar PE ERR022075_1.fastq ERR022075_2.fastq ERR022075_1_paired.fq ERR022075_1_unpaired.fq ERR022075_2_paired.fq ERR022075_2_unpaired.fq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10
Assemble the genome with Spades using the 200bp fragment reads
Next we’ll load the Spades module on the cluster and run an assembly.
Once you load the module, you can type:
spades.py
To see all of the command line help available with the program. Type:
spades.py --test
This verifies that the assembler is working correctly.
SPAdes can take a pretty complicated command line with a lot of different types of reads all combined together, and it wants a directory name for the output (-o flag). You can use spades.py –help to see all the options or look at the manual page.
Here’s what a SPAdes command line would look like if I were assembling just one set of forward and reverse read data, including the single-end reads that got orphaned in the trimming process. I concatenated the single end files out of Trimmomatic into one file:
spades.py --pe1-1 ERR008613_1_paired.fq --pe1-2 ERR008613_2_paired.fq --se ERR008613_unpaired.fq -o illumina_only
Don’t copy these exactly — remember you have to use your actual filenames and choose your own options!
SPAdes will run relatively quickly on a single set of paired end reads, so you can test out at least two sets of parameters and compare the results. For instance, you can test to see what the difference in your outcome is if you use the –careful option or not. Or you can choose to run –only-assembler and not the built-in read error correction, or see what happens if you use the raw data with and without QC.
Assemble the e coli genome
In the class directory there are several files, including two pairs of SRA files from paired-end Illumina sequencing of E. coli K-12 strain MG1655 (the common lab strain) and two files of Pacific Biosciences long reads. There are two different types of Pac Bio reads. The *CLR.fastq includes very long reads, which tend to be highly error prone. The *CCS.fastq file contains reads derived from Pac Bio’s continuous circular amplification process, which samples a sequence multiple times in the same sequencing run. Both kinds of reads could potentially be added to an assembly of the Illumina data to help reduce the number of contigs and close the genome, although the Spades authors do not really recommend using the CLR reads.
Use SPAdes to assemble the E. coli data with and without Pac Bio sequence. You can try this with just one of the two libraries, or with both, and you could try assembling the PacBio CCS reads on their own. A complete and closed reference sequence for this genome is available in the NCBI Genomes database, so you can evaluate your results against a very well-known reference.
- In SPAdes: there is a special command line flag –pacbio to add pacbio sequences to your assembly. The manual describes more than one way to handle them. The manual is easy to follow, so you should take a look at it and design appropriate command lines.
Analyze assembly quality with QUAST
You can use the QUAST applicaton to analyze the quality of your assemblies and to compare to a reference genome or to other assemblies. It also exists on the cluster as a module. QUAST actually runs MUMmer and other analyses and produces a nice graphical report. This is an example of what a QUAST command line would look like. QUAST output is HTML, and so you will need to move it to your own computer to view the output file.
quast.py -o JY1305quast -t 500 --est-ref-size 5100000 --gage JY1305spades/contigs.fasta
This produces a number of files, including an interactive HTML report that you can open in your browser.
For your data set, take a look at the help options and/or the manual and figure out how to add a reference genome file. You will need to specify:
- a SPAdes contigs file as input
- a reference genome (FASTA) file
- a reference annotation (GFF) file
- a contig length filter
- an output directory to write results to
Files describing the reference can be downloaded from NCBI here. You will need both the FASTA and the GFF3 files for this genome.
Your lab report should include a description of your starting data, the methods you used to approach it (including any parameters that deviate from default or anywhere a choice needed to be made), and your results for each assembly, with discussion of what worked and did not work. Present a comparison of the different combinations of input data that you tried. .