Under the hood: V. vulnificus JY1305
In this case study, I attempt to improve on the draft assembly of a bacterial genome by combining the original and new data, and by using bioinformatics tools that have become available since we first assembled the genome. Along the way I test a variety of assembly and annotation options.
The data we have available for the genome of Vibrio vulnificus JY1305 is:
- Roche 454 Titanium single end reads
- Illumina paired end reads (one lane)
- Transcriptomes of JY1305 in human serum and artificial seawater
There are many program options and input parameters that can be considered in assembly, and because each run takes a substantial amount of time, we can’t consider all of them. The factors I chose to examine were:
- Effect of trim parameters on the low quality long read data set
- Effect of digital normalization on the Illumina data set
- Effect of the “careful” flag in SPAdes assembly on each data set
- Effect of different QC of the input on digital normalization outcomes
- Effect of including the long-read contigs as either “trusted” or “untrusted” contigs in the Illumina assembly
- Effect of adding Trinity-assembled transcripts as either “trusted” or “untrusted” contigs in the Illumina assembly (or in the Illumina/454 hybrid assembly, as the case may be)
Early on, I settled on the SPAdes assembler. It is extremely easy to install and use, and recommended settings are available for both Illumina and Ion Torrent data (the latter has an error profile similar to the 454 data I have). In a preliminary test on the largest and highest-quality dataset (the paired-end Illumina data), SPAdes also outperformed Velvet and the CLC Genomics assembler in terms of contig number, N50, contig size, and other basic quality metrics. SPAdes has been validated as a high-performing method in comparative studies of assembly performance as well.
The starting data set
At the time of original publication, we had only a small amount of data for Vibrio vulnificus JY1305: a single-end Roche 454 data set containing 671,521 reads. Since that time we have collected additional paired-end Illumina sequence data as well as transcriptome data from two relevant biological conditions. I wanted to find out whether including these would significantly improve the assembly.
Initially, the sequence data is cleaned with the FastX Toolkit.
Lenient trim:
fastx_trimmer -f 21 -l 450 -i JY1305-454.fastq -o JY1305-454.trim.fastq
fastq_quality_filter -q 22 -p 70 -i JY1305-454.trim.fastq -o JY1305-454.qual.fastq
The sequence is typical of older next-generation sequence in that the quality falls off quite steeply toward the nominal end of the read; this is still true after filtering.
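That falloff is easy to confirm with a per-base quality report from any QC tool. As a sketch (FastQC is not named in the original pipeline, so this is just one option), running it on the raw and filtered files side by side shows the trend:

mkdir 454-qc
fastqc -o 454-qc JY1305-454.fastq JY1305-454.qual.fastq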
Reads are trimmed to no longer than 450 nt, the nominal length produced by the 454 FLX Titanium platform on which the sequence was collected. Longer lengths are spurious because the reaction chemistry cannot produce reads longer than the number of flow cycles specified. The data is then filtered somewhat leniently for quality, leaving around 203,000 reads averaging 431 nt in length. That is enough sequence to cover the V. vulnificus JY1305 genome, thought to be similar in size to other closely related strains at about 5.13 Mb, approximately 17-fold.
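As a sanity check, those numbers can be recomputed directly from the filtered reads with a quick awk one-liner. This is just a sketch: it assumes the standard 4-line FASTQ layout and borrows the 5.13 Mb genome size estimate from above.

awk 'NR % 4 == 2 { bases += length($0); reads++ } END { printf "%d reads, %.0f nt mean length, %.1fx coverage\n", reads, bases / reads, bases / 5130000 }' JY1305-454.qual.fastq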
In our original study, the cleaned sequence was assembled with the Newbler and MIRA assemblers. Here I use the SPAdes assembler to get an up-to-date “baseline” for what kind of assembly can be produced from this data using current methods. I’ll use Ion Torrent settings, since the length and error profile of 454 sequence are similar; use (or not) the SPAdes authors’ recommended “careful” setting; and turn on SPAdes’ read error correction.
spades.py --iontorrent --careful -s JY1305-454.qual.fastq -o 454spades -k 21,33,55,77,99,127
This results in an assembly which is close to the total length we expect for the genome but consists of far more contigs than we’d like. The assemblies were analyzed with QUAST:
~/Applications/quast-2.3/quast.py -o 454quast 454spades/contigs.fasta
Trying a stricter trim
Does throwing away more data improve the assembly?
I could play with the parameters and cut more of each read initially, producing shorter reads that pass a stricter quality check (a higher percentage of the read above the quality threshold). In practice, though, that resulted in a larger number of substantially shorter contigs.
Strict trim:
fastx_trimmer -f 21 -l 400 -i JY1305-454.fastq -o JY1305-454.trim.fastq
fastq_quality_filter -q 22 -p 85 -i JY1305-454.trim.fastq -o JY1305-454.qual.fastq
Comparison of results (454 data)
From left to right (best to worst) the assemblies are lenient, lenient-careful, strict, and strict-careful. As you can see, for this dataset, the choice of a lenient trim (longer input reads) vs. a strict trim (shorter, cleaner input reads) makes a huge difference, but the careful setting makes no difference at all.
Another way we can potentially evaluate these assemblies is to attempt to annotate them. I ran each of the “careful” assemblies through myRAST to come up with a tentative annotation. We don’t have an exact reference genome to compare to, but we would expect a greater number of misassemblies to result in fewer identified genes.
RAST apparently splits contigs from the original assembly at some point during its pipeline because on completion it reports a number of contigs which is different (to varying degrees) from the original. This process, and what a “contig” therefore means in RAST, is not well-described — in either the 2008 RAST paper or in the 2014 SEED paper (see: why I hate black box tools even when they work). The 505 contigs of the better 454 assembly end up as over 1800 contigs following RAST’s mysterious misassembly-correction voodoo, while the 1019 contigs of the worse assembly end up as over 2300. Either way, it’s pretty clear that these assemblies aren’t great and that we should not trust them.
RAST doesn’t provide summary content and function comparisons of two assemblies (we’ll do that later in our own tool, GenoSets) but I can look at the sheer number of predicted features on this assembly (well over 6000) and see that it’s likely going to be in crazy-bad territory.
What new information is available?
Some time after the initial sequencing of V. vulnificus JY1305, we were able to obtain some Illumina paired-end sequence for the same genome. We were somewhat naive back then and just threw the genome on its own in a single lane, so we have massive sequence overkill: 60,279,524 reads of 101 nt, enough for roughly 1186x coverage of a genome whose closest closed and sequenced relatives are around 5.13 Mb. This much sequence is actually likely to be counterproductive in an assembly.
Even after trimming with Trimmomatic (removing residual adapters, clipping low-scoring leading and trailing bases, applying a sliding-window quality screen, and dropping short reads), we still have 60,017,822 reads in this data set. These were my original sequence cleaning parameters:
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE -phred33 100_1.fastq 100_2.fastq JY1305_forward_paired.fastq JY1305_forward_unpaired.fastq JY1305_reverse_paired.fastq JY1305_reverse_unpaired.fastq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This data is ridiculously, unnecessarily deep. We had previously handled this problem by simply assembling random subsets of the data, and my former PhD student, Shatavia Morrison, had verified that there wasn’t much to gain by using more than about 10% of the data, which gave us just over 100x coverage. After hearing a talk by Titus Brown, I wanted to try out digital normalization to see if that method could pick a better subset of the data than random picking gave us.
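For reference, this kind of random subsetting is easy to reproduce with seqtk (not necessarily the tool we used at the time, so treat this as a sketch); using the same seed on both files keeps the read pairs in sync:

seqtk sample -s 42 100_1.fastq 0.1 > sub100_1.fastq
seqtk sample -s 42 100_2.fastq 0.1 > sub100_2.fastq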
At first, diginorm basically ruined my data set: the quality of the normalized reads was well below the average for the dataset as a whole. I spent an aggravatingly long time figuring out why it wasn’t working. The answer was that I needed a different approach to pre-filtering, pre-emptively getting rid of the lowest-quality reads. I used the following sequence of commands:
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE 100_1.fastq 100_2.fastq JY1305_forward_paired.fastq JY1305_forward_unpaired.fastq JY1305_reverse_paired.fastq JY1305_reverse_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10
interleave-reads.py JY1305_forward_paired.fastq JY1305_reverse_paired.fastq -o JY1305_clip_pe.fastq
cat JY1305_forward_unpaired.fastq JY1305_reverse_unpaired.fastq > JY1305_clip_se.fastq
fastx_trimmer -v -l 85 -i JY1305_clip_se.fastq | fastq_quality_filter -Q64 -q 30 -p 80 -o JY1305_qc_se.fastq -v
fastx_trimmer -v -l 85 -i JY1305_clip_pe.fastq | fastq_quality_filter -Q64 -q 30 -p 80 -o JY1305_qc_pe.fastq -v
extract-paired-reads.py JY1305_qc_pe.fastq
Note that interleave-reads.py and extract-paired-reads.py are part of the khmer package and I need to be running them in khmerEnv. I went over how to deal with this package previously.
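For completeness, “running them in khmerEnv” just means activating the Python virtualenv that khmer was installed into. Assuming the environment lives at ~/khmerEnv, that looks like:

source ~/khmerEnv/bin/activate
# the khmer scripts (interleave-reads.py, extract-paired-reads.py, etc.) are now on the PATH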
What’s happening in that sequence?
- Clip the adapters with Trimmomatic
- Interleave the reads for the non pair-aware FASTX tools
- Do a length trim to get rid of the part of the reads that looks low quality IN THIS DATASET. I cut them off at 85 nt
- Do a quality filtering to get rid of reads that are low quality for this dataset. I didn’t keep anything unless at least 80% of its bases were at quality 30 or above. This strips out the reads that are going to look “unique” to the digital normalization process because they have a lot of errors, and as a result pull down the quality of the normalized output
- Get the reads paired up again with extract-paired-reads.py
Once this is done you’re ready for digital normalization and abundance filtering — there are separate steps for paired end and single end, and at the end you should pay attention and collect all of the single end reads into a single file.
normalize-by-median.py -p -k 20 -C 20 --n_tables 4 --min-tablesize 2e9 -o JY1305.pe.qc.fp.fastq --savetable pek20c20.kh JY1305.pe.qc.fastq
normalize-by-median.py -k 20 -C 20 --n_tables 4 --min-tablesize 2e9 -o JY1305.se.qc.fp.fastq --loadtable pek20c20.kh --savetable pek20c20-2.kh JY1305.se.qc.fastq
filter-abund.py -C 2 pek20c20-2.kh JY1305.pe.qc.fp.fastq -o JY1305.pe.qc.fp.af.fastq
filter-abund.py -C 2 pek20c20-2.kh JY1305.se.qc.fp.fastq -o JY1305.se.qc.fp.af.fastq
extract-paired-reads.py JY1305.pe.qc.fp.af.fastq
Note how the k-mer tables are saved and reused: the table built during the paired-end pass is loaded and extended during the single-end pass, and the final table then drives abundance filtering of both read sets.
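One step worth making explicit: extract-paired-reads.py writes .pe and .se files alongside its input, and the orphaned reads in the .se file should be folded back into the single-end set before assembly. Assuming khmer’s default output names, that collection step looks something like this:

cat JY1305.se.qc.fp.af.fastq JY1305.pe.qc.fp.af.fastq.se > JY1305.se.qc.fp.af.all.fastq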
Does digital normalization matter?
It definitely matters in terms of time. The un-normalized data took over a day to run on the same computer where the normalized data runs in a couple of hours.
It matters a little in terms of quality. The largest contig doesn’t really change but the bottom end gets cleaned up somewhat. Slightly fewer contigs, slightly less assembled sequence.
We’ll take a look at the impact on annotation in a future post.
Does hybrid assembly help?
There are several ways that long reads or contigs from a separate assembly process can be used in SPAdes.
- We can assemble the Illumina paired end data and add the cleaned 454 data as long reads
- We can assemble the Illumina paired end data and add the assembled 454 data as contigs (either trusted, or untrusted)
- We can also include the assembled (using Trinity) transcriptome data as trusted or untrusted contigs
There are a large number of potential permutations of these options. In the figure below:
JY1305nohyb is the assembly with 454 data treated as long reads (“sanger”) and the Trinity assembly of the transcriptome (which will be discussed in another post) treated as untrusted contigs:
/usr/local/bin/spades.py --12 JY1305.pe.qc.fp.af.keep.fastq -s JY1305.se.qc.fp.af.keep.fastq --sanger JY1305-454.fastq --untrusted-contigs ../JY1305-transcript/trinity_out_dir/Trinity.fasta -o JY1305nohyb-spades
JY1305dnhyb is the assembly with 454 contigs treated as trusted contigs:
/usr/local/bin/spades.py -o JY1305dnhyb-spades --12 JY1305.pe.qc.fp.af.keep.fastq -s JY1305.se.qc.fp.af.keep.fastq --careful --trusted-contigs ../454-lenient/care-454spades/scaffolds.fasta
JY1305wtrans treats the Trinity data as trusted contigs and the assembled 454 as untrusted, and JY1305wtrans2 is the reverse:
/usr/local/bin/spades.py --12 JY1305.pe.qc.fp.af.keep.fastq -s JY1305.se.qc.fp.af.keep.fastq --trusted-contigs ../JY1305-transcript/trinity_out_dir/Trinity.fasta --untrusted-contigs ../454-lenient/care-454spades/contigs.fasta -o JY1305withtrans
How much of a difference does it make?
Of the four permutations tried, the first, using the unassembled 454 reads as long reads and the Trinity transcriptome contigs as untrusted contigs, produced the best assembly by standard measures.
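As with the 454-only assemblies, those standard measures come from QUAST, which can compare all four contig sets in one run. (The output directory for the JY1305wtrans2 variant isn’t shown above, so JY1305wtrans2-spades below is a placeholder name.)

~/Applications/quast-2.3/quast.py -o hybquast JY1305nohyb-spades/contigs.fasta JY1305dnhyb-spades/contigs.fasta JY1305withtrans/contigs.fasta JY1305wtrans2-spades/contigs.fasta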
However, looking more closely at the scenarios where both the assembled 454 data and the assembled transcripts are used, we can see that their presence joins one large contig that wasn’t joined before, which perhaps explains some of the difference in N50. In a future post, we’ll look at how each of these assemblies annotates with RAST.