Lab 3 part 2, my solution
Here’s how I approached the E. coli plus Pac Bio data at the command line. This problem was a little sneaky, so don’t worry if you didn’t figure out all of the aspects right away. If you made different choices, you can compare them to my results here as well as to each other. The moral of the story is — always delve into the data you’re handed and find out as much about it as possible!
First I looked at the Illumina data to see what it actually was.
- ERR008613 was from a random genomic PE library from a 200 bp fragment fraction
- ERR022075 was from a random genomic PE library from a 600 bp fragment fraction
What immediately comes to mind when you see that? Hey, let’s use these together, right?
FastQC shows that the read length distributions are right around 100, where they should be.
First, I unpacked the SRA files into fastqs. I split both files into 1 and 2 reads (forward and reverse in a standard genomic sequence library design).
fastq-dump --split-files ~/Dropbox/6203-UNSHARED/Ecoli+PacBio/ERR008613.sra fastq-dump --split-files ~/Dropbox/6203-UNSHARED/Ecoli+PacBio/ERR022075.sra
Then, I downsampled my data using seqtk. Without even getting into fancy coverage statistics, 28 million x 100nt reads / 4.64 mb looks like massive overkill for this genome, about 600x the size of the target sequence. The quality of sequence for the first data set is moderate and the second is high. So 20% of that data should be plenty even if we were trying to assemble this genome from just one library. Since we have two different libraries and some Pac Bio data at our disposal, we could probably get away with sampling down to just 10%. I’m running this on my little laptop, so that’s what I did.
seqtk sample -s 101 ERR008613_1.fastq 0.1 > ERR008613sample_1.fastq seqtk sample -s 101 ERR008613_2.fastq 0.1 > ERR008613sample_2.fastq seqtk sample -s 101 ERR022075_2.fastq 0.1 > ERR022075sample_2.fastq seqtk sample -s 101 ERR022075_1.fastq 0.1 > ERR022075sample_1.fastq
Then I trimmed each read sample with Trimmomatic:
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE ERR008613sample_1.fastq ERR008613sample_2.fastq ecoli200_fwd_paired.fastq ecoli200_fwd_unpaired.fastq ecoli200_rvs_paired.fastq ecoli200_rvs_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 SLIDINGWINDOW:75:30 MINLEN:75 java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE ERR022075sample_1.fastq ERR022075sample_2.fastq ecoli600_fwd_paired.fastq ecoli600_fwd_unpaired.fastq ecoli600_rvs_paired.fastq ecoli600_rvs_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 SLIDINGWINDOW:75:30 MINLEN:75
And combined the unpaired reads into one file for each library:
cat ecoli200_fwd_unpaired.fastq ecoli200_rvs_unpaired.fastq > ecoli200_unpaired.fastq cat ecoli600_fwd_unpaired.fastq ecoli600_rvs_unpaired.fastq > ecoli600_unpaired.fastq
Then I ran SPAdes, with just the Illumina reads, specifying the reads as coming from two different paired-end libraries. I ran this with the UNIX “time” command so I could see how long it really took:
time spades.py --pe1-1 ecoli200_fwd_paired.fastq --pe1-2 ecoli200_rvs_paired.fastq --pe1-s ecoli200_unpaired.fastq --pe2-1 ecoli600_fwd_paired.fastq --pe2-2 ecoli600_rvs_paired.fastq --pe2-s ecoli600_unpaired.fastq -o illumina-only
Here’s how long it took on my MacBook with its 16GB of RAM:
real 70m14.620s
user 70m40.289s
sys 1m12.920s
This gets the genome into 82 contigs, which, if you read the GAGE-B paper, is not too shabby for a prokaryotic genome with the amount of data we actually used.
Then I added the PacBio long reads (CLR reads) into the assembly. You CAN trim and correct these, but not with Trimmomatic. However, I just left them, and assumed that I would let the read correction during the SPAdes hybrid assembly process work its magic, for now.
time spades.py --pe1-1 ecoli200_fwd_paired.fastq --pe1-2 ecoli200_rvs_paired.fastq --pe1-s ecoli200_unpaired.fastq --pe2-1 ecoli600_fwd_paired.fastq --pe2-2 ecoli600_rvs_paired.fastq --pe2-s ecoli600_unpaired.fastq --pacbio Ecoli_MG1655_pacBioToCA/PacBioCLR/PacBio_10kb_CLR.fastq -o hybrid_clr
This did not take all that much longer than the Illumina-only assembly.
real 79m11.596s
user 79m11.773s
sys 1m34.071s
This gets the sequence into 19 scaffolds — much better than the earlier run.
If I download the reference genome, and the reference GFF file and run genefinding on the new assemblies, with QUAST:
quast.py -R NC_000913.fna -G NC_000913.gff -f illumina-only/scaffolds.fasta hybrid_clr/scaffolds.fasta -o quast
You can see the impact of adding the PacBio CLR reads on additional quality parameters, such as the number of genes recovered, or the existence of misassemblies.
Can you do better?
Seriously, if you get it down to one contig, 10 points to Gryffindor (or credit for your lowest-scoring quiz).
Here are some things I still wonder about:
- Did I downsample too much? Would I have been better off including more Illumina data?
- What happens if we do find a way to clean up the CLR reads before assembling?
- If you look at the SPAdes manual, PacBio CCS reads can simply be added as a single end library (-single). How would this affect the assembly?
- SPAdes can’t assemble the CLR-type reads, but it could in fact assemble the CCS reads (with a combination of longer kmer lengths as you do for Ion Torrent). How good is that data alone?
- The SPAdes manual is remarkably unenlightening on what happens if you use two paired end libraries of different insert lengths together. What would happen if you treated the 600nt library as high-quality mate pairs in ‘fr’ orientation instead of as just another PE library?