Lab 3 part 2, my solution

Lab 3 part 2, my solution

Here’s how I approached the E. coli plus Pac Bio data at the command line. This problem was a little sneaky, so don’t worry if you didn’t figure out all of the aspects right away. If you made different choices, you can compare them to my results here as well as to each other. The moral of the story is — always delve into the data you’re handed and find out as much about it as possible!

First I looked at the Illumina data to see what it actually was.

  • ERR008613 was from a random genomic PE library from a 200 bp fragment fraction
  • ERR022075 was from a random genomic PE library from a 600 bp fragment fraction

What immediately comes to mind when you see that?  Hey, let’s use these together, right?

FastQC shows that the read length distributions are right around 100, where they should be.

First, I unpacked the SRA files into fastqs. I split both files into 1 and 2 reads (forward and reverse in a standard genomic sequence library design).

fastq-dump --split-files ~/Dropbox/6203-UNSHARED/Ecoli+PacBio/ERR008613.sra
fastq-dump --split-files ~/Dropbox/6203-UNSHARED/Ecoli+PacBio/ERR022075.sra

Then, I downsampled my data using seqtk. Without even getting into fancy coverage statistics, 28 million x 100nt reads / 4.64 mb looks like massive overkill for this genome, about 600x the size of the target sequence. The quality of sequence for the first data set is moderate and the second is high. So 20% of that data should be plenty even if we were trying to assemble this genome from just one library. Since we have two different libraries and some Pac Bio data at our disposal, we could probably get away with sampling down to just 10%. I’m running this on my little laptop, so that’s what I did.

seqtk sample -s 101 ERR008613_1.fastq 0.1 > ERR008613sample_1.fastq
seqtk sample -s 101 ERR008613_2.fastq 0.1 > ERR008613sample_2.fastq
seqtk sample -s 101 ERR022075_2.fastq 0.1 > ERR022075sample_2.fastq
seqtk sample -s 101 ERR022075_1.fastq 0.1 > ERR022075sample_1.fastq

Then I trimmed each read sample with Trimmomatic:

java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE ERR008613sample_1.fastq ERR008613sample_2.fastq ecoli200_fwd_paired.fastq ecoli200_fwd_unpaired.fastq ecoli200_rvs_paired.fastq ecoli200_rvs_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 SLIDINGWINDOW:75:30 MINLEN:75
java -jar ~/Applications/Trimmomatic-0.32/trimmomatic-0.32.jar PE ERR022075sample_1.fastq ERR022075sample_2.fastq ecoli600_fwd_paired.fastq ecoli600_fwd_unpaired.fastq ecoli600_rvs_paired.fastq ecoli600_rvs_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:15 SLIDINGWINDOW:75:30 MINLEN:75

And combined the unpaired reads into one file for each library:

cat ecoli200_fwd_unpaired.fastq ecoli200_rvs_unpaired.fastq > ecoli200_unpaired.fastq
cat ecoli600_fwd_unpaired.fastq ecoli600_rvs_unpaired.fastq > ecoli600_unpaired.fastq

Then I ran SPAdes, with just the Illumina reads, specifying the reads as coming from two different paired-end libraries. I ran this with the UNIX “time” command so I could see how long it really took:

time --pe1-1 ecoli200_fwd_paired.fastq --pe1-2 ecoli200_rvs_paired.fastq --pe1-s ecoli200_unpaired.fastq --pe2-1 ecoli600_fwd_paired.fastq --pe2-2 ecoli600_rvs_paired.fastq --pe2-s ecoli600_unpaired.fastq -o illumina-only

Here’s how long it took on my MacBook with its 16GB of RAM:

real    70m14.620s
user    70m40.289s
sys    1m12.920s

This gets the genome into 82 contigs, which, if you read the GAGE-B paper, is not too shabby for a prokaryotic genome with the amount of data we actually used.


Then I added the PacBio long reads (CLR reads) into the assembly. You CAN trim and correct these, but not with Trimmomatic. However, I just left them, and assumed that I would let the read correction during the SPAdes hybrid assembly process work its magic, for now.

time --pe1-1 ecoli200_fwd_paired.fastq --pe1-2 ecoli200_rvs_paired.fastq --pe1-s ecoli200_unpaired.fastq --pe2-1 ecoli600_fwd_paired.fastq --pe2-2 ecoli600_rvs_paired.fastq --pe2-s ecoli600_unpaired.fastq --pacbio Ecoli_MG1655_pacBioToCA/PacBioCLR/PacBio_10kb_CLR.fastq -o hybrid_clr

This did not take all that much longer than the Illumina-only assembly.

real    79m11.596s
user    79m11.773s
sys    1m34.071s

This gets the sequence into 19 scaffolds — much better than the earlier run.ecoli_contigs2

If I download the reference genome, and the reference GFF file and run genefinding on the new assemblies, with QUAST: -R NC_000913.fna -G NC_000913.gff -f illumina-only/scaffolds.fasta hybrid_clr/scaffolds.fasta -o quast

You can see the impact of adding the PacBio CLR reads on additional quality parameters, such as the number of genes recovered, or the existence of misassemblies.


Can you do better?

Seriously, if you get it down to one contig, 10 points to Gryffindor (or credit for your lowest-scoring quiz).

Here are some things I still wonder about:

  • Did I downsample too much? Would I have been better off including more Illumina data?
  • What happens if we do find a way to clean up the CLR reads before assembling?
  • If you look at the SPAdes manual, PacBio CCS reads can simply be added as a single end library (-single). How would this affect the assembly?
  • SPAdes can’t assemble the CLR-type reads, but it could in fact assemble the CCS reads (with a combination of longer kmer lengths as you do for Ion Torrent). How good is that data alone?
  • The SPAdes manual is remarkably unenlightening on what happens if you use two paired end libraries of different insert lengths together. What would happen if you treated the 600nt library as high-quality mate pairs in ‘fr’ orientation instead of as just another PE library?




Comments are closed.