BINF 6215: Taming Illumina datasets
As you’re uploading files to Galaxy or waiting for jobs to finish, you may be wondering “how can I make this go faster?” Scientists don’t like to throw out data, but Next Gen Sequencing generates enormous datasets, and you don’t necessarily want to use full-sized data just for learning a process. So here are a couple of tips for reducing the size and complexity of datasets, especially when you’re practicing on them. Subsampling randomly samples a dataset to reduce its size; digital normalization is a more intelligent method that prioritizes and keeps your best reads.
Data subsampling tools
Sometimes when you’re learning or testing a process it’s nice to have a smaller dataset that will process in a short amount of time. To turn paired-end Illumina datasets into smaller datasets, you can’t really just cut the file off after a certain number of lines or do anything similarly crude; you need to sample randomly out of the dataset. If you do that, you can expect to get a dataset that reflects the overall properties of your main dataset, just at a lower depth. This isn’t necessarily the best way to reduce the size of a dataset; there are normalization tools out there that will shrink a dataset intelligently, preserving more of the signal and less of the noise. But it’s also not an unjustifiable choice if you want to get a look at the general shape of a dataset without processing the whole thing, or if you want to test a pipeline.
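The idea can be sketched with plain awk: keep each four-line FASTQ record with some probability, using a fixed random seed so the draw is reproducible. This is only an illustration of what a sampler like seqtk does conceptually; the toy file, the 0.2 fraction, and the seed are invented for the example.

```shell
# Build a tiny toy FASTQ file (8 records) to sample from.
for i in 1 2 3 4 5 6 7 8; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done > toy.fastq

# Keep each 4-line record with probability 0.2; srand(100) fixes the seed,
# so repeating the command reproduces the same subsample.
awk -v seed=100 -v frac=0.2 '
  BEGIN { srand(seed) }
  NR % 4 == 1 { keep = (rand() < frac) }
  keep' toy.fastq > toy-sampled.fastq
```

Because the seed is fixed, running the awk command twice yields byte-identical output, which is exactly the property that keeps paired files in sync.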
A tool you can use to sample out of your data is seqtk. It installs seamlessly on OSX. To get it, type:
git clone https://github.com/lh3/seqtk.git
cd seqtk; make
In the Dropbox that I shared with you containing the V. vulnificus JY1305 data, there are files called JY1305sample_1.fastq and JY1305sample_2.fastq. I created them by using seqtk to sample 20% of the original data.
~/Applications/seqtk/seqtk sample -s100 100_1.fastq 0.2 > JY1305sample_1.fastq
~/Applications/seqtk/seqtk sample -s100 100_2.fastq 0.2 > JY1305sample_2.fastq
In the seqtk command line, the -s argument is followed by the random number seed. It is critical that if you are sampling two files which represent paired reads, you use the same random number seed on both, so that the same lines in the file are sampled. Warning — this is pretty crude, and it may not work if you have done other cleaning to your files. But if you take them right as they come from the sequencer, you’ll get appropriately matched read pairs. You can verify this by looking at the head and tail of the file and making sure the corresponding description lines match up.
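That head-and-tail check can be automated. Here is a minimal sketch using two tiny made-up paired files (the read names and sequences are invented): pull out every fourth line (the description lines), strip the /1 or /2 mate suffix, and diff the two ID lists. No diff output means the pairs still line up.

```shell
# Two tiny made-up paired files standing in for the sampled _1/_2 files.
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGC\n+\nIIII\n' > sample_1.fastq
printf '@read1/2\nTGCA\n+\nIIII\n@read2/2\nGCAA\n+\nIIII\n' > sample_2.fastq

# Every 4th line starting from line 1 is a description line; strip the
# /1 or /2 mate suffix and compare the two ID lists.
awk 'NR % 4 == 1' sample_1.fastq | sed 's#/[12]$##' > ids_1.txt
awk 'NR % 4 == 1' sample_2.fastq | sed 's#/[12]$##' > ids_2.txt
diff ids_1.txt ids_2.txt && echo "pairs match"
```

If the files have fallen out of sync, diff prints the mismatched IDs and the final echo never runs.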
There are many other sampling tools and scripts out there, but this one installs cleanly on the Mac and seems to work as advertised.
Another way to reduce the size of your files sensibly is to perform digital normalization, using Titus Brown’s khmer package. The philosophy of khmer is, essentially, to take highly redundant datasets (like Illumina datasets) and simplify the de Bruijn graph, pre-assembly. That makes these big datasets smaller in memory and more assemble-able.
The install is a little different from most of the UNIX packages we’ve touched on, because it’s a Python package and execution environment; if you install the virtual environment correctly and activate it, everything’s pretty straightforward to use. The virtual environment is essentially another layer of command-line interpreter on top of your shell.
Don’t necessarily just Google khmer to figure out how to set this up, because there seem to be conflicting versions of the documentation out there. Titus Brown’s most up-to-date tutorial is here. Here are the actual steps for Mac OS X.
1. Get the virtualenv package and extract the archive in your Applications directory (your personal one if you are working on a shared computer);
2. Set up the virtual environment;
cd virtualenv-*; python2.7 virtualenv.py ../khmerEnv; cd ..
3. Start the virtual environment;
source khmerEnv/bin/activate
4. Install khmer and its dependencies while in the active virtual environment (with the environment active, pip installs into khmerEnv, so no sudo is needed);
pip install khmer
5. Reactivate the virtual environment (you don’t have to quit out of it first);
source khmerEnv/bin/activate
Now you should be able to run any of the khmer python scripts. They are located in khmerEnv/bin, if you want to list that out and take a look at what the components are. The khmerEnv directory will end up installed wherever you put virtualenv. So if you have that under ~/Applications, then khmerEnv will be found directly under ~/Applications as well.
khmer, like the Velvet assembler, expects paired-end data to be interleaved. So before you apply the normalizations, you’ll have to interleave, and before you interleave, you’d really want to do any data cleaning you were going to do, perhaps with a program like Trimmomatic (which is installed on the lab computers). Assuming you did that, you can then interleave your data to prep for normalization:
interleave-reads.py 100_1.fastq 100_2.fastq -o JY1305-interleaved.fastq
This leaves you with a file that has the forward (1) and reverse (2) reads from the same spot intercalated with each other.
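To see what that layout looks like without running khmer, here is a throwaway awk sketch that interleaves two tiny invented paired files the same way (on real data you would still use interleave-reads.py): walk both files in lockstep, one four-line record at a time, and emit the forward record followed by its reverse mate.

```shell
# Two miniature paired files (names and sequences are invented).
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGC\n+\nIIII\n' > r1.fastq
printf '@read1/2\nTGCA\n+\nIIII\n@read2/2\nGCAA\n+\nIIII\n' > r2.fastq

# Read r1 line by line; getline pulls the matching line from r2. Every
# 4 lines (one record), print the forward record, then its mate.
awk '{
  getline l2 < "r2.fastq"
  buf1 = buf1 $0 "\n"; buf2 = buf2 l2 "\n"
  if (NR % 4 == 0) { printf "%s%s", buf1, buf2; buf1 = ""; buf2 = "" }
}' r1.fastq > mini-interleaved.fastq
```

The result alternates records: @read1/1, @read1/2, @read2/1, @read2/2, which is the ordering the normalization scripts expect from interleaved input.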
And now you can do some normalization. If you want to read about the method, the diginorm paper is here. For genome assembly, the recommended protocol is to normalize and remove errors from the data in three passes.
First pass: normalize-by-median to C (kmer coverage) = 20:
normalize-by-median.py -p -k 20 -C 20 --n_tables 4 --min-tablesize 2e9 -o JY1305-firstpass.fastq JY1305-interleaved.fastq
diginorm is organizing your data based on kmers of size 20 (k), and is cutting off kmer coverage at 20 (C). We have specified that the data is paired end (p). Number of tables and min-tablesize specify how much RAM the program can use to carry out its kmer sorting shenanigans. In practice, if you see your retained data dwindling rapidly down towards 0%, you have probably set this too small. Embiggen it as much as your computer will allow. My computer has 16GB of RAM, so I set it up with 4 2GB tables.
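As a sanity check on those numbers, assuming (as the khmer docs describe) roughly one byte per counter, so each table occupies about min-tablesize bytes:

```shell
# Rough memory footprint: n_tables x min-tablesize bytes.
n_tables=4
min_tablesize=2000000000   # the 2e9 used in the command above
total_bytes=$(( n_tables * min_tablesize ))
echo "approx RAM needed: $(( total_bytes / 1000000000 )) GB"
```

Four 2 GB tables come to about 8 GB, which fits comfortably in 16 GB of RAM; if your machine has less, shrink min-tablesize (and expect to retain fewer reads).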
Create a k-mer counting table (the script’s help text asks for a presence table, but that appears to be incorrect: the protocol calls for generating a counting table, and the next step rejects a presence table as input):
load-into-counting.py -k 20 -x 2e9 JY1305table.kh JY1305-firstpass.fastq
Second pass: filter-abund with cutoff C=1 to remove the lowest-abundance kmers:
filter-abund.py -k 20 --n_tables 4 --min-tablesize 1e9 -C 1 -o JY1305-secondpass.fastq JY1305table.kh JY1305-firstpass.fastq
Separate out any orphaned reads left by abundance filtering; khmer’s extract-paired-reads.py splits the file into still-paired reads (.pe) and orphaned singletons (.se), which the third pass below handles separately:
extract-paired-reads.py JY1305-secondpass.fastq
Third pass: normalize-by-median again to C = 5:
normalize-by-median.py -p -k 20 -C 5 --n_tables 4 --min-tablesize 1e9 -o JY1305-thirdpass.pe.fastq JY1305-secondpass.fastq.pe
normalize-by-median.py -k 20 -C 5 --n_tables 4 --min-tablesize 1e9 -o JY1305-thirdpass.se.fastq JY1305-secondpass.fastq.se
This all *appears* to work, and takes us from something like 17 GB of data down to just under half a GB. The real test is in how it assembles, though.
Here’s what the original data looks like mapped to a similar genome, V. vulnificus CMCP6:
And here’s what the CLC Genomics assembly of that data looks like:
Here’s what the digitally normalized data looks like mapped to V. v CMCP6:
And here’s what the assembly looks like:
So what does this mean? Maximum sequence depths (usually the result of big local spikes) are reduced from thousands of reads to hundreds. N50 doubled, the number of contigs was cut nearly in half, and the largest contig is six times the size of the previous one, and this is with default parameters in CLC. We can most likely do quite a bit better in Velvet, and we might still be able to tweak some normalization parameters and improve the outcome. The take-home message, though: for genome assembly and transcript assembly problems, try digital normalization.