BINF 6203: Tips and tricks for the Midterm
I have asked you to do two different kinds of assembly on the Midterm. Reference assembly, and de novo assembly. CLC Genomics has good tutorials for each kind of assembly, so I am not going to rehash those here. Main points:
- The Comprehensive Mapping Report tool will produce a useful description of your reference-based assembly, and even a graph showing where reads are deep or sparse.
- You can use Extract Consensus Sequence to get a consensus sequence for your newly assembled reads.
But how do you analyze your de novo assembly? This can be a little more tricky. For your contigs from the ab initio assembly, here’s one way to approach it. First, examine your contigs. You probably want to filter them at a slightly more stringent level before you try to compare them to the reference. You can justify different ways of filtering.
- For instance, using the default assembly parameters, I see that quite a few of my contigs are made up of only 2 reads. Maybe I should sort them by number of reads, and get rid of the contigs that are made up of less than 10 reads. That will get rid of a lot of contigs, and probably some that are spurious because there was a bubble that wasn’t resolved.
- Maybe I should sort them by average coverage, and only save a subset of the contigs where the average coverage is over 6, since 6-10x coverage is really what we’re shooting for in an assembly.
- I also see that lots of my contigs are short. Maybe I should sort them by length and only keep the ones that are over 1,000, since that is about 1% of the expected size of the chloroplast genome.
Once you’ve decided how to filter your contigs, you can select them in the contig list and “extract a subset” using the buttons at the bottom of the main window. That creates a new object in the CLC Navigator that contains only the extracted contigs.
Then you’re left with, well, how do I compare the contigs to the reference? You can use BLAST within CLC — turn your subset of contigs into a BLAST database, get your genome reference sequence from GenBank, and BLAST your genome reference sequence against the database of contigs. The BLAST tool isn’t optimized for long sequence alignment, but it gives you a reasonable idea of where things match up. In the BLAST results, you will be able to find sequence ranges within the reference that align to a contig, and you should be able to summarize those pretty easily, especially if you filtered your contigs down to get rid of the noise at the bottom.
If you want to be fancy, you can export your selected sequences out of CLC and use MUMmer to align and plot them. The alignment method in nucmer is optimized for longer sequences, and the show-tiling tool will summarize your contigs for you, together in one figure that summarizes what’s going on. As far as I can tell, CLC does not have this same functionality in the Genomics module, although it may be available in the finishing module. Usage information for MUMmer.
I was able to install MUMmer on my MacBook Pro running OSX 10.8.5 without making any changes. Getting it to plot was a slightly different story. But the pathway to basic command line/text mode functionality is outlined below.
First, if you don’t have it already, get XCode from the App Store. Then, go into the Applications folder, double click the XCode app, and run it to get updates (this is a good idea to do periodically anyway). And while you’re at it, at a command line (you need superuser for this of course) type:
sudo xcodebuild -license
Just accepting the license when you install the XCode package apparently doesn’t have the same effect. I know, right?
Next, just download the current version of MUMmer, put it in a convenient place like your Applications folder, and type:
sudo make install
You will now have executables. As an example, I ran MUMmer with two input files. One is the chloroplast reference sequence in fasta format, the other is an exported file from CLC containing my selected group of contigs. Exported as a FASTA format file, of course. I did some fairly brutal filtering on the example I ran — you can filter your contigs a little or not at all, and MUMmer will still produce a really convenient summary of what parts of the genome you covered at the end. CLC unfortunately does not show this summary except if we spend many extra dollars to buy their Finishing Module.
First I’m going to align them:
~/Applications/MUMmer3.23/nucmer tomatoref.fasta contigs.fasta
Note: name your contigs file something that does not have a bunch of spaces in the filename. A long filename with spaces will cause MUMmer to fail. You’ll get an output called out.delta.
Then you can use the show-coords program to produce a table containing the coordinates of where the contigs map to the reference:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS] ===================================================================================== 54712 66102 | 11390 1 | 11391 11390 | 99.99 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_1 8460 40461 | 32009 13 | 32002 31997 | 99.98 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_2 1 6453 | 6450 1 | 6453 6450 | 99.91 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_4 85876 94763 | 6450 15336 | 8888 8887 | 99.99 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_4 146575 155461 | 15336 6451 | 8887 8886 | 99.99 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_4 94746 94959 | 1 214 | 214 214 | 98.13 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_5 105781 106049 | 911 1179 | 269 269 | 97.03 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_5 135289 135557 | 1179 911 | 269 269 | 97.03 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_5 146379 146592 | 214 1 | 214 214 | 98.13 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_5 100070 100179 | 15655 15760 | 110 106 | 95.45 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_6 106032 121727 | 1 15695 | 15696 15695 | 99.91 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_6 129851 135306 | 5456 1 | 5456 5456 | 100.00 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_6 141159 141268 | 15760 15655 | 110 106 | 95.45 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_6 94746 106154 | 1 11408 | 11409 11408 | 99.99 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_9 135184 146592 | 11408 1 | 11409 11408 | 99.99 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_9 6559 8351 | 1 1771 | 1793 1771 | 98.77 | gi|544163592|ref|NC_007898.3| mcf-test_(single)_trimmed_contig_11
You can see from this which regions of the reference have a corresponding contig, which contigs don’t actually align to the reference, etc. For a more compressed view, you can use show-tiling to show how the maximum amount of the reference genome can be covered with the fewest contigs:
CCI24R9XYLAWS:Data gibasmac$ ~/Applications/MUMmer3.23/show-tiling out.delta >gi|544163592|ref|NC_007898.3| 155461 bases 6559 8329 130 1771 100.00 98.77 + mcf-test_(single)_trimmed_contig_11 8460 40468 14243 32009 99.96 99.98 - mcf-test_(single)_trimmed_contig_2 54712 66101 39930 11390 100.00 99.99 - mcf-test_(single)_trimmed_contig_1 106032 121791 33670 15760 99.59 99.91 + mcf-test_(single)_trimmed_contig_6
Obviously in the extremely restricted example set I’ve used, I’ve missed covering quite a bit of the reference genome with my contigs.