BINF 6203: Genome Comparison with Mauve
There are many ways to compare genomes, and these comparisons provide different kinds of information about evolutionary history and shared function. Three questions come up again and again:
- First, how do you decide which genes are “the same” across multiple genomes, i.e. orthologs: genes that share a common ancestor and diverged through speciation?
- Second, which parts of the nucleotide sequence of multiple genomes align, irrespective of gene boundaries and orthology relationships?
- Finally, how do we decide what genes “are” and what they do, so that we can interpret the meaning of genomic similarities and differences?
In this exercise, we are going to use the program Mauve to compare several E. coli genomes. These genomes are well-studied and well-annotated, so we do not have to worry too much about applying GO terms and functional annotation. That information is already there.
Mauve is an application with a GUI rather than a command line program. It takes FASTA or GenBank files as input, and can also take files of assembled contigs for comparison to completely closed genomes. In this example, we will download GenBank files for:
- Common molecular biology E. coli strain K-12 MG1655 (ID: U00096.3)
- Commensal E. coli (ID: AP009378.1; plasmid: AP009379.1)
- Strain O157:H7 Sakai (ID: BA000007.2; plasmids: AB011549.2, AB011548.2)
- European outbreak strain O104:H4 (ID: CP003301.1; plasmids: CP003302.1, CP003303.1)
You can get the necessary files from NCBI using their web services. Searching the Nucleotide database at NCBI with the identifiers given above returns the correct record. The command line utility curl lets you fetch the same record by URL without using a browser:
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=U00096.3&rettype=gb&retmode=text" > U00096.3.gbk
For the purposes of the lab exercise you can just get the first sequence (the main chromosome) for each of the E. coli strains. In real life, you might want to download the concatenated version of the genomes that contains all of the replicating units, from EMBL, to see if there is sequence content in the plasmids that is of interest.
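If you would rather script the downloads than run curl four times, something like the following sketch would work. It assumes the first identifier listed for each strain is the main chromosome, and that rettype=gb with retmode=text returns the GenBank flat file. The function names are made up for this example, and only the Python standard library is used.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumption: the first accession listed for each strain is its chromosome.
CHROMOSOMES = ["U00096.3", "AP009378.1", "BA000007.2", "CP003301.1"]


def efetch_url(accession):
    """Build the efetch URL for one GenBank flat file."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "gb",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download(accession):
    """Fetch one record and save it as <accession>.gbk in the current directory."""
    with urlopen(efetch_url(accession)) as response:
        data = response.read()
    filename = accession + ".gbk"
    with open(filename, "wb") as handle:
        handle.write(data)
    return filename


def download_all(accessions=CHROMOSOMES, pause=0.5):
    """Download every accession, pausing between requests to be polite to NCBI."""
    saved = []
    for accession in accessions:
        saved.append(download(accession))
        time.sleep(pause)
    return saved
```

Calling download_all() from a Python prompt (with a network connection) should leave four .gbk files in your working directory. Open one in a text editor and check that it contains both the FEATURES table and the sequence before you load it into Mauve.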
What Mauve is going to try to do, iteratively, is to find the Locally Collinear Blocks (LCBs) in your genomes: stretches of sequence that are shared among genomes and contain no internal rearrangements. It is essentially aligning by looking for genomic breakpoints (where synteny “breaks” between genomes due to a large insertion or rearrangement). The first step is to go to the File menu in Mauve and select “Align with progressiveMauve”.
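To make the idea of a breakpoint concrete, here is a toy sketch. It is not Mauve's algorithm. It assumes each genome has already been reduced to an ordered list of signed block numbers, where the reference genome reads 1, 2, ..., n and a negative number means the block is inverted relative to the reference. The function name is hypothetical, and the genome is treated as linear for simplicity even though E. coli chromosomes are circular.

```python
def count_breakpoints(order):
    """Count breakpoints of a signed block order relative to 1, 2, ..., n.

    An adjacency (a, b) is conserved when b == a + 1. That also covers an
    inverted run such as (-3, -2). Every other adjacency is a breakpoint.
    """
    n = len(order)
    framed = [0] + list(order) + [n + 1]  # fixed start and end markers
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


# Same block order as the reference: synteny never breaks.
collinear = count_breakpoints([1, 2, 3, 4, 5])

# Blocks 2 and 3 inverted as a unit: synteny breaks on each side of the inversion.
inverted = count_breakpoints([1, -3, -2, 4, 5])
```

In the Mauve display, each colored block corresponds to one of these numbered blocks, and a block drawn below the center line is one that is inverted relative to the reference.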
The second step is to choose the four chromosome files that you downloaded and click “align”. Load them in the order U* (lab strain), A* (commensal), B* (O157:H7), C* (European outbreak 2011).
The default parameters are fine here; you don’t need to change anything.
The first alignment you will get looks reasonable. Obviously, these E. coli are closely related, and you would expect that there would be some alignable regions among the genomes. In the Mauve analysis, they share many long, alignable blocks of sequence.
Explore the Mauve zooming and browsing tools. See if you can find blocks that are common to some of the genomes, but not all. What genes do they contain? Show with screenshots where interesting regions are located, and discuss relevant genes.
Some background information about these genomes that might help you look for interesting genes: The K-12 strain is the “lab strain” of E. coli. The commensal strain is one that lives harmlessly inside us. O157:H7 is the infamous serotype behind the 1993 Jack in the Box outbreak and a whole lot of other food poisoning in the 90s; the Sakai isolate used here comes from a large 1996 outbreak in Sakai, Japan. The O104:H4 strain is the type that killed dozens of people in Europe in summer 2011. Use your Google-fu/PubMed-fu to find out what you can about these strains and to see if you can locate any of the pertinent genes in the Mauve browser view. You can actually search for specific genes by name in Mauve, and that might help locate interesting regions.
You can sometimes improve on the arrangement of these alignments. Genome coordinates (i.e. where the numbering starts on a circular genome) can be rather arbitrary. Mauve will reorder your contigs for you if you choose “Move Contigs” under the Tools menu. You’ll notice in the alignment that the final genome, the European 2011 outbreak strain, is the most different from the other three. You can try to rearrange it relative to one of the other genomes using Move Contigs. Using U* (the lab strain genome) as the reference makes the most sense, since that is the most highly curated genome in the universe. Do you think this step actually improves the alignment between the two genomes?
The Move Contigs tool is actually intended for unclosed genomes. Try it with the contigs that you assembled from E. coli data a few weeks ago. In that case, show your alignment between U* and your contigs before and after rearrangement. Does the Move Contigs tool produce a better alignment this time?
You probably don’t need to change any parameters from the defaults.
You can save out ortholog information from your Mauve alignment, although it is not in a format that is easy to interpret. Your ortholog files from Mauve will show a list of regions which are orthologous. The alignment file will show multiple alignments of those regions.
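If you want to pull the block coordinates out of the alignment file yourself, a sketch like the one below could be a starting point. It assumes the alignment file is in Mauve's XMFA format, where each aligned sequence starts with a header line of the form "> genome:start-end strand filename" and each block ends with a line beginning with "=". It also assumes that a 0-0 coordinate range marks a genome that is absent from a block. Both function names are invented for this example.

```python
import re

# Header lines look like: "> 1:1234-5678 + U00096.3.gbk"
HEADER = re.compile(r"^>\s*(\d+):(\d+)-(\d+)\s+([+-])")


def read_xmfa_blocks(lines):
    """Return one dict per block, mapping genome number -> (start, end, strand)."""
    blocks, current = [], {}
    for line in lines:
        if line.startswith("="):  # end of one aligned block
            if current:
                blocks.append(current)
            current = {}
            continue
        match = HEADER.match(line)
        if match:
            genome, start, end = (int(match.group(i)) for i in (1, 2, 3))
            if end > 0:  # assumption: 0-0 means the genome is not in this block
                current[genome] = (start, end, match.group(4))
    if current:  # tolerate a file with no final "=" line
        blocks.append(current)
    return blocks


def partial_blocks(blocks, n_genomes):
    """Blocks found in at least two genomes but not in all of them."""
    return [block for block in blocks if 1 < len(block) < n_genomes]
```

You would call it with an open file, for example read_xmfa_blocks(open("my_alignment.xmfa")), where the file name is whatever you chose when you ran the alignment. With four genomes loaded, partial_blocks(blocks, 4) lists the regions shared by some strains but not all, which are good places to start browsing.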
However, you’ll notice that these regions don’t come with much annotation, so it is hard to see why regions are shared or different among genomes. To interpret the content of the regions shown in the ortholog file, you could write a Python script that takes the region coordinates and then filters the original GenBank file to find the gene annotation, but this is beyond the scope of this assignment.
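For the curious, here is roughly what such a script could look like. This is a minimal sketch using only the standard library, not a full GenBank parser: it assumes each gene feature has its location on a single line, it takes the smallest and largest coordinate in the location (so a gene that wraps around the origin would be mishandled), and it ignores strand. The function names are hypothetical. A library such as Biopython would be the more robust choice for real work.

```python
import re

# Feature keys start in column 6; qualifiers start in column 22.
FEATURE = re.compile(r"^ {5}(\S+)\s+(\S+)")
QUALIFIER = re.compile(r'^ {21}/(gene|locus_tag)="([^"]*)"')


def read_genes(lines):
    """Collect (start, end, name) for each 'gene' feature in a GenBank flat file."""
    genes, current = [], None
    for line in lines:
        feature = FEATURE.match(line)
        if feature:
            current = None  # any new feature ends the previous one
            if feature.group(1) == "gene":
                coords = [int(x) for x in re.findall(r"\d+", feature.group(2))]
                if coords:
                    current = [min(coords), max(coords), None]
                    genes.append(current)
            continue
        qualifier = QUALIFIER.match(line)
        if qualifier and current is not None:
            # Prefer the /gene name; fall back to /locus_tag if there is none.
            if qualifier.group(1) == "gene" or current[2] is None:
                current[2] = qualifier.group(2)
    return [tuple(gene) for gene in genes]


def genes_in_region(genes, start, end):
    """Names of genes that overlap the closed interval [start, end]."""
    return [name for g_start, g_end, name in genes
            if g_start <= end and g_end >= start]
```

Given genes = read_genes(open("U00096.3.gbk")), a call such as genes_in_region(genes, start, end) with the coordinates of a Mauve region returns the names of the genes that fall in it.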
Obviously, the amount of work required to RUN Mauve and get an alignment is nowhere near the amount of work required to understand what Mauve is actually telling you biologically, and why. I’ve linked the Mauve paper in Moodle to help you think about what you’re seeing. Interpreting your output in a biologically relevant way will also require some thought.
Use a Mauve alignment to see what you missed in your assembly and annotation exercise a few weeks ago. While you were able to see the features you created in the regular genome browsers, they did not give any insight into what genes might have been missed by your assembly and annotation. Try aligning your best assembly to the E. coli K-12 lab strain genome, and see how your annotation compares to the published annotation. Are there significant regions in the published reference that are missing in your assembly? What about genes? If you can see where genes are missing, are they occurring in clusters in genomic gaps, scattered throughout the genome, or both? Why do you think this is?
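One way to put numbers on "what is missing" is to list the stretches of the reference that no aligned block covers. The sketch below assumes you already have the reference coordinates of the blocks that align to your assembly as (start, end) pairs, 1-based and inclusive, and that you know the reference length. The function name is made up for this example.

```python
def uncovered(intervals, genome_length):
    """Return reference intervals (1-based, inclusive) covered by no aligned block."""
    gaps, position = [], 1  # position = first base not yet known to be covered
    for start, end in sorted(intervals):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= genome_length:
        gaps.append((position, genome_length))
    return gaps


# Toy example on a 500 bp "genome" with three aligned blocks, two overlapping.
example_gaps = uncovered([(1, 100), (150, 300), (250, 400)], 500)
```

Here example_gaps holds the two stretches that no block covers, bases 101 to 149 and 401 to 500. Comparing gaps like these with the gene coordinates in the reference annotation tells you whether missing genes cluster in a few large gaps or are scattered across many small ones.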