Pangenome analysis with PanX
PanX is a program for pangenome analysis and for building core-genome phylogenies. The PanX analysis tools are available as a GitHub package for custom genome analyses, and also as a web server. The PanX authors also provide a collection of precomputed comparisons on the PanX site.
One of the demonstration datasets available at the PanX site is a collection of Prochlorococcus marinus genomes sequenced by Biller et al. (2014). In lecture last week, we also looked at a study of Prochlorococcus by Kashtan et al. (2014) that used a locally collinear block (LCB) alignment approach to identify common genomic backbones belonging to specific clades. You can use the accession numbers found in PanX to search NCBI and learn more about the content of individual strain genomes.
Examine the collection of Prochlorococcus genomes available to you in PanX and try to answer the following questions:
How many core genes and accessory genes are present in this collection? What is the average number of genes in each genome (you can probably approximate this by looking at a few examples)?
What is the most diverse gene in the core genome? What about the most diverse gene in the core genome that does not have any duplications? Examine the phylogenetic trees for these individual genes and describe the relationships between them and the PanX species tree for the whole dataset.
What are the least diverse genes in the core genome? Look at the five least diverse genes. Are these genes that you would expect to be conserved in cyanobacteria, and why? If you don’t know what they do, look them up!
Compare the trees for the least diverse ribosomal genes to the “species” tree produced by PanX. Do they have the same topology? If not, can you identify congruent regions between the trees? Take a look at the sequence alignments of these genes. Are the SNP variants between the genomes scattered randomly, or are there obvious patterns of mutation?
Examine some genes in the accessory genome. It’ll be hard to decide exactly what to look at here, but for instance, take a look at psbM and psbF. What do these genes do? Are these genes in all the genomes? Do they have the same or different patterns of presence/absence across the core-genome tree (i.e. are they present or absent in the same branches)? See if you can find a literature reference that explains what the “psb” system these genes belong to does, and what a complete set of psb genes would look like.
There is a huge amount of information to sift through here, and you’re not going to become experts in Prochlorococcus biology overnight. Find three genes or biological functions of interest. You can use the discussion in the Kashtan paper as a starting point, focus on something you know about from your previous studies, or pick a well-known topic like antibiotic resistance or circadian rhythm. Identify the relevant genes and their pattern of presence and absence in the Prochlorococcus genomes, as well as the evolutionary relationships among the strains where they are present.
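If it helps to see the bookkeeping behind the core/accessory and presence/absence questions, here is a minimal Python sketch. It is not PanX code and does not use PanX output: the strain names, gene names, and the `genomes` table are invented placeholders, and "core" is assumed to mean "present in every genome", which may not match the cutoff PanX itself applies.

```python
from collections import Counter

# Hypothetical toy data: strain name -> set of gene-cluster names.
# Strain and gene names are invented placeholders, not real PanX results.
genomes = {
    "strain_A": {"gene_1", "gene_2", "gene_3", "gene_x", "gene_y"},
    "strain_B": {"gene_1", "gene_2", "gene_3", "gene_x"},
    "strain_C": {"gene_1", "gene_2", "gene_3", "gene_y"},
    "strain_D": {"gene_1", "gene_2", "gene_3"},
}


def classify_genes(genomes):
    """Split genes into core (in every genome) and accessory (in only some)."""
    counts = Counter(gene for genes in genomes.values() for gene in genes)
    core = {gene for gene, n in counts.items() if n == len(genomes)}
    return core, set(counts) - core


def mean_genome_size(genomes):
    """Average number of genes per genome."""
    return sum(len(genes) for genes in genomes.values()) / len(genomes)


def strains_with(genomes, gene):
    """Set of strains in which the given gene is present."""
    return {strain for strain, genes in genomes.items() if gene in genes}


core, accessory = classify_genes(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("mean genes per genome:", mean_genome_size(genomes))

# Compare the presence/absence patterns of two accessory genes.
x_strains = strains_with(genomes, "gene_x")
y_strains = strains_with(genomes, "gene_y")
print("both:", sorted(x_strains & y_strains))
print("gene_x only:", sorted(x_strains - y_strains))
print("gene_y only:", sorted(y_strains - x_strains))
```

To try this on your own notes, replace the toy `genomes` table with the strains and genes you record while browsing PanX. Two genes whose "only" sets are both empty share the same presence/absence pattern, and you can then check on the species tree whether the strains that carry them fall within a single clade.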