Lab 2: Genome Annotation and Comparison
Why use Prokka? Because in a benchmark test it has been shown to be as or more accurate at reproducing known annotation than RAST or xBASE2 in most annotation categories. It can run on a normal laptop for a small genome and is easy and convenient to use.
Setup
Prokka is not installed on the hpc-student cluster yet, so we’re going to learn how to install a program for your own use on the cluster. Jon Halter in University Research Computing troubleshot the process for us, and it works fine for me as of this week.
First, you’ll need the mambaforge module loaded:
module load mambaforge/4.14
Then use mamba to install prokka:
mamba create -n prokka -c conda-forge -c bioconda -c defaults prokka
Finally, you can check to see that it worked and where the application is located:
mamba activate prokka
which prokka
From there you should be able to follow the instructions below, which will also work if you choose to install prokka on your own laptop for some reason.
Once the install completes, type “prokka” without any options to see a listing of options. To set up the databases that prokka will use to annotate your genomes, type:
prokka --setupdb
prokka --listdb
So what does prokka do? You should definitely read the prokka paper linked in Canvas and not just use it blindly. Prokka’s Github page also has a lot of information about what the program is doing behind the scenes, its possible settings, and command lines. Basically, it is driving five or more other programs to run, using sequence database searches to identify the products of the putative genes that they predict, and then integrating the results:h
- Prodigal (finds coding sequences)
- HMMER (finds HMM matches — the HAMAP database is available in the default installation but you could add others if you want them)
- RNAmmer (finds ribosomal RNAs)
- Aragorn (finds tRNAs)
- SignalP (finds signal peptides)
- Infernal (finds noncoding RNA)
That gives a pretty good starting coverage of everything that should be found in a prokaryotic genome. (Note: to use SignalP, you’ll have to install it manually because it requires a (free) license agreement and is not included in the prokka package. If you don’t select the gram neg/pos option you do not need it, so just leave it alone for now.)
The simplest command line for running prokka is something like:
prokka JY1305contig.fasta --centre U --compliant --force
Here what you are asking prokka to do is annotate the contigs in JY1305contig.fasta, name the contigs as if from a sequencing center named “U”, make the annotations compliant with NCBI standards (necessary if you want other tools to read your files) and force an overwrite of previous iterations of a default-named output directory.
Options
Of course, it’s better to tailor your command line to the data that you have. In my example, JY1305 is a set of contigs we have for Vibrio vulnificus strain JY1305, from a single paired-end Illumina run. I know there are a lot of very short contigs in the output because sequencing was extremely deep meaning that regions which might have bubbles had high enough coverage that an error might be represented often enough to show up as a separate contig. So some of the prokka command line options I can take advantage of are:
Outputs:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--locustag [X] Locus tag prefix (default 'PROKKA')
--compliant Force Genbank/ENA/DDJB compliance:
--genes
--mincontiglen 200
--centre [X] Sequencing centre ID. (default '')
Organism details:
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
Annotations:
--kingdom [X] Annotation mode: Archaea|Bacteria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (needed if --kingdom is set) (default '0')
--gram [X] Gram: -/neg +/pos (default '')
--hmm
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
Computation:
--cpus [N] Number of CPUs to use [0=all] (default '8')
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
So I might run my command with a command line like:
prokka JY1305contig.fasta --outdir JY1305genus --force --compliant --centre UNCC --genus Vibrio --usegenus --cpus 2 --mincontiglen 500
With this command, I am using known information about the genus my target organism fits into. I’m getting rid of the scrappiest short contigs. I’m setting the number of cpus appropriately for my dual core laptop. I could also add information if I knew that my bacteria had an unusual genetic code or if I had signalP installed to find signal peptides specific to gram negative or positive bacteria.
The output that you’ll produce includes many files, but some of the most important are:
- JY1305genus/PROKKA_02052017.faa (FASTA protein file with predicted protein sequences, *.ffn is the corresponding nucleotide sequences)
- JY1305genus/PROKKA_02052017.fna (FASTA nucleotide file with contig sequences)
- JY1305genus/PROKKA_02052017.gbk (GenBank format file)
- JY1305genus/PROKKA_02052017.gff (Gene Feature Format annotation file)
Interpreting your results
You can use UNIX tricks to get a count of how many annotations are created for your genome. Each annotation line in a FASTA file starts with ‘>’. Using the UNIX command:
grep '>' JY1305genus/PROKKA_02052017.ffn | wc
on the *.ffn file will tell you how many total annotated regions there are. The first number in the output of wc is the line count.
4404 17404 201838
Using the command:
grep '>' JY1305genus/PROKKA_02052017.faa | wc
On the *.faa file will tell you how many of those gene annotations are protein coding genes (presumably the rest being noncoding RNAs).
4252 17075 197506
Using the command:
grep 'hypothetical protein' JY1305genus/PROKKA_02052017.faa | wc
will tell you how many of your annotated genes are “hypothetical proteins” i.e. they have no similarity to known sequences.
Visualizing your results
You can load your contigs and annotations into the Broad Institute’s IGV browser, or in the Sanger Institute’s artemis browser. In the past, I have been able to install either of these on my laptop without a significant amount of trouble. They will need an up to date java installation.
In IGV, open the *.fna file you created as your genome, and also open the *.gff file as a track. You should be able to select each of your contigs, zoom in on them, and see the locations of individual genes. Hover over the gene names to see their properties and whatever information prokka managed to find out about them.
In artemis, you’ll get a somewhat different view of the same data. To start an art instance from the command line, just include the name of the gff file you are interested in on your command line:
art mydir/mygenome.gff
If that doesn’t work, which it may not if there’s anything nonstandard in the annotation that was created, you can open artemis just by typing ‘art’, and then read in first the genome sequence file and then load the GFF file as “Entries”.
Either of these browsers provide an overview of a genome’s annotated content, and you can decide which you like best as you go along. Artemis will show more detail about gene reading frame and a different style of view, which you can see in the featured image for this post.
Your report
Follow the steps outlined above to analyze the E. coli contigs you produced last week. You can try this on more than one set of contigs since it is so quick to run, at least unless you get into the fancier pattern searches. Calculate counts of the genes found by prokka for your contigs. Compare them with the expected number of genes in E. coli K-12. Describe the results of your analysis. How many genes were found, and how does that compare to the number in the reference genome? How many genes were left unannotated, and how does that compare to the number of “hypothetical proteins” in the reference genome? You can do this by downloading amino acid and FASTA nucleotide sequences for the reference genome from GenBank (these are in the Dropbox, they are the NC_000913 files). Next week we will work with some tools that will allow us to align and compare genomes and also to compare your version of the E. coli genome assembly to the published version.