Genome annotation with prokka
Why use Prokka? First, because in a benchmark test it has been shown to be as or more accurate at reproducing known annotation than RAST or xBASE2 in most annotation categories. Second, because it’s fast and you can run it on a standard laptop within a short time, while sending your genome out for annotation to the RAST server can take a day or so to return. If you’re interested in learning to use the RAST server, search the site — there’s an older annotation tutorial on here that will give you the basics.
To install Prokka you need to be in a UNIX environment — either an Apple running OSX or a Linux virtual machine. Prokka installs with Homebrew as well as with many other package managers for other systems. Install it, and while you’re at it, install the Sanger Institute’s artemis annotation browser as well.
brew install prokka
brew install artemis
Once the install completes, type “prokka” without any options to see a listing of options. To set up the databases that prokka will use to annotate your genomes, type:
So what does prokka do? You should definitely read the paper linked above and not just use it blindly. Prokka’s Github page also has a lot of information about what the program is doing, its possible settings, and command lines. Basically, it is driving five or more other programs to run, using sequence database searches to identify the products of the putative genes that they predict, and then integrating the results:h
- Prodigal (finds coding sequences)
- HMMER (finds HMM matches — the HAMAP database is available in the default installation but you could add others if you want them)
- RNAmmer (finds ribosomal RNAs)
- Aragorn (finds tRNAs)
- SignalP (finds signal peptides)
- Infernal (finds noncoding RNA)
That gives a pretty good starting coverage of everything that should be found in a prokaryotic genome. (Note: to use SignalP, you’ll have to install it manually because it requires a (free) license agreement and is not included in the prokka package. If you don’t select the gram neg/pos option you do not need it.)
The simplest command line for running prokka is something like:
prokka JY1305contig.fasta --centre U --compliant --force
Here what you are asking prokka to do is annotate the contigs in JY1305contig.fasta, name the contigs as if from a sequencing center named “U”, make the annotations compliant with NCBI standards (necessary if you want other tools to read your files) and force an overwrite of previous iterations of a default-named output directory.
Of course, it’s better to tailor your command line to the data that you have. In my example, JY1305 is a set of contigs we have for Vibrio vulnificus strain JY1305, from a single paired-end Illumina run. I know there are a lot of very short contigs in the output because sequencing was extremely deep meaning that regions which might have bubbles had high enough coverage that an error might be represented often enough to show up as a separate contig. So some of the prokka command line options I can take advantage of are:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--locustag [X] Locus tag prefix (default 'PROKKA')
--compliant Force Genbank/ENA/DDJB compliance:
--centre [X] Sequencing centre ID. (default '')
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
--kingdom [X] Annotation mode: Archaea|Bacteria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (needed if --kingdom is set) (default '0')
--gram [X] Gram: -/neg +/pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
--cpus [N] Number of CPUs to use [0=all] (default '8')
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
So I might run my command with a command line like:
prokka JY1305contig.fasta --outdir JY1305genus --force --compliant --centre UNCC --genus Vibrio --usegenus --cpus 2 --mincontiglen 500
With this command, I am using known information about the genus my target organism fits into. I’m getting rid of the scrappiest short contigs. I’m setting the number of cpus appropriately for my dual core laptop. I could also add information if I knew that my bacteria had an unusual genetic code or if I had signalP installed to find signal peptides specific to gram negative or positive bacteria.
The output that you’ll produce includes many files, but some of the most important are:
- JY1305genus/PROKKA_02052017.faa (FASTA protein file with predicted protein sequences, *.ffn is the corresponding nucleotide sequences)
- JY1305genus/PROKKA_02052017.fna (FASTA nucleotide file with contig sequences)
- JY1305genus/PROKKA_02052017.gbk (GenBank format file)
- JY1305genus/PROKKA_02052017.gff (Gene Feature Format annotation file)
Interpreting your results
You can use UNIX tricks to get a count of how many annotations are created for your genome. Each annotation line in a FASTA file starts with ‘>’. Using the UNIX command:
grep '>' JY1305genus/PROKKA_02052017.ffn | wc
on the *.ffn file will tell you how many total annotated regions there are. The first number in the output of wc is the line count.
4404 17404 201838
Using the command:
grep '>' JY1305genus/PROKKA_02052017.faa | wc
On the *.faa file will tell you how many of those gene annotations are protein coding genes (presumably the rest being noncoding RNAs).
4252 17075 197506
Using the command:
grep 'hypothetical protein' JY1305genus/PROKKA_02052017.faa | wc
will tell you how many of your annotated genes are “hypothetical proteins” i.e. they have no similarity to known sequences.
Visualizing your results
You can load your contigs and annotations into the Broad Institute’s IGV browser, or in the Sanger Institute’s artemis browser. In IGV, open the *.fna file you created as your genome, and also open the *.gff file as a track. You should be able to select each of your contigs, zoom in on them, and see the locations of individual genes. Hover over the gene names to see their properties and whatever information prokka managed to find out about them.
In artemis, you’ll get a somewhat different view of the same data. To start an art instance from the command line, just include the name of the gff file you are interested in on your command line:
If that doesn’t work, which it may not if there’s anything nonstandard in the annotation that was created, you can open artemis just by typing ‘art’, and then read in first the genome sequence file and then load the GFF file as “Entries”.
Either of these browsers provide an overview of a genome’s annotated content, and you can decide which you like best as you go along. Artemis will show more detail about gene reading frame and a different style of view, which you can see in the featured image for this post.
Follow the steps outlined above to analyze the E. coli contigs you produced last week. You can try this on more than one set of contigs since it is so quick to run, at least unless you get into the fancier pattern searches. Calculate counts of the genes found by prokka for your contigs. Compare them with the expected number of genes in E. coli K-12. Describe the results of your analysis. How many genes were found, and how does that compare to the number in the reference genome? How many genes were left unannotated, and how does that compare to the number of “hypothetical proteins” in the reference genome? You can do this by downloading amino acid and FASTA nucleotide sequences for the reference genome from GenBank (these are in the Dropbox, they are the NC_000913 files). Next week we will work with some tools that will allow us to align and compare genomes and also to compare your version of the E. coli genome assembly to the published version.