BINF 6203: Bacterial genome annotation with prokka
In previous iterations of this course, we’ve used the RAST/SEED servers to perform annotation. RAST is a highly regarded genome annotation pipeline, and you will easily be able to justify using it to reviewers if you are ever writing a bacterial genome paper. However, it can be cumbersome to use the myRAST interface, the command line tools are not particularly well documented, and it can take a day or more to run. So this year we’re going to try a newer method, Torsten Seemann’s Prokka software. If you are interested in learning how to use RAST, the previous tutorial is still available here.
Why use Prokka? First, because in a benchmark test it has been shown to be as or more accurate at reproducing known annotation than RAST or xBASE2 in most annotation categories. Second, because it’s fast and you can run it on a standard laptop in less than ten minutes.
Setup
To install Prokka you need to be in a UNIX environment — either an Apple running OSX or a Linux virtual machine.
What I had to do to get prokka working on my laptop was the following. First, install some perl modules that prokka depends on using the command:
sudo cpan -i Bio::Perl
There may be some errors and warnings in this install, but I ignored them, chose the defaults when prompted, and everything was fine. Once this is installed, run the git clone command to install the current version of prokka:
git clone https://github.com/tseemann/prokka.git
Once the install completes, type “prokka” without any options to see a listing of options. To set up the databases that prokka will use to annotate your genomes, type:
/path/to/prokka --setupdb
So what does prokka do? You should definitely read the paper linked above and not just use it blindly. Basically, it is driving five other programs to run, using sequence database searches to identify the products of the putative genes that they predict, and then integrating the results:
- Prodigal (finds coding sequences)
- RNAmmer (finds ribosomal RNAs)
- Aragorn (finds tRNAs)
- SignalP (finds signal peptides)
- Infernal (finds noncoding RNA)
That gives a pretty good starting coverage of everything that should be found in a prokaryotic genome. (Note: to use SignalP, you’ll have to install it manually because it requires a (free) license agreement and is not included in the prokka package. If you don’t select the gram neg/pos option you do not need it.)
The simplest command line for running prokka is something like:
prokka JY1305contig.fasta --centre U --compliant --force
Here what you are asking prokka to do is annotate the contigs in JY1305contig.fasta, name the contigs as if from a sequencing center named “U”, make the annotations compliant with NCBI standards (necessary if you want other tools to read your files) and force an overwrite of previous iterations of a default-named output directory.
Options
Of course, it’s better to tailor your command line to the data that you have. In my example, JY1305 is a set of contigs we have for Vibrio vulnificus strain JY1305, from a single paired-end Illumina run. I know there are a lot of very short contigs in the output because sequencing was extremely deep meaning that regions which might have bubbles had high enough coverage that an error might be represented often enough to show up as a separate contig. So some of the prokka command line options I can take advantage of are:
Outputs: --outdir [X] Output folder [auto] (default '') --force Force overwriting existing output folder (default OFF) --prefix [X] Filename output prefix [auto] (default '') --locustag [X] Locus tag prefix (default 'PROKKA') --compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF) --centre [X] Sequencing centre ID. (default '') Organism details: --genus [X] Genus name (default 'Genus') --species [X] Species name (default 'species') --strain [X] Strain name (default 'strain') --plasmid [X] Plasmid name or identifier (default '') Annotations: --kingdom [X] Annotation mode: Archaea|Bacteria|Viruses (default 'Bacteria') --gcode [N] Genetic code / Translation table (needed if --kingdom is set) (default '0') --gram [X] Gram: -/neg +/pos (default '') -- requires SignalP --usegenus Use genus-specific BLAST databases (needs --genus) (default OFF) Computation: --cpus [N] Number of CPUs to use [0=all] (default '8') --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1') --evalue [n.n] Similarity e-value cut-off (default '1e-06') --rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
So I might run my command with a command line like:
prokka JY1305contig.fasta --outdir JY1305genus --force --compliant --centre UNCC --genus Vibrio --usegenus --cpus 2 --mincontiglen 500
With this command, I am using known information about the genus my target organism fits into. I’m getting rid of the scrappiest short contigs. I’m setting the number of cpus appropriately for my dual core laptop. I could also add information if I knew that my bacteria had an unusual genetic code or if I had signalP installed to find signal peptides specific to gram negative or positive bacteria.
The output that you’ll produce includes many files, but some of the most important are:
- JY1305genus/PROKKA_02052017.faa (FASTA protein file with predicted protein sequences, *.ffn is the corresponding nucleotide sequences)
- JY1305genus/PROKKA_02052017.fna (FASTA nucleotide file with contig sequences)
- JY1305genus/PROKKA_02052017.gbk (GenBank format file)
- JY1305genus/PROKKA_02052017.gff (Gene Feature Format annotation file)
Interpreting your results
You can use UNIX tricks to get a count of how many annotations are created for your genome. Each annotation line in a FASTA file starts with ‘>’. Using the UNIX command:
grep '>' JY1305genus/PROKKA_02052017.ffn | wc
on the *.ffn file will tell you how many total annotated regions there are. The first number in the output of wc is the line count.
4404 17404 201838
Using the command:
grep '>' JY1305genus/PROKKA_02052017.faa | wc
On the *.faa file will tell you how many of those gene annotations are protein coding genes (presumably the rest being noncoding RNAs).
4252 17075 197506
Using the command:
grep 'hypothetical protein' JY1305genus/PROKKA_02052017.faa | wc
will tell you how many of your annotated genes are “hypothetical proteins” i.e. they have no similarity to known sequences.
Visualizing your results (at least kind of)
You can load your contigs and annotations into the Broad Institute’s IGV browser. Download the correct version of the browser to your computer. Then open the *.fna file you created as your genome, and also open the *.gff file as a track. You should be able to select each of your contigs, zoom in on them, and see the locations of individual genes. Hover over the gene names to see their properties and whatever information prokka managed to find out about them.
This is probably not the most useful view of a genome’s annotated content, but it is the classic type of view that genome browsers generally provide. Take some time to try out the navigation tools in IGV so that you feel comfortable using it. Another browser you can try it the Sanger Institute’s Artemis, which will show more detail about gene reading frame and a different style of view.
Your report
Follow the steps outlined above to analyze the E. coli contigs you produced last week. You can try this on more than one set of contigs since it is so quick to run. Calculate counts of the genes found by prokka for your contigs. Compare them with the expected number of genes in E. coli K-12. Describe the results of your analysis. How many genes were found, and how does that compare to the number in the reference genome? How many genes were left unannotated, and how does that compare to the number of “hypothetical proteins” in the reference genome? You can do this by downloading amino acid and FASTA nucleotide sequences for the reference genome from GenBank (these are in the Dropbox, they are the NC_000913 files). Next week we will work with some tools that will allow us to align and compare genomes and also to compare your version of the E. coli genome assembly to the published version.