BINF 6203: Annotation with RAST
In this lab exercise, you’ll use the myRAST software (or the RAST website) to annotate an assembled genome. Try this first with the E. coli genome assembly that you generated a couple of weeks ago (your contigs.fasta file), and then try to annotate a sequenced but poorly characterized genome from the NCBI microbial genomes collection. E. coli will have tons of genomic neighbors with well-documented annotation, while annotating a genome from among the uncategorized (or poorly categorized) environmental strains in the NCBI database gives you the opportunity to explore a more difficult problem. myRAST is easy to install, but if you have trouble with it, you can create an account on the RAST/SEED website and run your annotation there instead.
The RAST pipeline combines multiple methods and criteria to produce an annotation for a bacterial genome. It is only one way of annotating a genome, but it has the benefit of being fairly widely accepted in microbial genomics, so its use can be easily justified.
Filtering your contigs.fasta (if using your own assembly)
One thing that you may want to do before running myRAST is to get rid of the scrappy little low-quality contigs at the bottom of your assembly. Why? Those contigs are likely to contain fragmentary genes and may confuse your annotation and subsequent interpretation. When you evaluate your assembly using QUAST, by default it ignores contigs that are shorter than 500 bp, but it doesn’t write you a new filtered contig file. So how could you filter your contigs?
UNIX is full of little surprises, and some of those surprises are whole other super-useful niche languages, like sed and awk. bioawk is an extension of awk that teaches it to parse standard sequence data formats like FASTA. The command below is a bioawk recipe for filtering sequences by length: it keeps only the sequences longer than 500 bases. You can try it on your contigs file:
bioawk -c fastx '{ if(length($seq) > 500) { print ">"$name; print $seq }}' contigs.fasta
I am not going to make extravagant claims about whether this will work for you, but it is not hard to add bioawk to your system and it worked for me on my Mac laptop. As written, the output of the command will go to the screen. If you want to capture it in a file, you can redirect it by adding ‘> contigs.filtered.fasta’ to the end of the command.
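If bioawk is not an option on your machine, the same length filter can be sketched in plain Python with no extra packages. This is only an illustration of the idea, not part of the RAST workflow: the function names `read_fasta` and `filter_contigs` are made up for this example, and the parser assumes an ordinary FASTA file where each record starts with a `>` header line.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header, like bioawk's $name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(handle, min_length=500):
    """Return FASTA text with only the contigs longer than min_length."""
    kept = []
    for name, seq in read_fasta(handle):
        if len(seq) > min_length:  # same strict '>' test as the bioawk recipe
            kept.append(f">{name}\n{seq}\n")
    return "".join(kept)


if __name__ == "__main__":
    # Tiny in-memory demo: one 4-base contig and one 600-base contig.
    demo = io.StringIO(">tiny\nACGT\n>big\n" + "A" * 600 + "\n")
    print(filter_contigs(demo))
```

To use it on a real assembly, open contigs.fasta, pass the file handle to `filter_contigs`, and write the returned text to contigs.filtered.fasta. Like the bioawk command, it writes each sequence on a single line and drops anything after the first word of the header.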
Running RAST on the server
If you run RAST on the server as shown during lecture, you’ll need to create an account first.
In your report, you should consider the following:
- What are the closely related neighbor genomes to your genome (if you used an unknown)? If you used E. coli, which strains are the closest neighbors?
- If you walk the genome in the browser, how dense is the gene coverage? (Remember, bacterial genomes tend to have gene densities at around 85%, so you’re visually evaluating whether there’s a lot of un-called space that there shouldn’t be.)
- If you walk the genome in the comparison browser, are the gene matches complete or partial?
- How many of the called genes in the genome are “hypothetical” and not assigned a function?
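Two of the questions above (gene density and the share of hypothetical genes) can also be checked with numbers rather than by eye. The sketch below assumes you have already pulled gene start and end coordinates and function descriptions out of a RAST export yourself; how you do that depends on the export format, so it is not shown. The function names are invented for this example, and the density calculation treats the coordinates as belonging to a single sequence, so for a multi-contig assembly you would apply it per contig.

```python
def coding_density(gene_intervals, genome_length):
    """Fraction of the sequence covered by at least one gene call.

    gene_intervals: iterable of (start, end) pairs, 1-based and inclusive.
    Reverse-strand genes given as (larger, smaller) are handled too.
    """
    spans = sorted((min(s, e), max(s, e)) for s, e in gene_intervals)
    covered = 0
    cur_start = cur_end = None
    for start, end in spans:
        if cur_end is None or start > cur_end:
            # Close out the previous merged block before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping genes are merged so shared bases count once.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / genome_length


def hypothetical_fraction(functions):
    """Fraction of gene calls whose function text mentions 'hypothetical'."""
    functions = list(functions)
    if not functions:
        return 0.0
    hits = sum(1 for text in functions if "hypothetical" in text.lower())
    return hits / len(functions)


if __name__ == "__main__":
    # Toy values for illustration only; these are not real annotation results.
    toy_genes = [(1, 100), (50, 150), (300, 201)]
    print(coding_density(toy_genes, 1000))
    print(hypothetical_fraction(["hypothetical protein", "DNA polymerase III"]))
```

A density well below the roughly 85% expected for a bacterial genome would suggest un-called space worth looking at in the browser.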
If you do choose to try myRAST, here’s how it should work:
Running myRAST
Once you’ve filtered your contigs (or not) you can open the myRAST app. You’ll see the following window:
You can select ‘Process new genome’ at the bottom to get started. You’ll only have filenames in the window if you’ve already processed some genomes.
RAST may take an hour or so to run with the parameters above. For the purposes of the class, you can use the “Faster” setting. It will still take a while though. At the end of the run, you’ll be able to view the newly annotated genome in the RAST browser and it will also pull up database near neighbors so you can manually compare annotations:
If you figure out where the myRAST output folder lives on your machine, you can do a couple of things. First, you can find all of the information that gets saved. Second, you can add myRAST output folders from runs on other machines or created by other people. You won’t be able to find that directory until you have completed at least one myRAST run, because that’s when it gets created.
If you want to export information from RAST, you can use the export button. Files can be written out as tables or in FASTA format.
Warning: the output formats produced are not standard GFF or GenBank files. Standard files can be produced from myRAST, but only through a poorly documented command-line interface.
PEG file:
RAST FASTA
A simple thing that you can do to evaluate how well RAST annotated this E. coli genome compared to the reference annotation is to count. On your filtered sequence, did RAST find the same number of protein-encoding genes (PEGs) that exist in the (very well-vetted) published annotation? What about RNA genes?
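The counting can be scripted if your export contains feature identifiers. The sketch below assumes the identifiers follow the SEED style, such as `fig|83333.1.peg.12` for a protein-encoding gene or `fig|83333.1.rna.3` for an RNA gene; check your own export to confirm that before trusting the counts. The function names are made up for this example, and the reference numbers must come from the published annotation you are comparing against, so none are filled in here.

```python
import re
from collections import Counter

# Assumed SEED-style feature ID: fig|<genome id>.<feature type>.<number>
FEATURE_ID = re.compile(r"fig\|\d+\.\d+\.([A-Za-z_]+)\.\d+")


def count_feature_types(lines):
    """Count distinct features per type (e.g. 'peg', 'rna') in exported text."""
    seen = set()
    counts = Counter()
    for line in lines:
        match = FEATURE_ID.search(line)
        if match and match.group(0) not in seen:
            seen.add(match.group(0))  # count each feature ID only once
            counts[match.group(1)] += 1
    return counts


def compare_to_reference(counts, reference):
    """Return {type: RAST count minus reference count} for each reference type."""
    return {kind: counts.get(kind, 0) - ref for kind, ref in reference.items()}


if __name__ == "__main__":
    # Toy header lines for illustration only; not real RAST output.
    toy = [
        ">fig|83333.1.peg.1 hypothetical protein",
        ">fig|83333.1.peg.2 DNA polymerase III",
        ">fig|83333.1.rna.1 tRNA-Ala",
    ]
    print(count_feature_types(toy))
```

Running `count_feature_types` over the lines of an exported FASTA or table file gives the per-type totals, and `compare_to_reference` shows how far each total is from the numbers you looked up for the published annotation.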