BINF 6203 — Lab 5: Generating annotation for a genome
In this lab exercise, you’ll use the myRAST software to annotate an E. coli genome assembly that you generated a couple of weeks ago. We requested that myRAST be installed on the lab computers, but if it’s not or the version there does not run, you can download the DMG from the link above and put it in your own workspace.
The RAST pipeline combines multiple methods and criteria to produce an annotation for a bacterial genome. It is only one way of annotating a genome, but it has the benefit of being fairly widely accepted in microbial genomics so its use can be easily justified. myRAST is an app that you can install on your own machine which will run the RAST pipeline on a genome file or a set of contigs.
Filtering your contigs.fa
One thing that you may want to do before running myRAST is to get rid of the scrappy little low-quality contigs at the bottom of your assembly. Why? Those contigs are likely to contain fragmentary genes and may confuse your annotation and subsequent interpretation. When you evaluate your assembly using QUAST, by default it cuts off contigs that are smaller than 500, but it doesn’t write you a new filtered contig file. So how could you filter your contigs?
UNIX is full of little surprises and some of those surprises are whole other super-useful niche languages, like sed and awk. bioawk is an add on to the awk language that prepares it to interpret standard sequence data formats like fasta. The command below is an awk recipe for filtering sequences by length. You can try it on your contigs file:
bioawk -c fastx '{ if(length($seq) > 500) { print ">"$name; print $seq }}' contigs.fasta
I am not going to make extravagant claims about whether this will work for you, but I ran it on one of my spades-output contigs.fasta files this morning and it worked. As written, the output of the command will go to the screen. If you want to capture it in a file, you can redirect it by adding ‘> contigs.filtered.fasta’ to the end of the command.
Running myRAST
Once you’ve filtered your contigs (or not) you can open the myRAST app. You’ll see the following window:
You can select ‘Process new genome’ at the bottom to get started. You’ll only have filenames in the window if you’ve already processed some genomes.
RAST may take an hour or so to run with the parameters above. For the purposes of the class, you can use the “Faster” setting. It will still take a while though. At the end of the run, you’ll be able to view the newly annotated genome in the RAST browser and it will also pull up database near neighbors so you can manually compare annotations:
If you figure out where the myRAST output folder lives on your machine, you can do a couple of things — first, you can find all of the information that gets saved, and second, you can add myRAST output folders run on other machines or created by other people. You won’t be able to find that directory until you have completed at least one myRAST run, because that’s when it gets created.
If you want to export information from RAST, you can use the export button. Files can be written out as tables or in FASTA format.
The output formats produced are not standard GFF or GenBank files, but there are ways to get that information from myRAST later.
PEG file:
RAST FASTA
A simple thing that you can do to evaluate how well RAST annotated this e. coli genome compared to the reference annotation is to count. On your filtered sequence, did RAST find the same number of protein encoding genes that exist in the (very well-vetted) published annotation? What about RNA genes?
However, if you want to dig, you will find that hiding under the pretty surface of myRAST is another one of those UNIX surprises.
Here’s what the MyRAST app looks like if you find it in your Finder:
And here is what the MyRAST app looks like if you find it using your UNIX-fu. Short version? That thing that looks like a normal app that you just click on contains subdirectories and accessible binaries. And those binaries are little programs that allow you to talk directly to the RAST server and make all sorts of comparisons.
It is perfectly acceptable to just compare your resulting annotation to the published GenBank annotation for the purposes of this lab. But if you want to push further, here’s an example of how to use just a few of the “hidden” RAST toolkit commands to make a comparison of your output to the existing annotation:
First, set your $PATH variable to put these executables in your UNIX path.
export PATH=$PATH:/Applications/myRAST.app/bin
Set a variable so that your system will recognize the PSEED server:
export SAS_SERVER=PSEED
And then try out some commands.
svr_genbank_to_table Original.gbk > Original.tab svr_seed_to_table RAST_OutputDir > RAST.tab svr_compare_feature_tables Original.tab RAST.tab > Comparison.tab svr_tab2html 'http://rast.nmpdr.org/seedviewer.cgi?page=Annotation&feature=' < Comparison.tab > Comparison.html
Warning: I have not pretested this, because I am still learning all the stuff RAST/SEED can do. However, you are welcome to be a badass and try to get it working. And, you know, 10 points to Gryffindor (or replacement of your lowest quiz grade) if you do. Screenshots are needed as evidence.