BINF 6203: Chloroplast genome annotation (Part 2)
Now that you’ve learned to manipulate tracks in CLC, you can add some externally generated tracks to your genome. One of the easiest chloroplast-specific resources for genome annotation is the CPGAVAS system provided by the Herbal Genomics project. CPGAVAS runs a series of analyses for you and returns information about the types of features we expect on a chloroplast genome. It’s a bit of a black box: you can just upload your FASTA file and get a GFF3 file back without learning too much about what it is doing.
As a part of your report, you should take a look at the CPGAVAS paper, read about the methods that it is using to produce your annotation, and write a brief description of the CPGAVAS workflow.
The main CPGAVAS page advises that you use the default parameters. You can do that, or, if you want to, you can try selecting a restricted group of reference genomes and see what effect it has on your annotation.
Be sure that you capture your project ID so that you can get your completed results. It will take some time — so don’t start this two hours before your assignment is due!
Don’t just wait for an e-mail to come to you on this one — you’re going to have to check back for results. Mine were there when I got a chance to check about 6 hours after I submitted the job. Go to the “ViewAnno” link on the CPGAVAS main page to check. Once your results are returned, you will have several kinds of information available.
- A GFF3 file containing all your annotations
- A logfile with output of each component that was run
- Links to separate files containing each type of annotation, including possible errors
- A super cool circular graphic map of your genome
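Before importing anything, it can help to see which kinds of features the GFF3 file actually contains. The sketch below counts the feature types in column 3 of a GFF3 file. The sample lines, the "CPGAVAS" source label, and the feature names are made up for illustration; your real file will look different, so point the commands at your own .gff file instead of the temporary one.

```shell
# Build a tiny stand-in GFF3 file (hypothetical content, tab-separated columns).
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf '139532332666156\tCPGAVAS\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=psbA\n' >> "$gff"
printf '139532332666156\tCPGAVAS\tCDS\t100\t900\t.\t+\t0\tID=cds1;Parent=gene1\n' >> "$gff"
printf '139532332666156\tCPGAVAS\ttRNA\t1200\t1272\t.\t-\t.\tID=trna1;Name=trnH-GUG\n' >> "$gff"

# Skip comment lines, then count how often each feature type (column 3) occurs.
type_counts=$(grep -v '^#' "$gff" | cut -f 3 | LC_ALL=C sort | uniq -c | awk '{print $2 "=" $1}')
echo "$type_counts"
```

Each output line has the form type=count, which gives you a quick checklist of the tracks to expect once the file is imported.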
You’re going to want to upload each of your types of annotations into CLC as tracks, and then you can manipulate them as shown in the first part of this exercise.
You can do this by importing the GFF3 file (containing all of the types of tracks) into CLC using Import Tracks. BUT THERE IS A TRICK.
In order for CLC to recognize the reference genome, you have to go into your GFF3 file and edit the name of your sequence. CPGAVAS saves your GFF3 output as a file named something like 139532332666156.gff. In the file, your sequence (on every single line and annotation) is named something like 139532332666156. However, CLC is calling your genome something else. It’s calling mine NC_007898 consensus. To map the tracks, you have to change every single instance of “139532332666156” to “NC_007898 consensus”. The match is case sensitive and space sensitive (i.e. you don’t want to add or remove any spaces when you are making the substitution). There are something like 8000 instances of the pattern in your file and, thanks to UNIX, you do not have to change them manually.
Pull up a terminal window. Find the path to your file. Mine was in
/Users/cynthiagibas/Dropbox/6203-2014
Go to that directory and type “vi 139532332666156.gff”. The screen will change a bit and you are now inside the vi editor. You want to make a global search and replace and then save and quit. In vi, type:
:g/139532332666156/s//NC_007898 consensus/g
and hit return. (Obviously you should use the number that’s on your CPGAVAS output, not the exact number above which is mine). vi should report how many instances of the pattern have been changed.
Then type:
:wq
vi should close, with your file edited and saved. There is an example of an edited file in the dropbox for the genome I worked with in the example but I WANT YOU TO MAKE YOUR OWN. Using global search and replace in vi should be like breathing for a bioinformatician; you don’t even have to think about it, it just fixes stuff for you.
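If you want to double-check your vi edit, or repeat it on another file without opening an editor, the same substitution can be sketched with sed and verified with grep. This is only an illustration on a two-line stand-in file: the ID and genome name are the ones from my example, and the file contents are invented. It assumes neither name contains characters that are special to sed, such as "/" or "&".

```shell
old_id='139532332666156'       # ID assigned by CPGAVAS (yours will differ)
new_id='NC_007898 consensus'   # name CLC uses for the genome (yours may differ)

# Tiny stand-in for the CPGAVAS output; a real file has thousands of lines.
workdir=$(mktemp -d)
printf '%s\tCPGAVAS\tgene\t100\t900\t.\t+\t.\tID=gene1\n' "$old_id" > "$workdir/$old_id.gff"
printf '%s\tCPGAVAS\tCDS\t100\t900\t.\t+\t0\tParent=gene1\n' "$old_id" >> "$workdir/$old_id.gff"

# Write the renamed copy to a new file so the original stays untouched.
sed "s/$old_id/$new_id/g" "$workdir/$old_id.gff" > "$workdir/renamed.gff"

# Sanity checks: count lines still carrying the old ID and lines with the new name.
# "|| true" keeps the script going when grep finds nothing (count 0).
remaining=$(grep -c "$old_id" "$workdir/renamed.gff" || true)
renamed=$(grep -c "$new_id" "$workdir/renamed.gff" || true)
echo "lines with old ID: $remaining; lines with new name: $renamed"
```

On your own file, the count of lines with the old ID should be 0 after the edit. If it is not, the substitution missed something and CLC will likely fail to match the track to your genome.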
Import your modified GFF, select and drag the resulting tracks over to your track list in CLC, et voilà.
Interesting questions to consider in your lab report:
- How well is CPGAVAS performing in finding the expected protein sequences?
- Expected RNA sequences?
- What sequences appear to be missing from your genome?
- Is that likely to be because your data is incomplete, or because they’re truly missing?
- If you open up your genome and the reference genome side by side in CLC, can you see whether the genes you are missing are likely to fall approximately within a region of your genome where there isn’t much data (assuming that your genome and the reference genome are basically collinear)?
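One possible way to get started on the "what is missing" question is to compare the gene names in your annotation against a list of genes you expect from the reference. The sketch below does that with standard UNIX tools. Everything in it is a stand-in: the file names, the two annotated genes, the three-gene expected list, and the assumption that gene names are stored in a Name= attribute. Check how your own GFF3 file records gene names before relying on it.

```shell
workdir=$(mktemp -d)

# Hypothetical annotation with two genes found.
printf 'seq1\tCPGAVAS\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=psbA\n' > "$workdir/mine.gff"
printf 'seq1\tCPGAVAS\tgene\t1500\t2900\t.\t-\t.\tID=gene2;Name=rbcL\n' >> "$workdir/mine.gff"

# Hypothetical list of gene names expected from the reference genome.
printf 'matK\npsbA\nrbcL\n' > "$workdir/expected.txt"

# Keep only gene features, pull the Name= value out of the attributes, and sort.
awk -F '\t' '$3 == "gene"' "$workdir/mine.gff" \
  | sed -n 's/.*Name=\([^;]*\).*/\1/p' \
  | LC_ALL=C sort -u > "$workdir/found.txt"
LC_ALL=C sort -u "$workdir/expected.txt" > "$workdir/expected.sorted.txt"

# comm -23 prints lines that appear only in the first file: expected but not found.
missing=$(LC_ALL=C comm -23 "$workdir/expected.sorted.txt" "$workdir/found.txt")
echo "$missing"
```

With the made-up data above, this prints matK. A name showing up here only means the annotation did not report it; you still need to look at the coverage in CLC to decide whether the gene is truly absent or just fell in a region with little data.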