Using the Ontologizer (part 1)
I have a list of differentially expressed genes that I need to analyze. One way to get an idea of what’s being differentially expressed is to compare this gene list to the entire gene content of the organism, and see which categories of genes are enriched in the differentially expressed genes. This kind of analysis could be done using various gene classifications but the Gene Ontology is our frame of reference.
In order to view graphics generated by the Ontologizer, you’ll need to have graphviz installed on your computer, specifically a program in the package called dot. Note: you need graphviz for other purposes as well, like installing Taverna. The easiest way to install it is via a package manager. I have been using Homebrew lately; it’s foolproof (unless you’re a sufficiently talented fool). With Homebrew you’d just type “brew install graphviz” and be done with it. There are plenty of useful NGS analysis tools on the homebrew/science tap.
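Before going further, it can help to confirm that dot is actually visible on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name find_dot is my own, and finding dot this way says nothing about whether the Ontologizer itself will find it.

```python
import shutil


def find_dot():
    """Return the full path to Graphviz's `dot` executable, or None if it is not on PATH."""
    return shutil.which("dot")


dot_path = find_dot()
if dot_path is None:
    print("dot not found on PATH; install graphviz first (e.g. brew install graphviz)")
else:
    print(f"dot found at {dot_path}")
```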
Anyway. I downloaded the Ontologizer.jar file and placed it in a directory ~/Applications/Ontologizer.
To run the Ontologizer at the command line you need at minimum three things. The first is a gene list, in the form of a list of locus identifiers (your study set, -s flag on the command line), e.g. these identifiers from Vibrio vulnificus CMCP6, which were pulled from edgeR output:
The second thing you need is a file that associates Gene Ontology terms with the locus identifiers for the source genome that your gene list is from (-a flag on the command line). Note: in a case where the set of genes you are surveying is essentially identical to the list of genes in the association file, you can use the association file as the population file as well (-p flag on the command line, equivalent to selecting all genes in the GUI).
Where do you get the association file? Well, EMBL-EBI provides a database called UniProt-GOA, where you can download GO associations for your genome. If you grab an association file via FTP by clicking the first link on that page, you’ll end up with a big set of subdirectories to browse through. You want to go into the proteomes directory, find the most recently dated subdirectory, and search in there for the files you need. There are better (programmatic) ways to do this, but here’s what it would look like if you opened up the FTP site as Guest on your laptop, and searched for Vibrio vulnificus in the proteomes/20140218 subdirectory:
The inside of one of those *.goa files looks like this:
!Project_name: UniProt GO Annotation (UniProt-GOA)
!Contact Email: firstname.lastname@example.org
!Date downloaded from the QuickGO browser: 20140219
!Filtering parameters selected to generate file: GAnnotation?tax=216895&count=25&select=normal&advanced=&termUse=ancestor&slimTypes=IPO%3D
UniProtKB E7MCC0 VV1_3252 GO:0003824 GO_REF:0000002 IEA InterPro:IPR003607 F Putative uncharacterized protein E7MCC0_VIBVU|VV1_3252 protein taxon:216895 20140215 InterPro
UniProtKB E7MCC0 VV1_3252 GO:0008081 GO_REF:0000002 IEA InterPro:IPR006674 F Putative uncharacterized protein E7MCC0_VIBVU|VV1_3252 protein taxon:216895 20140215 InterPro
UniProtKB E7MCC0 VV1_3252 GO:0008152 GO_REF:0000002 IEA InterPro:IPR003607 P Putative uncharacterized protein E7MCC0_VIBVU|VV1_3252 protein taxon:216895 20140215 GOC
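If you want to check which GO terms are attached to each locus before running anything, a file like this is easy to read with a few lines of Python. This is only a sketch under some assumptions: that the real file is tab-delimited (the rows above are shown with the tabs flattened), that the locus symbol is the third field, and that the GO term is the first field that looks like GO:nnnnnnn. The function name read_associations is mine, and the sample rows are the ones above, retyped with tabs and truncated.

```python
import re
from collections import defaultdict

# A GO term ID is "GO:" followed by seven digits; this excludes GO_REF references.
GO_ID = re.compile(r"^GO:\d{7}$")


def read_associations(lines):
    """Map each locus symbol to the set of GO term IDs annotated to it."""
    terms_by_locus = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # header and comment lines start with "!"
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not an annotation row
        locus = fields[2]  # assumed position of the locus symbol
        for field in fields:
            if GO_ID.match(field):
                terms_by_locus[locus].add(field)
                break
    return dict(terms_by_locus)


sample = [
    "!Project_name: UniProt GO Annotation (UniProt-GOA)",
    "UniProtKB\tE7MCC0\tVV1_3252\t\tGO:0003824\tGO_REF:0000002\tIEA\tInterPro:IPR003607\tF",
    "UniProtKB\tE7MCC0\tVV1_3252\t\tGO:0008081\tGO_REF:0000002\tIEA\tInterPro:IPR006674\tF",
]
associations = read_associations(sample)
print(sorted(associations["VV1_3252"]))
```

With a real file you would pass an open file handle instead of the sample list. The keys of the resulting dictionary are also a quick way to see whether the identifiers in your study set appear in the association file at all.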
The final thing that you need is an OBO file for the Gene Ontology (-g flag on your command line). It’s the formal description of the GO, and the Ontologizer needs it as a frame of reference.
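Putting the pieces together, a command-line run combines the four flags mentioned above. Here is a rough sketch that just assembles and prints such a command. It assumes the downloaded jar can be launched with java -jar, and every file name except Ontologizer.jar is a placeholder of mine, so check the Ontologizer documentation before relying on it.

```python
import shlex


def ontologizer_command(jar, obo, associations, study, population):
    """Assemble the argument list for a command-line Ontologizer run."""
    return [
        "java", "-jar", jar,
        "-g", obo,           # Gene Ontology definition (OBO file)
        "-a", associations,  # GO associations for the genome
        "-s", study,         # study set: the differentially expressed loci
        "-p", population,    # population set: the background gene list
    ]


cmd = ontologizer_command(
    jar="Ontologizer.jar",
    obo="gene_ontology.obo",        # placeholder file name
    associations="vibrio.goa",      # placeholder file name
    study="study.txt",              # placeholder file name
    population="population.txt",    # placeholder file name
)
print(shlex.join(cmd))
```

The list form can be handed straight to subprocess.run if you would rather launch the analysis from a script than paste the printed line into a terminal.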
To launch the GUI version manually, change to the directory where you have the Ontologizer installed and type the following at the UNIX command line prompt:
java -Xmx1G -XstartOnFirstThread -cp swt.jar:ontologizer-gui-with-dependencies.jar ontologizer.gui.swt.Ontologizer
If you add a new project, you will be prompted to add your *.obo file and annotation file in the first dialogue, and then to choose your study set in the second. Then you simply select your study set and click “Ontologize”. The default analysis method that gets used is Parent-Child-Union, and you should definitely use a multiple testing correction (try Bonferroni to start with). You’ll get a list of gene ontology terms marked green if they are part of the Biological Process hierarchy, yellow if they are part of Molecular Function, and pink if they are part of the Cellular Component hierarchy. You can select a p-value threshold for significance (try 0.05) and then write the selected genes out to files or (supposedly) graph them.
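For a sense of what the Bonferroni correction does to your results, here is a small worked sketch: each raw p-value is multiplied by the number of terms tested and capped at 1. The term names and p-values are made up for illustration, and this is the textbook definition rather than the Ontologizer's own code.

```python
def bonferroni(p_values):
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1.0."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Made-up raw p-values for three hypothetical GO terms.
raw = {"term_A": 0.001, "term_B": 0.02, "term_C": 0.6}

adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
significant = [term for term, p in adjusted.items() if p <= 0.05]

for term, p in adjusted.items():
    print(term, p)
print("significant at 0.05:", significant)
```

Note that term_B would pass a 0.05 threshold on its raw p-value but not after correction, which is exactly the kind of borderline hit the correction is meant to weed out.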
But today, I’m going to stop here, because despite having graphviz installed properly, the Ontologizer isn’t displaying my graphs, and it’s time to do some serious troubleshooting. More to come…