GenoSets: Loading data from EMBL
I’m going to be adding some tutorials on how to use GenoSets, our business intelligence-inspired tool for mining multiple genome data sets.
The GenoSets system integrates several fundamental processes in comparative genomic analysis. GenoSets can be used to load tracked data associated with multiple genomes into an OLAP (Online Analytical Processing) data warehouse and to visualize set-based queries of sequence-associated metadata.
In this tutorial, we’ll show you how to load genome sequence data from the EMBL database, run the ancillary software packages BLAST and OrthoMCL that form the basis of set-based querying across genomes, import Gene Ontology classifications for genomes in the dataset, and construct simple set-based queries.
Starting the Program
When you installed GenoSets, the application was placed in your system Applications folder. Click the GenoSets icon to start the program.
Creating a Database
There are five pull-down menus associated with the main GenoSets window. Go to the Database pull-down menu and select DB Connections. A dialog window will pop up. We’re going to create a database called ecoli.
In the image above, we show creation of the database as the MySQL root user, but you should really create a separate, password-protected MySQL user account, because some of the subsequent steps using OrthoMCL expect to connect to a MySQL account that has a password. To create a user and grant appropriate privileges, see the first steps in the MySQL tutorial from DigitalOcean. Enter your MySQL username and password in the appropriate fields. Click the “Next” buttons until you have completed the connection. When the connection is complete, you should see the connection name at the very bottom right of the main GenoSets window.
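If you'd rather not leave the tutorial to look up the SQL, here is a minimal sketch that prints the two statements you would paste into a mysql prompt while logged in as root. The user name `genosets_user`, the placeholder password, and the choice to grant all privileges on the `ecoli` database only are assumptions made for illustration; GenoSets and OrthoMCL may need a different set of privileges, so check their documentation.

```python
def mysql_setup_statements(user, password, database, host="localhost"):
    """Build the SQL that creates a MySQL user and grants it access to one database.

    All names passed in are illustrative; nothing here is prescribed by GenoSets.
    """
    account = f"'{user}'@'{host}'"
    return [
        # Create a password-protected account (run this as the MySQL root user).
        f"CREATE USER {account} IDENTIFIED BY '{password}';",
        # Assumption: full privileges, limited to the single tutorial database.
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO {account};",
    ]


if __name__ == "__main__":
    for statement in mysql_setup_statements("genosets_user", "change-me", "ecoli"):
        print(statement)
```

Replace the password before running the statements, then enter the same user name and password in the GenoSets connection dialog.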
Adding Data
Now, we will add some data from EMBL to our database. We’re going to look at eight genomes of Escherichia coli bacteria. E. coli are commonly found in the human digestive tract, but specific strains of these bacteria have adaptations that make them the cause of severe foodborne illness. We will look at laboratory strains of E. coli alongside O157:H7 strains (the cause of newsworthy mass food poisoning incidents such as the ones that made Jack-in-the-Box infamous) and strains from the 2011 outbreak that sickened and killed people across Europe. We’ll also look at a strain of Shigella dysenteriae that should probably be re-classified as an E. coli strain. Once you load these data into GenoSets, you will be able to construct different types of set queries around these genomes. For example: what’s in both of the O157:H7 strains, but not in the commensal strain?
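To make the idea of a set query concrete, here is a small sketch of the "in both O157:H7 strains, but not in the commensal strain" question using plain Python sets. The gene-family identifiers (`fam1`, `fam2`, and so on) are invented for illustration, and this is not how GenoSets runs the query internally; it only shows the set logic.

```python
# Hypothetical gene-family identifiers per genome. Real sets would come from
# the data loaded into GenoSets, not from a hand-written dictionary.
families = {
    "O157:H7 Sakai": {"fam1", "fam2", "fam3", "fam4"},
    "O157:H7 EC4115": {"fam1", "fam2", "fam3", "fam5"},
    "Commensal": {"fam1", "fam4", "fam5"},
}


def shared_but_absent(present_in, absent_from, sets):
    """Return families found in every genome of present_in and in none of absent_from."""
    shared = set.intersection(*(sets[name] for name in present_in))
    excluded = set().union(*(sets[name] for name in absent_from))
    return shared - excluded


result = shared_but_absent(
    ["O157:H7 Sakai", "O157:H7 EC4115"], ["Commensal"], families
)
print(sorted(result))  # ['fam2', 'fam3']
```

The intersection keeps what the two pathogenic strains have in common, and the difference removes anything the commensal strain also carries.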
To get the strains you want imported into GenoSets, you’ll need to know the INSDC accession numbers that label those sequences in public databases such as EMBL.
If you’re using GenoSets, you’re probably focusing on comparing a specific set of bacterial genomes, and you’ll start with a list of their identifiers to load.
To find these numbers in NCBI, go to the Genomes page and search using the name of the organism. Depending on what you search for, you’ll appear to get one entry back and if you click through, you’ll come to a page with a taxonomy tree on it. These are not the droids you’re looking for. After you search, switch from the “Overview” tab at the top of the results page to the “Prokaryotes” tab. Under that tab, you’ll see your results in a table, with the INSDC numbers shown. (Why do we use INSDC numbers? Because when we go to fetch your data, we pull it from the European version of GenBank at EMBL. EMBL makes it easier for the program to pull genome data behind the scenes.) Each genome may have multiple accession numbers depending on how many chromosomes and plasmids it has.
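If you're curious what pulling a record from EMBL by accession can look like, here is a rough sketch. It assumes the European Nucleotide Archive's browser API endpoint for EMBL flat files, which is the address I know of at the time of writing; check the ENA documentation before relying on it. This is not necessarily how GenoSets fetches data, and the function names are made up for this example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: the ENA browser API serving EMBL-format flat files.
ENA_EMBL_ENDPOINT = "https://www.ebi.ac.uk/ena/browser/api/embl/"


def embl_record_url(accession):
    """Build the download URL for one INSDC accession (e.g. 'U00096.3')."""
    return ENA_EMBL_ENDPOINT + quote(accession.strip())


def fetch_embl_record(accession, timeout=60):
    """Download one record as text. Needs network access; whole genomes are large."""
    with urlopen(embl_record_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only build and print the URL here; call fetch_embl_record() to download.
print(embl_record_url("U00096.3"))
```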
- Ancestor molecular biology strain (ATCC 8739) (INSDC: CP000946.1)
- Common molecular biology strain K-12 MG1655 (INSDC: U00096.3)
- Commensal E. coli (INSDC: AP009378.1, AP009379.1)
- Strain O157:H7 Sakai (INSDC: BA000007.2, AB011549.2, AB011548.2)
- Another O157:H7 strain (EC4115) (INSDC: CP001164.1, CP001163.1, CP001165.1)
- Shigella dysenteriae (INSDC: CP000034.1, CP000035.1, CP000640.1)
- European outbreak strain O104:H4 (INSDC: CP003301.1, CP003302.1, CP003303.1)
- Strain O103:H2 (a close relative of O104:H4) (INSDC: AP010958.1, AP010959.1)
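If you keep your own notes on which accession belongs to which strain, a small script can turn them into the one-ID-per-line format that's easy to paste. The sketch below uses the accessions from the list above; the dictionary layout and the short labels are just one way to organize them.

```python
# Strain labels mapped to their INSDC accessions (chromosome first, then plasmids),
# copied from the list in this tutorial.
strains = {
    "ATCC 8739": ["CP000946.1"],
    "K-12 MG1655": ["U00096.3"],
    "Commensal E. coli": ["AP009378.1", "AP009379.1"],
    "O157:H7 Sakai": ["BA000007.2", "AB011549.2", "AB011548.2"],
    "O157:H7 EC4115": ["CP001164.1", "CP001163.1", "CP001165.1"],
    "Shigella dysenteriae": ["CP000034.1", "CP000035.1", "CP000640.1"],
    "O104:H4": ["CP003301.1", "CP003302.1", "CP003303.1"],
    "O103:H2": ["AP010958.1", "AP010959.1"],
}

# Flatten to a single list, preserving the order of the strains above.
accessions = [acc for ids in strains.values() for acc in ids]

# One accession per line, ready to paste.
print("\n".join(accessions))
```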
To add these data into the GenoSets system, you will use the Add Data menu. Choose Load Data. The data loader window will open, giving you several methods for loading data. Since you want to load genome sequence files at this point, open the EMBL menu and choose “By Accession ID”:
Next you’ll move to a window where you can enter the list of genome accession IDs. This can be pasted from a text file – if you have a lot of genomes you’ll want to do that. For this exercise, paste the list of IDs for your E. coli genomes – here it is in a concise format.
CP000946.1
U00096.3
AP009378.1
AP009379.1
BA000007.2
AB011549.2
AB011548.2
CP001164.1
CP001163.1
CP001165.1
CP000034.1
CP000035.1
CP000640.1
CP003301.1
CP003302.1
CP003303.1
AP010958.1
AP010959.1
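With a long list, a typo or a duplicated line is easy to miss. Here is a minimal sketch of a sanity check you could run on your text file before pasting. The pattern is an assumption: it covers the common versioned INSDC nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits, followed by a version number) and is not a check that GenoSets itself performs.

```python
import re

# Assumed pattern for versioned INSDC nucleotide accessions, e.g. U00096.3 or CP000946.1.
ACCESSION_PATTERN = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})\.\d+$")


def check_accessions(text):
    """Split pasted text into lines and return (valid, rejected) accession lists.

    Malformed entries and repeats of an already-seen accession are rejected.
    """
    valid, rejected = [], []
    seen = set()
    for line in text.splitlines():
        token = line.strip()
        if not token:
            continue  # skip blank lines
        if ACCESSION_PATTERN.match(token) and token not in seen:
            seen.add(token)
            valid.append(token)
        else:
            rejected.append(token)
    return valid, rejected


pasted = "CP000946.1\nU00096.3\nAP009378.1\nAP00937.1\nU00096.3\n"
valid, rejected = check_accessions(pasted)
print(valid)     # ['CP000946.1', 'U00096.3', 'AP009378.1']
print(rejected)  # ['AP00937.1', 'U00096.3']
```

In the sample input, `AP00937.1` is rejected because it is one digit short, and the second `U00096.3` is rejected as a duplicate.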
Your import window should look like this:
Once you’ve entered your accession numbers, you’ll be taken to another dialog, where you can describe the type of information you’re uploading. If you are pulling in published annotation from EMBL, you can leave this description alone. You’d change this description if, for instance, you were uploading a file of independently created annotations to compare to EMBL.
If you look at the bottom right corner of the main GenoSets window, you should see upload status notifications as each of your genomes is accessed and loaded. It will take a few minutes for this step to complete.
Next, we’re going to create a study set and get started on comparing genomic content.