BINF 2111: Genome Horoscope, Inc.
Last week, you made a script that takes a “panel” of SNPs where a particular variant allele is associated with blue eyes in Europeans, parses an individual’s 23 and Me results to check their SNP status for each site in the panel, and reports back to the user.
This week, I want you to modify your “Genome Horoscope” project with a couple of goals in mind:
- Define functions that do distinct parts of the script’s work.
- Make the script a little more generalizable, so it can tell users about more than just one trait or condition. You may want to have two scripts — one that parses the available GWAS data and makes a “SNP panel” file for a new trait, and another that parses the SNP panel and the user’s 23 and Me file and presents them with a result.
There’s nothing really hard here coding-wise — just parsing a new kind of file and putting its information into a form you can manipulate. The hard stuff comes when you need to start to imagine use cases and make your script do enough checking to avoid constant failures.
Imagine your scripts for the following use case: you work at a startup company that is eventually going to provide Genome Horoscopes to people via web apps. You probably don’t want to just provide sensitive information to people all willy-nilly with no fact checking or supportive information about serious risks (hello FDA), so you’re going to prototype on a few non-controversial traits. You need to make two things easy. The first thing is the User experience — parsing the user’s data and giving them an answer. The second thing you need to make easy is what you might call a Curator experience — you need to make it easy to create a new “SNP panel” for any trait or condition (and eventually, to revise existing panels when new information pops up in the GWAS database).
Basic requirements for this week
- One script parses the GWAS database and pulls out all needed information from SNPs associated with a trait to make a “SNP panel”.
- You’ll need to design a SNP panel file format to put these in.
- Second script checks the user’s genome against your new SNP panels, offering the user a choice of which trait to look at.
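To make the "design a SNP panel file format" requirement concrete, here is one possible layout, sketched in Python. It is only an illustration, not the required format: the `#trait` header line, the three column names, and the function names `write_panel` and `read_panel` are all made up for this example. The one SNP used is the tanning example from this handout.

```python
import csv
import io

# Hypothetical column set for a panel file; design your own as needed.
PANEL_FIELDS = ["rsid", "risk_allele", "effect"]


def write_panel(handle, trait, snps):
    """Write one trait's SNPs as a tab-delimited panel with a '#trait' first line."""
    handle.write(f"#trait\t{trait}\n")
    writer = csv.DictWriter(
        handle, fieldnames=PANEL_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(snps)


def read_panel(handle):
    """Return (trait, list of SNP dicts) from a panel written by write_panel."""
    first = handle.readline().rstrip("\n")
    tag, _, trait = first.partition("\t")
    if tag != "#trait" or not trait:
        raise ValueError("panel file must start with a '#trait<TAB>name' line")
    return trait, list(csv.DictReader(handle, delimiter="\t"))


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_panel(
    buffer,
    "Tanning",
    [{"rsid": "rs1393350", "risk_allele": "A", "effect": "increased tanning ability"}],
)
buffer.seek(0)
trait, snps = read_panel(buffer)
print(trait, snps[0]["rsid"], snps[0]["risk_allele"])
```

Keeping the trait name inside the file means the user-facing script can build its menu of traits just by reading the first line of each panel.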
Where to get SNP information
NHGRI produces a catalog of all openly published genome-wide association studies, covering thousands of SNPs. The catalog is downloadable as a tab-delimited file. NHGRI also publishes a detailed description of every column in that file, and you will need to read it to know which columns hold the data you want.
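A minimal Python sketch of reading the tab-delimited catalog and keeping only the rows for one trait might look like the following. The column names `DISEASE/TRAIT`, `SNPS`, and `STRONGEST SNP-RISK ALLELE` are my assumption about the header row; check them against the column description and your own download before relying on them. The function name `rows_for_trait` is made up, and the second sample row is a placeholder, not real data.

```python
import csv
import io

# Assumed header names; verify against the published column description.
TRAIT_COL = "DISEASE/TRAIT"
SNP_COL = "SNPS"
RISK_COL = "STRONGEST SNP-RISK ALLELE"


def rows_for_trait(handle, trait):
    """Return every catalog row whose trait column matches `trait` (case-insensitive)."""
    # QUOTE_NONE: treat stray quote characters in free-text columns as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = {TRAIT_COL, SNP_COL, RISK_COL} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"catalog is missing expected columns: {sorted(missing)}")
    wanted = trait.strip().lower()
    return [row for row in reader if row[TRAIT_COL].strip().lower() == wanted]


# Tiny stand-in for the real file: one row from this handout, one placeholder row.
sample = (
    "DISEASE/TRAIT\tSNPS\tSTRONGEST SNP-RISK ALLELE\n"
    "Tanning\trs1393350\trs1393350-A\n"
    "Placeholder trait\trs0000000\trs0000000-?\n"
)
for row in rows_for_trait(io.StringIO(sample), "tanning"):
    print(row[SNP_COL], row[RISK_COL])
```

Checking for the expected columns up front is one cheap way to fail with a clear message if the catalog layout ever changes.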
On the NHGRI GWA Studies page, you can actually search all the available studies by trait, using a pulldown menu that contains all of the trait keywords in the database. There are all sorts of traits in the database, from serious diseases like Parkinson’s disease, to personality and physical traits, to tendencies for caffeine or alcohol consumption. Multiple SNPs can be associated with a trait, and each one has a row in the big GWAS table.
Let’s say I want to make a panel that tells me how likely I am to be able to get a good tan. By browsing the pulldown menu, I know that there is a “Tanning” keyword available in the list so there must be some SNPs related to getting a tan.
This is part of the table that the NHGRI site will present to you if you search for SNPs using its webpage. In another window, do the search on the NHGRI webpage so you can see the complete table. There is a lot of information here, but key elements that you should recognize show up. The SNP reference identifier is there, along with a “risk allele”, that is, the variant that is most associated with “risk” for the trait. For example, an “A” allele at rs1393350 is associated with an increase in tanning ability. Not surprisingly, when I grep for that SNP in my own data, I get:
reg-10-16-56-94:sandbox cgibas$ cat CJG-23andMe.txt | grep 1393350
rs1393350 11 89011046 GG
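The same lookup can be done inside a script. Here is a sketch in Python, assuming the usual 23 and Me raw-data layout of `#` comment lines followed by four whitespace-separated columns (rsid, chromosome, position, genotype). The function name `load_genotypes` is made up, and the sample text is just the line shown above.

```python
import io


def load_genotypes(handle):
    """Map each rsid in a 23 and Me raw-data file to its genotype string."""
    genotypes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines rather than crash
        rsid, _chromosome, _position, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


sample = "# rsid\tchromosome\tposition\tgenotype\nrs1393350\t11\t89011046\tGG\n"
genotypes = load_genotypes(io.StringIO(sample))
print(genotypes.get("rs1393350"))  # GG
print(genotypes.get("rs0000000"))  # None: this SNP is not in the file
```

Unlike `grep 1393350`, which matches any line containing that string of digits, the dictionary lookup only matches the exact rsid, and `.get()` lets you handle SNPs that are missing from a user's file.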
Obviously, I could sit down and put together a SNP panel file by hand, but that wouldn’t be very automated, would it? That’s why we need a curation script.
Identify the items that you think you need out of the big GWAS file to present the user with a genome horoscope about any trait.
Find three traits to focus on. Tanning is a little challenging, because half the SNPs were only studied in a European female population. If you are finding the project easy, you can try to work with that set, changing your file format, your panel creator script, and your user inputs accordingly. You can also work with simpler traits. For example, you could look at hoarding, birth weight, anger, callous/unemotional behavior, or some of the European hair color SNPs. Most of these traits have a single well-defined risk allele at each SNP that can be parsed out of the data automagically. Slightly more challenging cases include male pattern baldness, which has a couple of SNPs that don’t have a single defined risk allele, and height in European individuals, which requires some parsing of the population studied.
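Handling SNPs without a single defined risk allele is mostly a matter of not trusting the field blindly. The Python sketch below assumes the catalog writes the risk allele as the rsid, a hyphen, and the allele (for example `rs1393350-A`), with something other than a base, such as `?`, when no allele is defined. Confirm that against the column description. The function name `parse_risk_allele` is made up.

```python
def parse_risk_allele(entry):
    """Split 'rs1393350-A' into ('rs1393350', 'A').

    The allele comes back as None when it is missing or is not a single
    A, C, G, or T, so the caller can decide to skip or flag that SNP.
    """
    text = entry.strip()
    rsid, separator, allele = text.rpartition("-")
    if not separator:
        return text, None  # no hyphen at all: only an rsid was given
    allele = allele.strip().upper()
    if allele not in {"A", "C", "G", "T"}:
        allele = None
    return rsid.strip(), allele


print(parse_risk_allele("rs1393350-A"))
print(parse_risk_allele("rs1393350-?"))
```

Returning `None` instead of raising an error lets the curation script keep going and report which SNPs it had to leave out of the panel.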
Sketch out the format of a SNP file that you will put them in.
For now, plan to limit your output to simple answers, e.g., “NO, you do not have a copy of the A risk allele that is associated with increased tanning ability,” “you have one copy of the A risk allele,” or “you are homozygous for the A risk allele.”
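Those three answers correspond to zero, one, or two copies of the risk allele in the user's genotype. One way to sketch that in Python is shown below. The function names `count_risk_copies` and `describe` are made up, and the message wording is only a suggestion. It assumes a usable genotype is exactly two letters from A, C, G, and T. Anything else (for example a `--` no-call) is reported as unusable rather than guessed at.

```python
def count_risk_copies(genotype, risk_allele):
    """Return 0, 1, or 2 copies of the risk allele, or None if the call is unusable."""
    calls = genotype.strip().upper()
    if len(calls) != 2 or any(base not in "ACGT" for base in calls):
        return None  # no-call, single-letter call, or anything unexpected
    return calls.count(risk_allele.upper())


def describe(genotype, risk_allele, effect):
    """Turn a genotype into one of the simple answers for the user."""
    copies = count_risk_copies(genotype, risk_allele)
    if copies is None:
        return f"No usable genotype call ({genotype!r}) at this site."
    if copies == 0:
        return f"NO, you do not have a copy of the {risk_allele} risk allele associated with {effect}."
    if copies == 1:
        return f"You have one copy of the {risk_allele} risk allele associated with {effect}."
    return f"You are homozygous for the {risk_allele} risk allele associated with {effect}."


# The genotype from the grep example in this handout.
print(describe("GG", "A", "increased tanning ability"))
```

Separating the counting from the wording keeps the logic easy to test and lets you change the report text later without touching the genotype check.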
Write a script that parses the needed information out of the GWAS file.
Write a script that can parse any SNP panel file that is in your self-designed format, compare a 23 and Me profile to it, and produce a report.
Note: Dr. Mays has shared his complete 23 and Me data with us, so there are now two real 23 and Me files on the class Moodle — mine and his.