Getting your 23andMe data into Galaxy
I just happen to have a file with 960,628 lines of personal SNP data from 23andMe burning a hole in my hard drive. I’m one of the lucky people who gets 23andMe’s extended health information because I bought my genotype at the right time, before the FDA got all handsy. But if I wanted to do more with that information on my own, could I?
- If I wanted to search for SNPs and traits one by one, I could use SNPedia.
- I could do the same thing in the UCSC Genome Browser.
- If I wanted to start from a disease perspective, I could use OMIM.
- If I wanted to search for SNPs that have a GWAS association, I could use GOCI and search by either broad categories or keywords.
- If I wanted to really get into an interface that delivered me more detail than I could process with my eyeballs, I could use dbSNP.
Next time I’ll write about each of these sources and what you can learn there, but first, let’s do something a little more high-throughput. Because 960,628 is a lot of one-by-one.
Convert your data to VCF format
The first thing we need to do is get this special 23andMe file into a form Galaxy can deal with. Surely someone has wanted to do this before, right? Well, some guy called arrogantrobot has created a Perl script that converts a 23andMe file into a VCF file, which is a file format that Galaxy definitely traffics in. To get it, you just:
git clone https://github.com/arrogantrobot/23andme2vcf
in a directory where you’ve got write access. Then, run it on your data:
perl ../23andme2vcf/23andme2vcf.pl genome_Cynthia_Gibas_Full_20140719094157.txt genome_Cynthia_Gibas_Full_20140719094157.vcf
(Make sure that a copy of the reference genome file that’s distributed with 23andme2vcf is sitting in your working directory.)
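If you want to know what's actually in the raw file before (or after) you convert it, here's a minimal Python sketch that tallies SNPs per chromosome and counts no-calls. It assumes the usual 23andMe raw-data layout: comment lines starting with #, then four tab-separated columns (rsid, chromosome, position, genotype), with no-calls written as dashes. The function name summarize_23andme and the sample records are made up for illustration; they are not part of 23andme2vcf.

```python
from collections import Counter
import io


def summarize_23andme(lines):
    """Tally SNPs per chromosome and count no-calls in 23andMe raw data.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    per_chrom = Counter()
    no_calls = 0
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # not a data row in the assumed four-column layout
        rsid, chrom, pos, genotype = fields
        per_chrom[chrom] += 1
        # Assumption: a genotype containing '-' is a no-call.
        if "-" in genotype:
            no_calls += 1
    return per_chrom, no_calls


# Made-up records, only to show the shape of the input.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)
per_chrom, no_calls = summarize_23andme(sample)
print(dict(per_chrom), no_calls)
```

To run it on real data, pass an open file handle instead of the sample, for example summarize_23andme(open("your_23andme_file.txt")).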
The contents of your output VCF file should look like this:
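If you'd rather check the output with a script than by eyeball, here's a minimal Python sketch. It relies only on the standard VCF layout: meta lines starting with ##, one #CHROM header line, then tab-separated records whose first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The function name check_vcf and the sample records are invented for illustration, and the sample column name GENOTYPE is a placeholder; this is not the converter's actual output.

```python
import io

# The eight fixed columns every VCF record starts with.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf(lines, max_preview=3):
    """Return (header columns, record count, first few records) for a VCF."""
    header = None
    n_records = 0
    preview = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        fields = line.split("\t")
        if len(fields) < len(FIXED_COLUMNS):
            raise ValueError(f"too few columns: {line!r}")
        int(fields[1])  # POS must be an integer; raises ValueError otherwise
        n_records += 1
        if len(preview) < max_preview:
            preview.append(fields)
    if header is None or header[:8] != FIXED_COLUMNS:
        raise ValueError("missing or malformed #CHROM header line")
    return header, n_records, preview


# Made-up records, only to show the shape of a VCF file.
sample = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGENOTYPE\n"
    "1\t1000\trs0000001\tA\tG\t.\t.\t.\tGT\t0/1\n"
    "1\t2000\trs0000002\tC\t.\t.\t.\t.\tGT\t0/0\n"
)
header, n_records, preview = check_vcf(sample)
print(header[:8], n_records)
```

If this raises an error on your converted file, that's worth sorting out before you upload anything to Galaxy.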
Get it into Galaxy
Now you can upload the VCF file into Galaxy. Simply select the file to upload, allow Galaxy to auto-detect the file type, and associate the file with the GRCh37/hg19 genome build.
Galaxy should find your newly-created file acceptable.
This is not going to easily tell you much about what individual SNPs mean, but it’s a first thing to do with your data — visualize it in Trackster. All you need to do is select your data and select the same genome build you used as a reference when you imported it. Here’s a section of my data, visualized on Chromosome 1.
Create a custom UCSC browser track with your data
The next thing you can do is convert your data into a genome browser track for the UCSC genome browser. First, select VCF to MAF custom track from the Graph/Display Data section in the Galaxy Tool panel.
Once it’s done converting, open the data set’s detailed information in the history panel and choose display at UCSC Main:
HEY LOOK MOM I’M IN THE UCSC GENOME BROWSER!!!
1337 genome h4x0r!