BINF 2111: Do SNPs make a difference
In the next script, we’ll combine the information from genomic FASTA file, GFF, and VCF to see if the variants change the protein sequence. The big challenge in this problem is integrating three sources of information that don’t all use the same unique identifying information.
- The FASTA file has the actual sequence (which can be put in the form of a big string).
- The GFF CDS records have the coordinates of proteins.
- The VCF has the location and sequence of SNPs or indels.
Get your GFF and VCF streamlined and passed to the script
Here’s what we need to do — in words:
- filter GFF and VCF with UNIX and bedtools commands to streamline
- grep for CDS
- get variants that intersect CDS with bedtools
This part of the work can easily be done in a shell script external to the python script, as long as your python script will take input. This is what the sys.argv and getopt modules introduced in class will do.
Here is a very bare-bones script that gets the filenames as usable arguments from the command line with sys.argv():
#! /usr/bin/env python
import sys
print "Name of the script: " + sys.argv[0] print "Number of arguments: " + str(len(sys.argv)) print "Name of the FASTA file: " + sys.argv[1] print "Name of the GFF file: " + sys.argv[2] print "Name of the VCF file: " + sys.argv[3] print "Name of the genetic code file: " + sys.argv[4]
Here is the command line used:
vcftranslate.py NC_007898.fasta NC_007898.gff BC21_BINF6350_Summer2014_13pm.vcf bacterialcode.txt
And here is what it prints.
Name of the script: /Users/cgibas/scripts/vcftranslate.py
Number of arguments: 5
Name of the FASTA file: NC_007898.fasta
Name of the GFF file: NC_007898.gff
Name of the VCF file: BC21_BINF6350_Summer2014_13pm.vcf
Name of the genetic code file: bacterialcode.txt
And here is a shell script that will preprocess a set of input files you give it and pass the desired outputs to your python script:
#! /bin/bash # Program expects two inputs: 1 - genome basename (FASTA and GFF) and 2 - VCF filename
#get only the coding sequence records from the gff grep CDS $1.gff > $1.cds.gff #get the SNPs/indels that intersect a coding region from the vcf bedtools intersect -a $2 -b $1.cds.gff > $1.cds.vcf #now your inputs are as streamlined as they can get -- feed them to your python script vcftranslate.py $1.fasta $1.cds.gff $1.cds.vcf bacterialcode.txt
This is super basic. There are ways you could make your python script more foolproof — check for the correct number of arguments, check whether the filenames passed as arguments have the correct file extension, etc. You could also use the getopt module to get options with flags so that the filenames don’t have to be put in an exact order.
Merge your GFF and FASTA into a searchable data structure
The GFF has the coordinates, the FASTA has the sequence, and the VCF has the SNP and location
It would probably be easiest if we could combine the basic information in the GFF and the FASTA before we try to merge with the VCF too. How about a dictionary with an identifier from the GFF as a key, and the coding sequence as the associated value?
- During lab, we realized that we needed to do this with a key made up of coordinates in a tuple, because the sequence identifiers alone aren’t actually unique.
- The namedtuple data structure looked like it might be useful for this purpose.
Here’s a snippet of code that puts the critical fields of the GFF and the coding sequence snipped from the FASTA into a dictionary with namedtuples as keys:
gffin = raw_input("What is your GFF file? ") origseqDict = {} with open(gffin) as gff: for first, last, strand, name in read_gff(gff): cds = namedtuple("cds", ["start", "end", "dir", "id"]) modfirst = int(first) - 1 modlast = int(last) - 1 coding = sequence[modfirst:modlast] origseqDict[cds(start=first, end=last, dir=strand, id=name)] = coding print origseqDict
And here is an example of what that print statement prints out:
cds(start='118561', end='118806', dir='-', id='YP_008563140.1'): 'TCAATAAGCTAGACCCATACTGCGAGTTGTTTCATGCCATAAATAAACTCGGACACTCAAGAAATCTGTTGGACAGGCGGATTCACATCTCTTACAACCAACACAGTCCTCTGTTCGTGGAGCAGAAGCAATTTGTTTAGCCTTACAACCGTCCCAAGGTATCATTTCTAATACATCAGTGGGGCAGGCTCGGACACATTGAGTACATCCTATACACGTATCATAAATCTTTACTGAATGTGACA', cds(start='142153', end='142620', dir='+', id='YP_008563148.1'): 'ATGTCACGTCGAGGTACTGCAGAAAAAAAAACAGCAAAATCCGATCCAATTTATCGTAATCGATTAGTTAACATGTTGGTTAACCGTATTCTGAAACACGGAAAAAAATCATTGGCTTATCAAATTATCTATCGAGCCGTGAAAAAGATTCAACAAAAGACAGAAACAAATCCACTATCCGTTTTACGTCAAGCAATACGTGGAGTAACTCCCGATATAACAGTAAAAGCAAGACGTGTAGGTGGATCGACTCATCAAGTTCCCATTGAAATAGGATCCACACAAGGAAAAGCACTTGCCATTCGTTGGTTATTAGCGGCATCCCGAAAACGTCCGGGTCGAAATATGGCTTTCAAATTAAGTTCCGAATTAGTGGATGCTGCCAAAGGGAGTGGCGATGCCATACGCAAAAAGGAAGAGACTCATAGAATGGCAGAGGCAAATAGAGCTTTTGCACATTTTCGTTA', cds(start='76066', end='76170', dir='+', id='YP_008563115.1'): 'ATGGAAGCATTGGTTTATACATTCCTCTTAGTCTCGACTCTAGGGATAATTTTTTTCGCTATCTTTTTTCGAGAACCGCCTACAATTCGAACTAAAAAGAACTA'
Now how do you get things back out of this fancy data structure? Well. Consider that what you’re going to try to match to the dictionary keys is a SNP coordinate which will fall somewhere between the ‘start’ and ‘end’ values. So you’re going to want to iterate over the VCF file, and each time you read a line out of the VCF file, you’re going to want to iterate over your dictionary and see if the value of ‘position’ in the SNP file matches any row in your dictionary.
Here’s a code snippet that tests to see if an arbitrary SNP position falls inside any of the coordinate ranges in our crazy tuple/dict hybrid keys:
testcoord = 143000 for key in sorted(origseqDict.keys()): if int(key.start) <= testcoord and int(key.end) >= testcoord: print key else: pass
And here’s what it prints out. I picked a testcoord value that fell within one of my CDS regions.
cds(start='142906', end='143682', dir='+', id='YP_008563149.1')
Obviously, you could do the same with the values that you’re pulling from your VCF file one at a time.
Check individual records in the VCF to find which sequence they match, and alter the sequence
- loop through VCF
- split VCF line getting SNP position, original seq and alt strings
- use a list comprehension to loop through GFF list getting coords and identifier from link
- subtract snppos – genestart distance into the gene sequence
- get your sequence from sequence dictionary
- sequence[0:distancein], [distancein + 1:len(orig)], rest of sequence
- split sequence stuff before the SNP, orig SNP sequence, stuff after the snp
- save sequence as stuff before the SNP, alt SNP sequence, stuff after the SNP
Translate it and show the user if the SNP makes a difference
- translate your new sequence!
- your script could notice stop codons
- compare to the old sequence
- old: PROTIENPROTEINPROTEINPROTEIN
- new: PROTEINPROTEINPROTEINPROTEIN
I’ll probably add some hints to this blog post later. Be prepared to work on your script in class tomorrow.