BINF 2111: Protein translation
Over the last few days, we’ve learned several pieces of Python that add up to a script that translates DNA into protein. Today’s lab is a bit of a test: can you take what you’ve learned piece by piece and combine it to solve a problem?
Today we wrote a script that turns a genetic code table downloaded from NCBI:
AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M---------------M------------MMMM---------------M------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
into lists of codons and amino acids:
#! /usr/bin/env python

# Read the genetic code table into a list of lines
with open("bacterialcode.txt") as aacode:
    code = aacode.readlines()

AAs = []
Base1 = []
Base2 = []
Base3 = []

# Each table line looks like "Name = characters"; split() gives
# [name, "=", characters], so the characters are at index 2
for line in code:
    fields = line.split()
    if len(fields) < 3:
        continue  # skip blank or incomplete lines
    if fields[0] == "AAs":
        for char in fields[2]:
            AAs.append(char)
    elif fields[0] == "Base1":
        for char in fields[2]:
            Base1.append(char)
    elif fields[0] == "Base2":
        for char in fields[2]:
            Base2.append(char)
    elif fields[0] == "Base3":
        for char in fields[2]:
            Base3.append(char)

# Build each codon from the bases at the same position in the three lists
codon = []
for i in range(0, len(Base1)):
    thiscodon = Base1[i] + Base2[i] + Base3[i]
    codon.append(thiscodon)
We learned how to zip those lists into a dictionary — here it is shown with the English to Spanish example:
english = ["one", "two", "three", "four", "five"]
spanish = ["uno", "dos", "tres", "cuatro", "cinco"]
eng2sp = dict(zip(english, spanish))
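The same pattern applies to the codon and amino acid lists built by the script above. Here is a minimal sketch, using short hand-typed lists in place of the full 64-entry lists. The names codons and codon_table are only illustrative and are not from the class script.

```python
# Stand-ins for the first four entries of the codon and amino acid lists
codons = ["TTT", "TTC", "TTA", "TTG"]
AAs = ["F", "F", "L", "L"]

# zip pairs up items by position; dict turns each pair into a key:value entry
codon_table = dict(zip(codons, AAs))

print(codon_table["TTA"])   # prints L
```

Because zip pairs items by position, the two lists have to be in the same order, which is how the NCBI table is laid out.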
And how to retrieve values based on keys — here’s one way:
print(counts.get('TGG'))
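It may help to see what .get does when the key is missing, since a real sequence can contain codons that are not in your dictionary (for example, codons containing N). This is a small sketch in which counts is a toy dictionary with made-up values.

```python
# A toy dictionary keyed by codon (the values are made up)
counts = {"TGG": 3, "ATG": 1}

print(counts.get("TGG"))      # key present: prints 3
print(counts.get("NNN"))      # key absent: prints None instead of raising KeyError
print(counts.get("NNN", 0))   # key absent: prints the default supplied, 0
```

Writing counts["NNN"] with square brackets would instead stop the script with a KeyError.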
We know how to read a big FASTA nucleotide file into a single sequence:
def get_join_fasta(dnafile):
    # get fasta file and join sequence lines
    with open(dnafile) as DNA:
        lines = DNA.readlines()
    sequence = []
    for i in range(0, len(lines)):
        # skip header lines, which start with ">"
        if lines[i][0:1] != ">":
            sequence.append(lines[i].strip("\n"))
    sequence = ''.join(sequence)
    return sequence
And to get ranges out of a string with slices:
seqslice = sequence[a:b]
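One detail to watch when the slice positions come from a GFF file: GFF start and end positions count from 1 and include both ends, while Python slices count from 0 and leave out the end index. Here is a small sketch of the conversion, using a made-up sequence and made-up coordinates rather than values from NC_007898.gff.

```python
sequence = "AAATGGCCTAAGG"

# Hypothetical feature coordinates in GFF style (1-based, end included)
gff_start = 3
gff_end = 11

# Subtract 1 from the start; the end needs no change because
# the slice stops just before index gff_end
seqslice = sequence[gff_start - 1:gff_end]

print(seqslice)        # prints ATGGCCTAA
print(len(seqslice))   # prints 9, which equals gff_end - gff_start + 1
```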
And we know how to read a sequence three characters at a time:
last_codon_start = len(dna) - 2
for start in range(0, last_codon_start, 3):
    codon = dna[start:start + 3]
    print(codon)
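Combining the codon loop with a dictionary lookup gives the core of translation. This is only a sketch. toy_table is a hand-typed, three-entry stand-in for the full codon dictionary, and using "X" for unknown codons is an assumption, not something we settled in class.

```python
# Tiny stand-in for the full 64-codon dictionary
toy_table = {"ATG": "M", "GCC": "A", "TAA": "*"}

dna = "ATGGCCTAA"
protein = []

last_codon_start = len(dna) - 2
for start in range(0, last_codon_start, 3):
    codon = dna[start:start + 3]
    # .get returns "X" if the codon is not in the table
    protein.append(toy_table.get(codon, "X"))

protein = "".join(protein)
print(protein)   # prints MA*
```

The * is how the NCBI table marks a stop codon, so you will need to decide whether to keep it in the protein you write out.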
These are your basic ingredients. With files NC_007898.fasta and NC_007898.gff as input, write a script that slices out the regions of the nucleotide sequence that code for proteins (CDS regions), translates the DNA into proteins, and writes the proteins to a FASTA-format file. Get the basic problem solved this week; then we’ll work out how to structure the script with functions and investigate whether generators can improve it.
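The one ingredient not shown above is pulling the CDS coordinates out of the GFF file. GFF feature lines have nine tab-separated columns, with the feature type in column 3, the start and end in columns 4 and 5, and the strand in column 7. Here is one possible sketch, run on made-up lines rather than the real NC_007898.gff. The names gff_lines and cds_regions are only illustrative.

```python
# Made-up GFF lines (tab-separated); real files also have "#" comment lines
gff_lines = [
    "##gff-version 3",
    "seq1\texample\tgene\t3\t11\t.\t+\t.\tID=gene1",
    "seq1\texample\tCDS\t3\t11\t.\t+\t0\tID=cds1",
]

cds_regions = []
for line in gff_lines:
    if line.startswith("#"):
        continue  # skip header and comment lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        continue  # not a complete feature line
    # Column 3 is the feature type; columns 4 and 5 are start and end
    if fields[2] == "CDS":
        start = int(fields[3])
        end = int(fields[4])
        strand = fields[6]
        cds_regions.append((start, end, strand))

print(cds_regions)   # prints [(3, 11, '+')]
```

The strand is kept because a feature on the "-" strand would need to be reverse-complemented before translation, which this sketch does not attempt.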