BINF 2111: Protein translation

Over the last few days, we’ve learned several pieces of Python that add up to a script that translates DNA into protein. Today’s lab is a bit of a test: can you take what you’ve learned piece by piece and combine it to solve a problem?

Today we wrote a script that turns a genetic code table downloaded from NCBI:

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M---------------M------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

into lists of codons and amino acids:

#! /usr/bin/env python

with open("bacterialcode.txt") as aacode:
	code = aacode.readlines()
	
	AAs = []
	Base1 = []
	Base2 = []
	Base3 = []
	for line in code:
		if not line.strip():
			continue	# skip blank lines so line.split()[0] cannot raise IndexError
		if line.split()[0] == "AAs":
			for char in line.split()[2]:
				AAs.append(char)
		elif line.split()[0] == "Base1":	
			for char in line.split()[2]:
				Base1.append(char)
		elif line.split()[0] == "Base2":	
			for char in line.split()[2]:
				Base2.append(char)
		elif line.split()[0] == "Base3":	
			for char in line.split()[2]:
				Base3.append(char)
		else:
			pass
			
	codon = []
	for i in range(0,len(Base1)):
		thiscodon = Base1[i] + Base2[i] + Base3[i]
		codon.append(thiscodon)

We learned how to zip two lists into a dictionary. Here it is with the English-to-Spanish example:

english = ["one","two","three","four","five"]
spanish = ["uno","dos","tres","cuatro","cinco"]
eng2sp = dict(zip(english,spanish))
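The same pattern carries over to the genetic code table. The following is a minimal sketch, assuming the AAs and Base lines of the table shown earlier are already available as strings; the names aas, base1, base2, base3, codons and codon_table are illustrative and not part of the lab script.

```python
# Strings copied from the genetic code table shown earlier.
aas   = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
base1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
base2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
base3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"

# Build the 64 codons position by position, then pair each with its amino acid.
codons = [b1 + b2 + b3 for b1, b2, b3 in zip(base1, base2, base3)]
codon_table = dict(zip(codons, aas))

print(codon_table["ATG"])   # M
print(codon_table["TAA"])   # * (stop)
print(len(codon_table))     # 64
```

Because zip pairs items by position, the codon list and the amino acid string must be in the same order, which is how the NCBI table is laid out.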

And how to retrieve values based on keys. Here’s one way:

print(counts.get('TGG'))
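It may help to see how get() differs from square-bracket lookup when a key is missing. This is a small sketch using a made-up three-entry dictionary (codon_table is a stand-in name, not the lab's full table):

```python
# A tiny stand-in dictionary; the real one would hold all 64 codons.
codon_table = {"TGG": "W", "ATG": "M", "TAA": "*"}

print(codon_table.get("TGG"))        # W
print(codon_table.get("NNN"))        # None, because the key is missing
print(codon_table.get("NNN", "X"))   # X, the fallback given as the second argument

try:
    codon_table["NNN"]
except KeyError:
    print("square brackets raise KeyError for a missing key")
```

The fallback form is handy for sequences that contain ambiguous bases such as N, since the script can keep going instead of stopping with an error.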

We know how to read a big FASTA nucleotide file into a single sequence:

def get_join_fasta(dnafile):
	# get fasta file and join sequence lines
	with open(dnafile) as DNA:
		lines = DNA.readlines()
		sequence = []
		for i in range(0, len(lines)):
			# skip header lines, which start with ">"
			if lines[i][0:1] != ">":
				sequence.append(lines[i].strip("\n"))

	sequence = ''.join(sequence)
	return sequence

And to get ranges out of a string with slices:

seqslice = sequence[a:b]
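One detail matters when a and b come from a GFF file: Python slices count from zero and stop just before the end index, whereas GFF start and end positions are 1-based and include both ends. A small sketch of the conversion, using a made-up sequence and made-up coordinates (slice_feature is a hypothetical helper name):

```python
def slice_feature(sequence, start, end):
    """Return the bases at 1-based, inclusive positions start..end."""
    # Subtract 1 from start to convert it to a 0-based index; end needs no
    # change because a slice already stops one short of its end index.
    return sequence[start - 1:end]

sequence = "AAATGGCCTAAGG"
print(slice_feature(sequence, 3, 11))   # ATGGCCTAA
```

A quick sanity check is that the slice length should equal end - start + 1.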

And we know how to read a sequence three characters at a time:

last_codon_start = len(dna) - 2

for start in range(0, last_codon_start, 3):
	codon = dna[start:start + 3]
	print(codon)
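Combining this loop with a dictionary lookup gives one possible shape for the translation step. This is only a sketch under stated assumptions: the function name translate, the "X" placeholder for unknown codons, and the four-entry table are all illustrative choices, and a real run needs the full 64-codon table.

```python
def translate(dna, codon_table):
    """Translate dna codon by codon; codons missing from the table become 'X'."""
    protein = []
    # Stopping at len(dna) - 2 means a trailing partial codon is never read.
    for start in range(0, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        protein.append(codon_table.get(codon, "X"))
    return "".join(protein)

# Partial table for illustration only.
codon_table = {"ATG": "M", "GCC": "A", "TGG": "W", "TAA": "*"}
print(translate("ATGGCCTGGTAAGG", codon_table))   # MAW*
```

The two leftover bases at the end of the example sequence are ignored because they do not make a complete codon.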

These are your basic ingredients. With the files NC_007898.fasta and NC_007898.gff as input, write a script that slices out the regions of the nucleotide sequence that code for proteins (CDS regions), translates the DNA into protein, and writes the proteins to a FASTA-format file. Get the basic problem solved this week; then we’ll work out how to structure the script with functions and investigate whether generators can improve it.
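As a starting point for the GFF side of the problem, here is a hedged sketch of pulling CDS coordinates out of GFF lines. It relies on the standard GFF layout of nine tab-separated columns, with the feature type in column 3, start and end in columns 4 and 5, and strand in column 7. The function names cds_regions and reverse_complement are hypothetical, and the example rows are made up rather than taken from NC_007898.gff.

```python
def cds_regions(gff_lines):
    """Collect (start, end, strand) for every CDS row in GFF-formatted lines."""
    regions = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "CDS":
            regions.append((int(fields[3]), int(fields[4]), fields[6]))
    return regions

def reverse_complement(dna):
    """Reverse-complement a DNA string; bases other than A, C, G, T become N."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs.get(base, "N") for base in reversed(dna))

# Made-up rows for illustration only.
example = [
    "##gff-version 3\n",
    "seq1\texample\tgene\t3\t11\t.\t+\t.\tID=gene1\n",
    "seq1\texample\tCDS\t3\t11\t.\t+\t0\tID=cds1\n",
    "seq1\texample\tCDS\t20\t40\t.\t-\t0\tID=cds2\n",
]
print(cds_regions(example))        # [(3, 11, '+'), (20, 40, '-')]
print(reverse_complement("ATGC"))  # GCAT
```

A CDS on the "-" strand is read from the opposite strand, so its slice would need to be reverse-complemented before translation. The sketch also treats each CDS row as one contiguous region, which may not hold for genes split across several rows.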
