Last week, we covered conditionals and comparators in python, and this morning we used them to test for sequence length, GC content, and presence of an “adapter”:
DNASeq = raw_input("Enter a DNA Sequence: ")
ccount = DNASeq.count(“C”)
length = len(DNASeq)
cfrac = float(ccount) / length
if cfrac < 0.25:
____print (“C count is abnormally low!”)
See the slides for the rest of the examples if you need them.
Today, we saw how to find the location of a pattern in a sequence and slice out only the part that you want:
# Get input from the user
DNASeq = raw_input("Enter a DNA sequence: ")
# Calculate the length
length = len(DNASeq)
if len(DNASeq) >= 50 and DNASeq[0:10].count(“TACG”) >= 1:
elif len(DNASeq) >= 50 and DNASeq[10:len(DNASeq).count(“TACG”) >= 1:
____ffrag1_len = DNASeq.find(“TACG”)
____ffrag2_len = len(DNASeq) – ffrag1_len
We also saw how to open a file for reading and to assign its content to a variable that you can then work with as usual:
DNA_file = open("seq1.txt")
DNASeq = DNA_file.read()
And we saw how to open a file for writing and write to the file:
fasta_file = open("seq1.fasta”, “w”)
fasta_file.write(“>” + description + ‘\n’ + DNASeq)
And how to close files:
Another way to open and use a file which does not require a close — this is Christina’s favorite method so she can show you how to use it if there’s time:
with open("x.txt") as f:
____data = f.read()
____do something with data
Assignment: write a script that does the following:
- Takes a filename as user input, checks for the existence of the file, and opens it.
- Checks the sequence to make sure it is between 50 and 200 nucleotides long and has a GC content between 40 and 60 percent.
- Checks the sequence to see if it contains the adapter in the forward or reverse direction.
- If the adapter is within the first 10 characters, writes the entire sequence to a FASTA file.
- If the adapter is elsewhere, finds the position of the adapter in the sequence and gets only the sequence to the right of the adapter (regardless of which direction it’s in).
- Writes the desired portion of the sequence to a FASTA file.
- If there is no adapter, discards the sequence and does not write an output file.
The output file you write should be in the FASTA format. Give it a name that has the same basename as the input sequence and a .fasta file extension. To split the original filename at the . and get the basename, use the split method. Split prints out a list of substrings that you get when you split on a particular character. We’ll get more into lists next week, but you can follow the example below and assign your split list to variables. To see how that works, try:
>>> a,b = DNA_file_name.split(".")
>>> print a
>>> print b
Your FASTA file header line should contain the adapter position, GC count, and original source file name.