BINF 2111: Opening and writing files in python
Last week, we covered conditionals and comparators in python, and this morning we used them to test for sequence length, GC content, and presence of an “adapter”:
DNASeq = raw_input("Enter a DNA Sequence: ")
ccount = DNASeq.count(“C”)
length = len(DNASeq)
cfrac = float(ccount) / length
if cfrac < 0.25:
____print (“C count is abnormally low!”)
See the slides for the rest of the examples if you need them.
Today, we saw how to find the location of a pattern in a sequence and slice out only the part that you want:
# Get input from the user
DNASeq = raw_input("Enter a DNA sequence: ")
# Calculate the length
length = len(DNASeq)
if len(DNASeq) >= 50 and DNASeq[0:10].count(“TACG”) >= 1:
elif len(DNASeq) >= 50 and DNASeq[10:len(DNASeq).count(“TACG”) >= 1:
____ffrag1_len = DNASeq.find(“TACG”)
____ffrag2_len = len(DNASeq) – ffrag1_len
We also saw how to open a file for reading and to assign its content to a variable that you can then work with as usual:
DNA_file = open("seq1.txt")
DNASeq = DNA_file.read()
And we saw how to open a file for writing and write to the file:
fasta_file = open("seq1.fasta”, “w”)
fasta_file.write(“>” + description + ‘\n’ + DNASeq)
And how to close files:
Another way to open and use a file which does not require a close — this is Christina’s favorite method so she can show you how to use it if there’s time:
with open("x.txt") as f:
____data = f.read()
____do something with data
Assignment: write a script that does the following:
- Takes a filename as user input, checks for the existence of the file, and opens it.
- Checks the sequence to make sure it is between 50 and 200 nucleotides long and has a GC content between 40 and 60 percent.
- Checks the sequence to see if it contains the adapter in the forward or reverse direction.
- If the adapter is within the first 10 characters, writes the entire sequence to a FASTA file.
- If the adapter is elsewhere, finds the position of the adapter in the sequence and gets only the sequence to the right of the adapter (regardless of which direction it’s in).
- Writes the desired portion of the sequence to a FASTA file.
- If there is no adapter, discards the sequence and does not write an output file.
The output file you write should be in the FASTA format. Give it a name that has the same basename as the input sequence and a .fasta file extension. To split the original filename at the . and get the basename, use the split method. Split prints out a list of substrings that you get when you split on a particular character. We’ll get more into lists next week, but you can follow the example below and assign your split list to variables. To see how that works, try:
>>> a,b = DNA_file_name.split(".")
>>> print a
>>> print b
Your FASTA file header line should contain the adapter position, GC count, and original source file name.