BINF 2111: Know your UNIX, part 2 (Homework)
For your homework (due next Thursday), I want you to go all the way to the end of the book to Chapter 16 and learn the following commands. I mentioned “cut” and “paste” in class and we will use them in our script tomorrow, but there are a few other useful ones in Ch. 16 as well:
- cut
- paste
- sort
- uniq
- Also do the advanced grep exercises
- Also review head…
- …and tail
These exercises are on page 299-308 in the book.
In your sandbox, save the files you make by following the tutorials, and save a shell history of the commands you used for each section (use the “history” command — check its manual for options — and your redirection skills). That is what you will turn in.
Get coding sequences out of a GenBank file and find specific sequences with grep
In addition to the exercises in Chapter 16, try grep with -A num, which prints the match along with a number of lines of additional context. Consider how this could be useful if you are working with sequence files — if you were searching a FASTA file for matches in the description lines, how many additional lines would you need to follow with to get the DNA sequence belonging to that description line?
You can turn the GenBank files that you downloaded, above, into FASTA protein files to test your grep-fu on. You need a little conversion utility program to change that format.
This will let you practice installing UNIX software and customizing a python script to run on your machine. You’ll also see a little bit of awk which is a UNIX built in mini-language like sed.
- Download genbank_to_fasta and unzip it
- Type “which python” to make sure you’ve got a working python installation
- Type “pip install biopython” to get biopython installed if it isn’t already there
- Edit the very top line of the genbank_to_fasta.py script and put the path to your python in there (what you saw when you typed “which python”)
- Move genbank_to_fasta.py to your ~/scripts directory
- Type ls -alf to see if your genbank_to_fasta.py script is executable. If it is, it will have a little * after the filename.
- Type genbank_to_fasta.py -h to see the help for the program
Run the program. An example command line, once you’ve got everything set up correctly, would look like this:
genbank_to_fasta.py -i AE016796.2.gbk -o AE016796.2.aa.fasta -s aa
Now in the files you have created, try looking for proteins whose function has something to do with potassium:
grep "Potassium" AE016795.3.aa.fasta
Then try to get the sequence to come along with the header, using grep with the -A flag. Does it work? What do you get?
grep -A 1 "Potassium" AE016795.3.aa.fasta
(The short answer’s going to be no.) But why not? Isn’t the format of a FASTA file one line of description followed by one line of sequence? Well…not exactly. So we’ll have to learn some additional tricks to pull the whole sequence along with the description that we want.
Meet bioawk
One way to handle this problem is to install the UNIX utility bioawk on your computer. In UNIX, awk is another whole scripting language that is available to you for file manipulations. awk commands can be executed at the command line, inserted into shell scripts, or made into awk scripts that use the awk program as the interpreter rather than the shell. awk is massively powerful and you can self-study it to gain some really useful file manipulation superpowers. However we can’t learn much of it here. To make things even more complicated, the bioinformatics community has developed bioawk which is an add-on to awk that has some specifically defined functions for dealing with common biological data files. Fortunately you don’t have to know everything about it to do a couple of useful tricks.
Bioawk is already installed on the lab computers. To give you a flavor of what you can do with it, take the FASTA aa input file that you just tried to process using grep -A. To select sequences out of this file based on length, in this case pulling out only the protein sequences that are longer than 500aa, type:
bioawk -c fastx '{ if(length($seq) > 500) { print ">"$name; print $seq }}' AE016795.3.aa.fasta
Of course you could redirect this into a file by adding > yourfile.aa at the end of the command line. You could do this to a single-end FASTQ file, such as your Ion Torrent files, to apply hard sequence length filters.
Why does this work? Because bioawk stores information from specific sequence types as variables automagically, and instead of trying to work around a FASTA with variable numbers of sequence lines using grep -A, you just read back the $seq variable (however long the sequences is) and you get the complete sequence. You can see what types of data bioawk is ready for by typing:
bioawk -c fasta
Since fasta is not an allowed type, it will spit back the allowed bio file types and the variables they get broken down into on standard output.
If you want to search on the pattern “Potassium” and pull sequences back with an awk one-liner, you’re out of luck because of how bioawk stores the $name and $seq variables. BUT, if you know the actual sequence identifier you’re looking for, like VV1_3124, you could grab that sequence this way.
bioawk -c fastx '{ if(match($name, /VV1_3124/) > 0) { print ">"$name; print $seq }}' AE016795.3.aa.fasta > bioawktest.out
And then if you had a list of those identifiers (perhaps generated by grepping for “Potassium” in the original file, stripping the identifier out of the line, and using it as input to the next command) you could get the sequences you want with a script.