BINF 6215: Know your UNIX
Before we move on to scripting, we’ve got to get a handle on the rest of the basic UNIX commands covered in Chapters 4 and 5 of the book.
What you already know
- On the first day of class, we took a look at commands for figuring out where you are, setting up your shell environment, and moving files.
- This morning, we looked at the simplest way to “automate” a repetitive task in UNIX — by using wildcards in commands and directing command output to files.
If you don’t remember the navigation commands pwd, ls, and cd, practice them again. Review the section on absolute and relative paths and navigation (p. 51-55) and practice moving to and listing the contents of different directories using both absolute and relative paths.
What’s next
Start up a shell.
- In your home directory, make a directory called scripts. The command to do this is “mkdir” (p. 56).
- Add that directory to your PATH in your .bash_profile, following the same procedure that we used to modify the .bash_profile on day 1.
- If you don’t already have a directory in your Documents directory called 6215-exercises, make one.
- Download the example files from Practical Computing for Biologists. The cheat sheets are also pretty useful. Expand the archive and move it to your 6215-exercises directory. Inside your pcfb folder, you’ll have a directory called “sandbox” — when you’re tinkering with creating files and directories, or downloading files from the web, this is a good place to do it.
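The first two steps above might look like this (the export line is the same one you’d add to your .bash_profile to make the change permanent):

```shell
# Step 1: create the scripts directory in your home directory
# (-p makes it quietly succeed if it already exists)
mkdir -p "$HOME/scripts"
# Step 2: put it on the search path for the current session; adding this
# same export line to ~/.bash_profile makes it stick for future shells
export PATH="$PATH:$HOME/scripts"
```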
Command line shortcuts
- Review the “up arrow” shortcut on p. 59
- Review the “tab” shortcut for filename completion on p. 60
- Review “man” and “less” commands on p. 62-63
You’ve now seen and tried out every command in Ch. 4.
cat
The “cat” command streams the contents of a file to standard output. It’s not particularly useful for viewing files unless they are small, but if you want to merge a group of text files into one, or direct a file into a standard output stream for other reasons (say, to pipe into a command that does not take filenames as arguments, which is uncommon but not unheard of), it’s very useful. Work through the practice exercises with “cat” in the book (p. 70-72).
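Here’s a quick way to see the merging behavior for yourself, using a couple of throwaway files (made-up names; your sandbox is a good place to try it):

```shell
# two tiny throwaway files
printf "first file\n" > part1.txt
printf "second file\n" > part2.txt
# cat streams both files to standard output, in order;
# the redirection captures the merged result in a new file
cat part1.txt part2.txt > merged.txt
```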
curl
The “curl” command streams the content of a URL to standard output, or to a file if one is specified. In short, it lets you fetch web content without a web browser. First, work through the exercises in the book (p. 78-81). Then we’ll learn something more useful.
Taking advantage of web services with curl and wget
curl is great for things like downloading a zipped software archive directly at the command line, instead of going through the web browser. But it becomes especially useful when combined with web services that make data available through a URL interface. NCBI and EMBL-EBI (the European counterpart of NCBI) make a huge number of services available under various web services protocols: not only database pulls and sequence searches, but increasingly complex kinds of analysis. The Taverna workflow system, which we’ll look at briefly if we get a chance, can chain together web services pulled from EMBL-EBI and elsewhere.
The simplest class of services to understand is the URL API access to the nucleotide databases.
If you know the GenBank or INSDC sequence identifier for the sequence you want, you can see it in your web browser using a URL like this one:
http://www.ebi.ac.uk/ena/data/view/AE016795.3&display=fasta
You should see a FASTA nucleotide format file for Chromosome 1 of Vibrio vulnificus CMCP6. Chromosome 2 has the GenBank identifier AE016796.2 — you can grab it from EMBL using this curl command:
curl -s "http://www.ebi.ac.uk/ena/data/view/AE016796.2&display=fasta" > AE016796.2.fasta
Use the “man” command to find out what the command line flag used here means.
Get wget
An alternative to the curl command is the wget command. You might not have wget installed on your computer, but you should be able to install it using the Homebrew Package Manager, which is a thing you want to know about. Type:
brew install wget
And see what happens. Assuming you got a message that wget either was already installed or installed correctly, give the command:
wget -q -O - "http://www.ebi.ac.uk/ena/data/view/AE016795.3&display=fasta" > AE016795.3.fasta
Check the downloaded file and make sure it’s still the right sequence.
Pulling data from NCBI
Similarly, if you want to get data from NCBI, you could type:
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AE016795.3&rettype=fasta&retmode=txt" > AE016795.3.fasta
or, to get the file in GenBank format with annotations, you can type:
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AE016795.3&rettype=gb&retmode=txt" > AE016795.3.fasta
NCBI’s efetch functionality is documented (to a certain extent) in one of the NCBI online books.
Use curl or wget to get the GenBank files for both chromosomes of Vibrio vulnificus CMCP6. The GenBank identifiers are AE016795.3 and AE016796.2.
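One way to script this (a sketch that simply loops the same efetch URL used above over the two accession numbers):

```shell
# fetch the annotated GenBank-format record for each chromosome,
# saving each one under its accession number with a .gbk extension
for acc in AE016795.3 AE016796.2; do
  curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acc}&rettype=gb&retmode=txt" > "${acc}.gbk"
done
```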
grep (and egrep, fgrep, zgrep, zegrep, zfgrep)
First, work through the exercises in the book on grep (p. 72-78).
In addition to the exercises in Chapter 5, try grep with -A num, which prints each match along with num lines of additional context. Consider how this could be useful when you are working with sequence files: if you were searching a FASTA file for matches in the description lines, how many additional lines would you need to pull along to get the DNA sequence belonging to each description line?
You can turn the GenBank files that you downloaded above into FASTA protein files to test your grep-fu on. You’ll need a little conversion utility to change the format.
- Download genbank_to_fasta and unzip it
- Type “which python” to make sure you’ve got a working python installation
- Type “pip install biopython” to get biopython installed if it isn’t already there
- Edit the very top line of the genbank_to_fasta.py script and put the path to your python in there (what you saw when you typed “which python”)
- Move genbank_to_fasta.py to your ~/scripts directory
- Type ls -alF to see if your genbank_to_fasta.py script is executable. If it is, it will have a little * after the filename. If it isn’t, make it executable with “chmod +x genbank_to_fasta.py”.
- Type genbank_to_fasta.py -h to see the help for the program
Run the program. An example command line, once you’ve got everything set up correctly, would look like this:
genbank_to_fasta.py -i AE016796.2.gbk -o AE016796.2.aa.fasta -s aa
Now in the files you have created, try looking for proteins whose function has something to do with potassium:
grep "Potassium" AE016795.3.aa.fasta
Then try to get the sequence to come along with the header, using grep with the -A flag. Does it work? What do you get?
grep -A 1 "Potassium" AE016795.3.aa.fasta
(The short answer’s going to be no.) But why not? Isn’t the format of a FASTA file one line of description followed by one line of sequence? Well…not exactly. So we’ll have to learn some additional tricks to pull the whole sequence along with the description that we want.
Meet bioawk
One way to handle this problem is to install the UNIX utility bioawk on your computer. In UNIX, awk is a whole separate scripting language available to you for file manipulations. awk commands can be executed at the command line, inserted into shell scripts, or made into awk scripts that use the awk program as the interpreter rather than the shell. awk is massively powerful, and you can self-study it to gain some really useful file manipulation superpowers; however, we can’t learn much of it here. To make things even more complicated, the bioinformatics community has developed bioawk, an add-on to awk with some specifically defined functions for dealing with common biological data files. Fortunately, you don’t have to know everything about it to do a couple of useful tricks.
To install bioawk, clone the source code, build it, and copy the finished program into a directory on your PATH (you may need sudo for that last step):
git clone https://github.com/lh3/bioawk
cd bioawk
make
cp bioawk /usr/local/bin
To give you a flavor of what you can do with it, take the FASTA aa input file that you just tried to process using grep -A. To select sequences out of this file based on length, in this case pulling out only the protein sequences that are longer than 500aa, type:
bioawk -c fastx '{ if(length($seq) > 500) { print ">"$name; print $seq }}' AE016795.3.aa.fasta
Of course you could redirect this into a file by adding > yourfile.aa at the end of the command line. You could do this to a single-end FASTQ file, such as your Ion Torrent files, to apply hard sequence length filters.
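Here’s a minimal sketch of that hard length filter on FASTQ data, using a tiny made-up two-read file and a 10 bp cutoff (for real data you’d swap in your own file name and cutoff):

```shell
# a throwaway FASTQ file: one 4 bp read and one 20 bp read
printf "@short\nACGT\n+\nIIII\n@long\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n" > toy.fastq
# keep only reads of at least 10 bp; for FASTQ you print all four record
# lines back out ($qual holds the quality string), so the output is valid FASTQ
bioawk -c fastx '{ if (length($seq) >= 10) { print "@"$name; print $seq; print "+"; print $qual } }' toy.fastq > toy.filtered.fastq
```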
Why does this work? Because bioawk automagically stores information from specific sequence types as variables, so instead of trying to work around a FASTA file with variable numbers of sequence lines using grep -A, you just read back the $seq variable (however long the sequence is) and you get the complete sequence. You can see what types of data bioawk is ready for by typing:
bioawk -c fasta
Since fasta is not an allowed type, it will spit back a list of the allowed bio file types and the variables each one gets broken down into.
If you want to search on the pattern “Potassium” and pull sequences back with an awk one-liner, you’re out of luck because of how bioawk stores the $name and $seq variables. BUT, if you know the actual sequence identifier you’re looking for, like VV1_3124, you could grab that sequence this way:
bioawk -c fastx '{ if(match($name, /VV1_3124/) > 0) { print ">"$name; print $seq }}' AE016795.3.aa.fasta > bioawktest.out
And then if you had a list of those identifiers (perhaps generated by grepping for “Potassium” in the original file, stripping the identifiers out of the matching lines, and feeding them to the next command), you could get the sequences you want in a slightly more roundabout way.
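That roundabout way might look something like the sketch below. It assumes locus tags of the form VV1_<digits> appear in the FASTA description lines; your headers may differ depending on how genbank_to_fasta formatted them, so check the file first.

```shell
# Pass 1: find the "Potassium" description lines and strip out just the
# locus tags (assumed pattern VV1_<digits>), one per line
grep "Potassium" AE016795.3.aa.fasta | grep -o "VV1_[0-9]*" > potassium_ids.txt
# Pass 2: hand each identifier to bioawk and collect the matching records;
# searching $name and $comment together covers tags that sit after the
# first space in the header
while read id; do
  bioawk -c fastx -v id="$id" '{ if (match($name" "$comment, id) > 0) { print ">"$name" "$comment; print $seq } }' AE016795.3.aa.fasta
done < potassium_ids.txt > potassium.aa.fasta
```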
Homework
For your homework, I want you to go all the way to the end of the book to Chapter 16 and learn the following commands:
- head
- tail
- cut
- sort
- uniq
In your sandbox, save the files you make by following the tutorials, and save a shell history of the commands you used for each section (use the “history” command — check its manual for options — and your redirection skills). Then we’ll pick up with shell scripting and sequence utility programs tomorrow.
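One possible shape for each saved-history file, once you’ve worked out the options you want (the file name here is just an example; use one per section):

```shell
# dump the current shell's command history into a file in the sandbox;
# run this after finishing the exercises for a given section
history > head_tail_section_history.txt
```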