BINF 2111: Know your UNIX, part 1
Before we move on to scripting, we’ve got to get a handle on the basic UNIX commands covered in Chapters 4 and 5 of the book. You can’t write a script until you understand the commands you want to give to the computer!
What you already know from the first day of class: review!
- You know how to open a shell window on your computer.
- You know how to find out where you are in the filesystem using pwd.
- You know how to see the files in the folder where you are using ls or ls -a.
- You know how to change directories using cd and move up the file tree using cd ../
- You know that you need to edit any scripts you make using a text-only editor like TextWrangler.
The Codecademy exercise will give you practice using these commands. You can also review the section on absolute and relative paths and navigation (p. 51-55) in the book.
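If you’d like to rehearse the review commands somewhere harmless, here is a minimal sketch. It assumes a throwaway directory made with mktemp, and the names visible.txt, .hidden.txt and subdir are invented for the demonstration.

```shell
# Work in a throwaway directory so nothing in your home directory changes.
start_dir=$(pwd)
nav_dir=$(mktemp -d)
cd "$nav_dir"

touch visible.txt .hidden.txt   # two empty files, one with a leading dot
mkdir subdir

pwd                             # where am I?
top=$(pwd)                      # remember this location for later
ls                              # lists visible.txt and subdir
ls -a                           # also lists .hidden.txt, plus . and ..
hidden_seen=$(ls -a | grep -c 'hidden')

cd subdir                       # move down into subdir
pwd
cd ../                          # move back up one level
back=$(pwd)

cd "$start_dir"                 # return to where we started
```

After “cd subdir” followed by “cd ../” you should be back where you began, which is why the two saved locations match.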
Start up a shell.
- In your /Volumes/yourname home directory (the shared home directory), make a directory called BINF-6211. The command to do this is “mkdir” (p. 56)
- Download the example files from Practical Computing for Biologists. The cheat sheets are also pretty useful. Expand the archive in your BINF-6211 directory. Inside your pcfb folder, you’ll have a directory called “sandbox” — when you’re tinkering with creating files and directories, or downloading files from the web, this is a good place to do it.
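As a rough sketch of the directory layout described above, the commands below build it inside a temporary directory. The temporary directory is only a stand-in for your real home directory, and here pcfb/sandbox is created by hand purely for illustration (in the lab it comes from expanding the archive).

```shell
# Stand-in for your home directory, so the real one is left untouched.
course_root=$(mktemp -d)

# mkdir creates a single new directory.
mkdir "$course_root/BINF-6211"

# mkdir -p also creates any missing parent directories along the way.
mkdir -p "$course_root/BINF-6211/pcfb/sandbox"

# Confirm what was made.
ls "$course_root/BINF-6211"
ls "$course_root/BINF-6211/pcfb"
```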
Basic shell behavior
You can tell what kind of shell you are in by typing “echo $SHELL”, which prints the path of your login shell.
What kind of shell does your Mac use?
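Here is one hedged way to check, assuming the SHELL variable is set on your machine (it normally is) and that the ps command is available. SHELL names your login shell, while ps reports the shell that is actually running right now; the two can differ.

```shell
# Login shell, as recorded in the SHELL environment variable.
echo "${SHELL:-unknown}"

# Just the program name, without the directory part.
login_shell=$(basename "${SHELL:-unknown}")
echo "login shell: $login_shell"

# Name of the shell process running these commands ($$ is its process ID).
ps -p $$ -o comm= || echo "ps is not available here"
```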
Setting shell behavior
If you type “env” (or “printenv”), you will get a list of the environment variables that control your shell. These include some very important settings, such as PATH, the list of directories where your command line interpreter looks for programs that you have installed.
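The path is a single line of directories separated by colons, which is hard to read. A small sketch for looking at it one directory per line:

```shell
# Print only the PATH variable.
echo "$PATH"

# Same information, one directory per line (tr swaps each colon for a newline).
echo "$PATH" | tr ':' '\n'

# The shell searches these directories in order, starting with the first.
first_dir=$(echo "$PATH" | cut -d: -f1)
echo "searched first: $first_dir"
```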
It’s really important that your environment is set up so that you’re looking for programs in the right path when you try to run them from scripts or at the command line. You can add directories to your path by creating a .bash_profile file containing a command such as “export PATH=$PATH:/Users/yourname/newdirectory/directory”, which would add the directory /Users/yourname/newdirectory/directory to your path in any new shell window you open.
Even if you don’t have a .bash_profile file already, the $PATH variable is set in a system-level file. You can make sure you get both the system path and the new directories you are adding by including the variable $PATH in the command in your .bash_profile file.
If you don’t have the right directories in your path and you type the name of a program into your shell, the program will not run: the shell reports “command not found” because it does not know where to look for that program.
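To see both ideas in action, here is a minimal sketch. The directory newdirectory/directory is hypothetical (it does not need to exist), and no_such_program_binf2111 is a deliberately made-up program name. The change to PATH lasts only for the current shell session.

```shell
# Hypothetical directory to add; adjust to a real one on your machine.
new_dir="$HOME/newdirectory/directory"

echo "before: $PATH"
# Keep the existing path ($PATH) and append the new directory after a colon.
export PATH="$PATH:$new_dir"
echo "after:  $PATH"

# Check that the new directory is now one of the colon-separated entries.
case ":$PATH:" in
    *":$new_dir:"*) path_has_new_dir=yes ;;
    *)              path_has_new_dir=no ;;
esac

# A program that is not in any PATH directory cannot be run.
# The shell gives up with exit status 127 ("command not found").
missing_status=0
no_such_program_binf2111 2>/dev/null || missing_status=$?
echo "exit status for a missing program: $missing_status"
```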
Finding programs on your computer
If you want to know whether your system can find a particular program, use the which command. First let’s see if your system has the “vi” editor. Type “which vi”.
Your system should return the full path to the program.
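A short sketch of the same check, using ls because it is present on essentially every system. It also shows “command -v”, a shell built-in that answers the same question and is handy inside scripts; the missing program name is made up.

```shell
# which prints the full path of the first match found on your PATH.
which ls || echo "ls not found (or which is unavailable)"

# command -v is built into the shell and gives the same answer.
ls_path=$(command -v ls)
echo "ls lives at: $ls_path"

# When nothing is found, nothing is printed and the exit status is non-zero.
if command -v no_such_program_binf2111 >/dev/null 2>&1; then
    found=yes
else
    found=no
fi
echo "made-up program found? $found"
```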
Now create a file in TextWrangler. Add a line that appends software/bin to the existing path, for example “export PATH=$PATH:$HOME/software/bin”.
By adding this text, you are telling the shell interpreter to look for executable programs in the system path, but to also look for executable programs in a directory called software/bin, which is under your home directory.
Save your file as .bash_profile in your home directory.
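Once you have saved the file, type “source ~/.bash_profile”.
“source” is the command that tells the shell interpreter to read and run the instructions in a shell script within your current shell, so the new settings take effect without opening a new window.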
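If you want to try this without touching your real .bash_profile, the sketch below writes a practice file into a temporary directory and reads it in. The file name demo_bash_profile and the temporary location are assumptions made for the demonstration. The single dot is the portable spelling of “source”; in bash the two do the same thing.

```shell
# Practice location, standing in for your home directory.
demo_home=$(mktemp -d)
profile="$demo_home/demo_bash_profile"

# Write one line into the practice profile: keep $PATH, append software/bin.
printf 'export PATH="$PATH:%s/software/bin"\n' "$demo_home" > "$profile"
cat "$profile"

# Read the file into the current shell (same as: source "$profile" in bash).
. "$profile"

# The new directory should now be part of PATH in this shell.
case ":$PATH:" in
    *":$demo_home/software/bin:"*) sourced_ok=yes ;;
    *)                             sourced_ok=no ;;
esac
echo "profile took effect? $sourced_ok"
```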
Moving around in the filesystem
To move from directory to directory in the file system, you use the command “cd”. “pwd” told you that you were located in the directory /Users/yourname. But what if you want to move to another location? You might have noticed that in your system path, /usr/local/bin is one of the places that your shell interpreter looks for executable programs. To move to that directory, type “cd /usr/local/bin”. Then type “ls” to see what programs are present.
Another location you might be interested in is your Dropbox. If you don’t have a Dropbox account by now, you should get one. If you do and you have connected to it from the lab computer, your Dropbox is probably located in /Users/yourname/Dropbox.
Commands and flags
Try using some of the commands you are used to (for instance, cp) followed by -h. That’s what’s called a flag (or option). Flags change settings in the program you are running, and can also be followed by a parameter value. Flags are specific to each UNIX command and program, although you’ll often find shared conventions: -i followed by a filename is often the input, and -o followed by a filename is often the output. To find out what command line options the cp command can be used with, just type “cp” at the command line with no other flags or filenames.
To get a full online manual page describing the command usage in detail, type “man cp”.
Not all programs will have UNIX manpages, but most programs will have a command line help option.
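The sketch below tries both ideas: running cp with nothing after it to get its usage message, and giving a flag a parameter value. The exact wording of the usage message differs between macOS and Linux, so no particular output is promised here. The head example is an extra illustration, not from the book.

```shell
# cp with no arguments complains and prints a short usage message.
# The message goes to standard error, so 2>&1 is needed to capture it.
usage_text=$(cp 2>&1 || true)
echo "$usage_text"

# A flag followed by a parameter value: head -n 2 keeps the first 2 lines.
line_count=$(printf 'a\nb\nc\n' | head -n 2 | wc -l | tr -d ' ')
echo "lines kept by head -n 2: $line_count"

# The full manual page opens in a pager (press q to quit), so it is left
# commented out here:
# man cp
```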
Now let’s look at command line options for a little program called nucmer, which should be on your workstation. nucmer is part of the MUMmer package of sequence analysis software. MUMmer is not a native UNIX program, but most developers will provide similar command line feedback options with programs that are meant to run at the command line. So you can type just “nucmer” to get the usage, and “nucmer -h” to get a help page describing all the options.
- Is nucmer on your machine?
- Where is the executable program located?
- What does nucmer do?
- What are mandatory elements that have to be present for nucmer to run?
- How do you change the prefix of the output files from nucmer?
- How do you set the program to use matches that are unique in both the query and the reference?
When you start working with command line programs to build pipelines, examining these options and making sure of what your parameters should be, where the input will come from and where the output will flow to is a critical step.
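One habit that helps in pipelines is checking that a program is available before you call it. The function below is a sketch of that idea; require_program is a made-up helper name, not part of MUMmer or the shell.

```shell
# Report whether a program can be found on the PATH.
# Returns 0 (success) if it is found, 1 otherwise.
require_program() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found $1 at $(command -v "$1")"
        return 0
    else
        echo "$1 is not on your PATH" >&2
        return 1
    fi
}

# Only ask nucmer for its help page if it is actually installed.
if require_program nucmer; then
    nucmer -h
fi
```

If nucmer is missing, the function prints a message and the help step is skipped, instead of the pipeline failing partway through.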
Command line shortcuts
- Review the “up arrow” shortcut on p. 59
- Review the “tab” shortcut for filename completion on p. 60
- Review “man” and “less” commands on p. 62-63
You’ve now seen and tried out every command in Ch. 4.
The “cat” command streams the contents of a file to standard output. It’s not particularly useful for viewing files (unless they are small) but if you want to merge a group of text files into one, or direct a file into a standard output stream for other reasons (like to pipe into a command that does not take filenames as arguments, which is uncommon but not unheard of), it’s very useful. Work through the practice exercises with “cat” in the book (p. 70-72).
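As a small illustration of both uses, the sketch below merges two toy FASTA files and then pipes a file into tr, a command that reads only from standard input. The file names and sequences are invented for the example.

```shell
cat_dir=$(mktemp -d)

# Two tiny practice files.
printf '>seq1\nACGT\n' > "$cat_dir/one.fasta"
printf '>seq2\nTTGA\n' > "$cat_dir/two.fasta"

# View a small file.
cat "$cat_dir/one.fasta"

# Merge several files into one: cat streams them in order,
# and > redirects the stream into a new file.
cat "$cat_dir/one.fasta" "$cat_dir/two.fasta" > "$cat_dir/merged.fasta"

# tr does not accept filenames, so cat supplies the data through a pipe.
cat "$cat_dir/merged.fasta" | tr 'ACGT' 'acgt'

merged_lines=$(wc -l < "$cat_dir/merged.fasta" | tr -d ' ')
echo "lines in merged file: $merged_lines"
```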
The “curl” command streams the content of a URL to standard output, or to a file if specified. Basically this program is used to get web content without a web browser. First, work through the exercises in the book (p. 78-81). Then we’ll learn something more useful.
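Before pointing curl at a real web address, you can try the difference between standard output and a file on a local file. This is only a sketch: it assumes your curl accepts file:// URLs (most builds do), and page.txt is a made-up file name. curl may print a progress meter when its output goes to a file; that is normal.

```shell
curl_dir=$(mktemp -d)
printf 'hello from curl\n' > "$curl_dir/page.txt"
url="file://$curl_dir/page.txt"

if command -v curl >/dev/null 2>&1; then
    # With no other options, the content is streamed to standard output.
    curl "$url"

    # Redirect the stream into a file with >
    curl "$url" > "$curl_dir/redirected.txt"

    # Or let curl write the file itself with -o
    curl -o "$curl_dir/saved.txt" "$url"
    curl_demo=ran
else
    echo "curl is not installed on this machine"
    curl_demo=skipped
fi
```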
Taking advantage of web services with curl
curl is great for doing things like downloading a zipped software archive directly, at the command line, instead of going through the web browser. But it can be very useful to you when combined with web services that make data available through a URL interface. NCBI and EMBL (the European equivalent of NCBI) make a huge number of services available under various web services protocols. Not only database pulls and sequence searches but increasingly more complex kinds of analysis are available via web services. The Taverna workflow system, which we’ll look at a bit if we get a chance, can make use of workflows of web services pulled from EMBL and elsewhere.
The simplest class of services to understand is the URL API access to the nucleotide databases.
If you know the GenBank or INSDC sequence identifier for the sequence you want, you can see it in your web browser using a URL like http://www.ebi.ac.uk/ena/data/view/AE016795.3&display=fasta
You should see a FASTA nucleotide format file for Chromosome 1 of Vibrio vulnificus CMCP6. Chromosome 2 has the GenBank identifier AE016796.2 — you can grab it from EMBL using this curl command:
curl -s "http://www.ebi.ac.uk/ena/data/view/AE016796.2&display=fasta" > AE016796.2.fasta
Use the “man” command to find out what the command line flag used here means.
Pulling data from NCBI
Similarly, if you want to get data from NCBI, you could type:
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AE016795.3&rettype=fasta&retmode=text" > AE016795.3.fasta
or, to get the file in GenBank format with annotations, you can type:
curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AE016795.3&rettype=gb&retmode=text" > AE016795.3.gb
NCBI’s efetch functionality is documented (to a certain extent) in one of the NCBI online books.
Use curl or wget to get the GenBank files for both chromosomes of Vibrio vulnificus CMCP6. The GenBank identifiers are AE016795.3 and AE016796.2.
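When you need the same kind of request for several identifiers, it helps to build the URL in one place. The sketch below is a dry run: it only prints the curl commands so you can inspect them before running anything. efetch_url is a made-up helper name, and the URL assumes the https form of the efetch address with retmode=text.

```shell
# Build an efetch URL from an identifier ($1) and a return type ($2).
efetch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=%s&rettype=%s&retmode=text' "$1" "$2"
}

# Dry run: show the command for each chromosome without contacting NCBI.
for id in AE016795.3 AE016796.2; do
    last_url=$(efetch_url "$id" gb)
    echo "curl -s \"$last_url\" > $id.gb"
done
```

Once the printed commands look right, you can copy them to the prompt and run them one at a time.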
grep (and egrep, fgrep, zgrep, zegrep, zfgrep)
Work through the exercises in the book on grep (p. 72-78).
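As a warm-up that ties grep to the sequence files from the previous section, here is a sketch on a toy FASTA file. The file and its contents are invented; in FASTA format every header line starts with “>”, so the pattern '^>' (a “>” at the start of a line) picks out the headers.

```shell
grep_dir=$(mktemp -d)
printf '>chr1 demo\nACGTACGT\n>chr2 demo\nTTGACCA\n' > "$grep_dir/toy.fasta"

# Show only the header lines.
grep '^>' "$grep_dir/toy.fasta"

# -c counts matching lines instead of printing them.
header_count=$(grep -c '^>' "$grep_dir/toy.fasta")
echo "number of sequences: $header_count"

# -v inverts the match: everything that is NOT a header line.
grep -v '^>' "$grep_dir/toy.fasta"
```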
In your sandbox, save the files you make by following the tutorials, and save a shell history of the commands you used for each section. Command:
history > myhistoryfile.txt