BINF 6215: Shell scripting 101
Shell scripting is a powerful way to string commands together, make commands repeat themselves on a list of files, and all manner of other useful conglomerations of function. Scripting isn’t really “programming” per se — it’s one level of abstraction above. A script is just a file that contains a list of commands that you want to execute, in the order you want them executed. It’s also possible to create simple loops in shell scripts so that commands can be repeated.
On the first day of class we used Apple Automator to make a simple file format conversion service that can be accessed from the Finder. What if I want to automate a kind of file conversion that my operating system doesn’t know about, though? Like converting downloaded SRA files to fastq files?
Well — here’s a little Automator script that runs a shell script that runs fastq-dump. Automator grabs the input; the script moves to a working directory where I want to collect all my NGS data and dumps out the fastq files, one by one for each item in the input. Not the most sophisticatd script, perhaps, but it does its little job.
What the script says is simply, for every value (f) in the input list (of files), run this command. When there’s no more values, you’re done.
This is actually intermediate script-fu so let’s start back a few steps.
Getting set up
Follow the instructions in the book, pp. 85-99, to set up a scripts directory in your home directory, point your shell path to that directory, and write the small demonstration scripts that list directories and change filenames. Once you can do these things successfully we can move on to some more complex problems.
Redirects and pipes
Redirects, as we learned earlier, take the standard output of a command and send it to a file. Pipes are another UNIX construction. They take the standard output of a command and treat it as the input to a subsequent command.
grep ATOM structure_1gfl.pdb | grep -v REMARK
In English, this says get all the lines in file structure_1gfl.pdb that contain the string “ATOM”, and then, from those lines, take only the ones that DON’T contain the string “REMARK”.
As your homework, you should have taken a look at the head, tail, cut, sort and uniq commands in Ch. 16 of the book. On p. 303-306, there is an example exercise that shows you how to pipe together several of these basic UNIX command line operations, to turn a PDB file into a table of amino acid frequencies. Work through that tutorial now.
Pipes can be used to connect any kind of programs — as long as they accept input from stdin and write output to stdout, and as long as there is only one input and output stream that you need to capture at each step.
So now that we understand what can be done on a single UNIX command line, it’s time to start putting command lines together into scripts. We will start with the very basics.
Required elements of a script
To turn a series of command lines into a script, you need to format it in a particular way. Here’s an example of a very crude script that I wrote to pre-process some data sets so that they could just all get done in the background without me checking on them. It’s just a bunch of command lines with a #! /bin/bash at the top. I didn’t do anything fancy here, I just built the command lines in vi, explicitly for each file that I wanted to process.
#! /bin/bash
~/Applications/seqtk/seqtk sample -s100 CMCP6_1ASW_forward_paired.fastq 0.2 > CMCP6_1ASW_forward_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1ASW_forward_unpaired.fastq 0.2 >CMCP6_1ASW_forward_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1ASW_reverse_paired.fastq 0.2 > CMCP6_1ASW_reverse_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1ASW_reverse_unpaired.fastq 0.2 >CMCP6_1ASW_reverse_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1HS_forward_paired.fastq 0.2 > CMCP6_1HS_forward_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1HS_forward_unpaired.fastq 0.2 > CMCP6_1HS_forward_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1HS_reverse_paired.fastq 0.2 > CMCP6_1HS_reverse_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_1HS_reverse_unpaired.fastq 0.2 > CMCP6_1HS_reverse_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2ASW_forward_paired.fastq 0.2 > CMCP6_2ASW_forward_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2ASW_forward_unpaired.fastq 0.2 >CMCP6_2ASW_forward_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2ASW_reverse_paired.fastq 0.2 > CMCP6_2ASW_reverse_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2ASW_reverse_unpaired.fastq 0.2 >CMCP6_2ASW_reverse_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2HS_forward_paired.fastq 0.2 > CMCP6_2HS_forward_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2HS_forward_unpaired.fastq 0.2 > CMCP6_2HS_forward_unpaired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2HS_reverse_paired.fastq 0.2 > CMCP6_2HS_reverse_paired.red.fastq
~/Applications/seqtk/seqtk sample -s100 CMCP6_2HS_reverse_unpaired.fastq 0.2 > CMCP6_2HS_reverse_unpaired.red.fastq
I can make this executable by typing:
chmod u+x
I can run it by typing:
sh samplescript.sh
It’s worth noting that I didn’t build this input file by typing out each of these command lines manually. Instead, I did the following:
- Used the ls command to list all the input files and direct them into a temporary file.
- Edited one copy of that temporary file in vi to change the filenames to insert “*.red” into the middle of the filename, to distinguish them from the inputs
- Used history | grep seqtk to remind me of the command line format I’d used when running seqtk manually
- Edited a second copy of the temporary file in vi to insert the non-variable parts of the command line before and after the input filename
- Used the UNIX command “paste” to merge the two files together into the final script
- Used vi to add the #! /bin/bash line to the top of the file
So even if you haven’t learned all there is to know about variables and inputs to your shell script, UNIX can still make it easy to create a crude script that will do what you want WITHOUT laboriously typing out each command line or manually changing characters. This makes it more likely that you will not make mistakes when creating filenames and that your files will be named systematically and be easy to track.