BINF 2111: Variables and conditionals in bash
A good script should be:
- Clearly organized
- Correct in how it performs its task
Today you’re going to learn two key elements of bash scripting: use of variables, and use of a simple conditional test. We’ll also see how you can use the UNIX pipe operator to simplify your script and cut down on intermediate files, and how you can use the echo command to make your script inform you about its progress.
What you will turn in:
- Your script with echo commands added to show progress
- Your script with the FASTX Toolkit commands piped together
- Your script with variables instead of hardcoded filenames
- Your script with a conditional test for a reference genome filename
There will be a grade item for each one of these files.
At this point you should have a series of command lines that you have vetted by manually testing the complete workflow on one of your chloroplast sequences. Here’s what my solution looks like. This order of operations and choice of parameters gave me the best assembly I have been able to get for the BC30 data set so far. Your script may be a little different than this (and contain two clipping steps). That’s fine, as long as it works.
fastx_clipper -v -Q33 -a CCTCTCTATGGGCAGTCGGTGAT -i BC30_BINF6350_Summer2014_13pm.fastq -o BC30_BINF6350_Summer2014_13pm.clipTrP1.fastq
fastx_trimmer -v -Q33 -f 9 -l 289 -i BC30_BINF6350_Summer2014_13pm.clipTrP1.fastq -o BC30_BINF6350_Summer2014_13pm.trim.fastq
fastq_quality_filter -v -Q33 -q 20 -p 50 -i BC30_BINF6350_Summer2014_13pm.trim.fastq -o BC30_BINF6350_Summer2014_13pm.qualfilt.fastq
/usr/local/bin/spades.py --iontorrent -s BC30_BINF6350_Summer2014_13pm.qualfilt.fastq -o BC30spades -k 21,33,55,77,99,127
~/Applications/quast-2.3/quast.py -o BC30quast -R NC_007898.fasta -G NC_007898.gff BC30spades/contigs.fasta
Making your script keep you informed
A simple change that we can make to the assembly script is to accumulate the standard output of each of the steps into a file, and make the script send progress updates to stdout instead. You can use the >> (append redirect) to send all your steps’ standard output to the same file, and the echo command (p. 85-86 and p. 312 in your book) to send yourself a little message from your script. Here’s an example from a script with two fastx_clipper steps:
fastx_clipper -v -n -a CCATATCATCCCTGCGTGTCTCCCACTCAG -i BC21_BINF6350_Summer2014_13pm.fastq -o BC21_BINF6350_Summer2014_13pm.clipA1.fastq >> BC21.all.stdout
fastx_clipper -v -n -a CCTCTCTATGGGCAGTCGGTGAT -i BC21_BINF6350_Summer2014_13pm.clipA1.fastq -o BC21_BINF6350_Summer2014_13pm.clipTrP1.fastq >> BC21.all.stdout
echo 'finished fastx_clipper'
Do this for all the main steps in your script.
Echo commands aren’t just for sending little messages to your screen — they are really helpful when testing to see if program logic structures work and if variables have been captured correctly.
Cutting down on superfluous outputs
Now that you are sure your command lines are working, it’s probably time to stop producing a bunch of superfluous FASTQ outputs. Even for the chloroplast data those files get kind of big. When you first started with the FASTX Toolkit, you learned that each of its programs can send output to standard output, and take input from standard input, rather than creating and opening files. The pipe | operator will let you string commands together this way in your scripts and at the command line. For example, for the command lines above, you could type:
fastx_clipper -v -n -a CCATATCATCCCTGCGTGTCTCCCACTCAG -i BC21_BINF6350_Summer2014_13pm.fastq | fastx_clipper -v -n -a CCTCTCTATGGGCAGTCGGTGAT -o BC21_BINF6350_Summer2014_13pm.clipTrP1.fastq >> BC21.all.stdout
This would have the same results as running the two steps independently. Note: you still need an input filename for the first step, and an output filename for the last step, but all of the steps between can just use standard input and output instead of filenames.
Connect all of your FASTX Toolkit commands with pipes to form a single command line in your script.
Using variables in your script
This script would be more useful (and more reusable) if it could take a filename as input and name its output files automatically to match, instead of having filenames hardcoded. Bash makes this easy. If you run a shell command followed by command-line arguments, the arguments are automatically assigned as variables $1 $2 $3 $4…etc. These variables can then be substituted for values in the script.
So if I want my chloroplast script to take input from the command line, what would be useful to take? How do I determine this?
Look for repeating patterns. There are two repeating patterns that occur throughout the script. The base name of the input fastq file, and the base name used to name the output directories from SPAdes and Quast. I could set the script up to use those two variables, like so:
sh cp-takeinput.sh BC21_BINF6350_Summer2014_13pm BC21
Thus BC21_BINF6350_Summer2014_13pm becomes $1 and BC21 becomes $2. Bash Guru has more on this if you want to know how large numbers of arguments work.
If I set up my script using these two variables instead of explicit filenames, here is how part of it might look. (In this example I have not joined all my FASTX Toolkit commands together yet).
fastq_quality_filter -v -q 12 -p 90 -i $1.trim.fastq -o $1.qualfilt.fastq >> $2.all.stdout
echo 'finished fastq_quality_filter'
spades.py --iontorrent -s $1.qualfilt.fastq -o $2spades -k 21,33,55,77,99,127 >> $2.all.stdout
echo 'finished SPAdes'
At the top of my script, I should add some comments, describing what variables the script expects on the command line.
# call this script with two parameters
# 1 - fastq basename 2 - directory basename
Change all the filenames in your script to variables, and check that it will run correctly. Try it on more than one of your chloroplast files to make sure.
Using a conditional in your script
You might have noticed that there is one more repeating value in your original script that could be replaced with a variable — the reference genome that QUAST uses to evaluate assembly quality. QUAST doesn’t require a reference genome, you can run it without and still get a partial report.
When you use a reference genome as input, the QUAST command line includes two flags -R and -G along with their values, and it looks like this:
quast -o $2quast -R NC_007898.fasta -G NC_007898.gff $2spades/contigs.fasta >> $2.all.stdout
While if you do not give a reference genome as input, the command line looks like this:
quast -o $2quast $2spades/contigs.fasta >> $2.all.stdout
It would be useful to allow the user of your script to give a reference genome if one is available.
So how do you do that? Well, if you remember how variables on the command line work, if there were a third parameter entered on the command line, it would end up as variable $3. Here’s an example of how you’d describe that to the user.
# call this script with two required parameters, third parameter optional
# 1 - fastq basename 2 - directory basename 3 - reference basename
So if you gave the command
sh cp-conditional.sh BC21_BINF6350_Summer2014_13pm BC21 NC_007898
The variable $3 would have the value NC_007898 and that could be used as the basename of input FASTA genome and GFF files. The conditional logic that you want your program to follow is:
IF the variable $3 has a value, THEN use a QUAST command line that includes a reference genome.
ELSE use a QUAST command line that does not include a reference genome.
The syntax for that conditional logic in BASH uses an “if” statement to open and an “fi” to close the structure after the options for both “then” and “else” have been defined. You can see this in the example below.
Prior to making a real script that executes all of your assembly commands, you can just write a very simple script that tests whether $3 has been given a value. The [ $variable ] structure is called a test operator. If there are no other operators used inside the nested brackets, the test is simply for whether the variable is null.
if [ $3 ]
# variable is not null
echo 'we have a reference genome'
# variable is null
echo 'no reference genome'
First, make the little script above and run it with arguments to see how it behaves. Then, adapt your assembly script to include this structure, substituting the appropriate QUAST commands for the echo commands above.
There are many ways to test for other requirements — you could test to see whether a file exists, for instance. In English, the if statement below says, “If the file $3.gff exists”:
if [ -e $3.gff ]
# file exists
echo 'we have a reference genome'
# file does not exist
echo 'no reference genome'