BINF 6215: Building a bash shell script

At this point you should have a series of command lines that you have vetted by manually testing the complete workflow on one of your chloroplast sequences.  Here’s my solution — I used the FASTX Toolkit for trimming before running SPAdes and QUAST.  In my test of this pipeline, the adapter clipping steps reduced my data set from 213112 reads to 209076 reads.


I viewed the trial data set with FastQC before deciding on my trim parameters. In these datasets, the first 8 bases, and everything out in the long tail past 300 or so, have really screwy base composition, so I decided to get rid of those outright. We also know that Ion Torrent quality scores are calibrated somewhat low: the distribution centers around 20 rather than up around 33 as in Illumina data. That doesn't mean the data are unusable, so we set the q-score threshold rather lower. That's the rationale behind the trim and quality parameters in the next two commands.
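Based on that rationale, the two commands would look something like this FASTX Toolkit sketch. The exact positions and thresholds here are illustrative, not necessarily the values I used; -Q33 tells the tools that the quality scores are Phred+33.

```sh
# Keep bases 9 through 300 (drop the screwy first 8 bases and the long tail).
fastx_trimmer -Q33 -f 9 -l 300 -i BC21.clipped.fastq -o BC21.trimmed.fastq

# Drop reads where fewer than 50% of bases reach quality 17 (a lenient
# threshold, to suit Ion Torrent's lower-calibrated scores).
fastq_quality_filter -Q33 -q 17 -p 50 -i BC21.trimmed.fastq -o BC21.filtered.fastq
```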


I was pretty lenient but at the end I still had only 175692 of the original reads. Then I ran SPAdes and QUAST.


In this case I could give QUAST a reference genome to compare to, so I did.

Getting set up

To start with, in your sandbox, create a “chloroplasts” directory and copy all of the summer 2014 chloroplast fastq files into it. Also copy the reference sequences and the “” file in case you need those sequences for the adapter trimming method you chose.

Building the basic script

The most basic form of our script is to put the six commands together in one file, one after another. Remember, the file needs to start with the first line #! /bin/bash. In this version of our script, we’re barely automating — all of the filenames are hardcoded and the script will only run on one of our chloroplast files without editing it to change the filenames. Nonetheless, this is one step beyond typing each of the commands at the command line manually.  Here’s what mine looks like:
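A sketch of the shape such a script takes, assuming the FASTX Toolkit / SPAdes / QUAST workflow described above, applied to the BC21 sample. The adapter sequence, trim positions, and quality thresholds are placeholders; substitute your own.

```sh
#! /bin/bash
fastx_clipper -Q33 -a ADAPTERSEQUENCE -i BC21_BINF6350_Summer2014_13pm.fastq -o BC21.clipped.fastq
fastx_trimmer -Q33 -f 9 -l 300 -i BC21.clipped.fastq -o BC21.trimmed.fastq
fastq_quality_filter -Q33 -q 17 -p 50 -i BC21.trimmed.fastq -o BC21.filtered.fastq
spades.py --iontorrent -s BC21.filtered.fastq -o BC21spades
quast -o BC21quast -R reference.fasta -G reference.gff BC21spades/contigs.fasta
```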


Yep, that’s ugly. But it works. Build yours, run it. Then we’ll edit to make it less ugly and more functional. You should know what your output is expected to look like from each step in your sequence of commands so you can assess if it ran properly. If you redirect output from “sh” into a file, you can have a record of all the stdout from the different steps, for posterity.

I also made a little script to quickly clean up the outputs of my test runs. You can modify it as you go, as the outputs you're accumulating change.
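A cleanup script can be as simple as deleting the intermediate files and output directories from the last run. The names here are hypothetical; match them to whatever your own pipeline produces.

```shell
#! /bin/bash
# Remove intermediate files from a test run (-f: don't complain if absent).
rm -f BC21.clipped.fastq BC21.trimmed.fastq BC21.filtered.fastq BC21.all.stdout
# Remove the tool output directories.
rm -rf BC21spades BC21quast
echo "Cleaned up test outputs."
```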


Making your script keep you informed

A really simple change that we can make to the assembly script is to accumulate the standard output of each of the steps into a file, and have the script send progress updates to stdout instead. You can use the >> operator (append redirect) to send all your steps' output to the same file, and the echo command (pp. 85-86 and p. 312 in your book) to send yourself a little message from your script.

Here’s what this looks like added to my ugly script:
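The pattern is easy to demonstrate. In this sketch the actual tool commands are stood in for by echo lines so the pattern itself can run anywhere; in the real script, each step's command stays in place and simply gains >> BC21.all.stdout at the end.

```shell
#! /bin/bash
log=BC21.all.stdout
echo "Clipping adapters..."
echo "(fastx_clipper output would land here)" >> $log
echo "Trimming and quality filtering..."
echo "(fastx_trimmer and fastq_quality_filter output here)" >> $log
echo "Assembling with SPAdes..."
echo "(spades.py output here)" >> $log
echo "Evaluating with QUAST..."
echo "(quast output here)" >> $log
echo "All done. Tool output saved in $log"
```

Run with sh, the progress messages appear on the terminal while the (stand-in) tool output accumulates in BC21.all.stdout.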


And here is what stdout looks like when my script runs now:


Getting filenames from user input

What would make this script really useful would be to have it take a filename as input and name its output files automatically to match, instead of having filenames hardcoded. Bash makes this easy. If you run a shell script followed by arguments, those command line arguments automatically become available inside the script as the positional variables $1, $2, $3, $4, and so on. These can then be substituted for their values anywhere in your script.
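A tiny demo makes this concrete. This throwaway script (the name args_demo.sh is arbitrary) just echoes its first two arguments back:

```shell
# Create a small demo script that prints its first two arguments.
cat > args_demo.sh <<'EOF'
#! /bin/bash
echo "first argument: $1"
echo "second argument: $2"
EOF

sh args_demo.sh BC21_BINF6350_Summer2014_13pm BC21
# prints:
# first argument: BC21_BINF6350_Summer2014_13pm
# second argument: BC21
```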

So if I want my chloroplast script to take input from the command line, what would be useful to take? Well, there are two repeating patterns that occur throughout the script: the base name of the input fastq file, and the base name used to name the directories that SPAdes and QUAST make. So I could set the script up to use those two variables, like so:

sh chloroplast.sh BC21_BINF6350_Summer2014_13pm BC21

(chloroplast.sh here stands for whatever you named your script.) Thus BC21_BINF6350_Summer2014_13pm becomes $1 and BC21 becomes $2. Bash Guru has more on this if you want to know how large numbers of arguments work.

Now that I’m using variables, this is what my script looks like:
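Substituting $1 and $2 into the earlier sketch gives the general shape, something like this (tool options still illustrative):

```sh
#! /bin/bash
# Usage: sh chloroplast.sh <fastq basename> <output basename>
# Example: sh chloroplast.sh BC21_BINF6350_Summer2014_13pm BC21
fastx_clipper -Q33 -a ADAPTERSEQUENCE -i $1.fastq -o $2.clipped.fastq >> $2.all.stdout
fastx_trimmer -Q33 -f 9 -l 300 -i $2.clipped.fastq -o $2.trimmed.fastq >> $2.all.stdout
fastq_quality_filter -Q33 -q 17 -p 50 -i $2.trimmed.fastq -o $2.filtered.fastq >> $2.all.stdout
spades.py --iontorrent -s $2.filtered.fastq -o ${2}spades >> $2.all.stdout
quast -o ${2}quast -R reference.fasta -G reference.gff ${2}spades/contigs.fasta >> $2.all.stdout
```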


You’ll notice I added a couple of commented lines at the top, telling the user how to use the script. The more you can document internally in your script, the more useful it will be to others and to you, a month from now, when you forget why you made it.

Being able to automate like this is a great reason to consistently use file basenames and file extensions to organize your project, and to organize files hierarchically so that scripts can easily find them.


This script makes a lot of assumptions. It assumes that every file the user points it to will actually be present. It assumes that the different commands in the script will run exactly the same way every time. But you can edit the script to protect against this, at least somewhat.

In the previous example, there is a third variable that we could have added on our command line — we could turn the base name of the reference genome sequence files that Quast uses into a variable. But sometimes there is no reference genome and yet you want to run Quast anyway.  If we allow for the option of a third command line parameter $3, we can use a conditional statement to test for it.  This is covered in the functions section on page 312 of your book.  The necessary code would be something like this:

if [ "$3" ]; then
quast -o ${2}quast -R $3.fasta -G $3.gff ${2}spades/contigs.fasta >> $2.all.stdout
else
quast -o ${2}quast ${2}spades/contigs.fasta >> $2.all.stdout
fi

Now we're getting to a point where we really don't want to run the whole test script every time. So, to test your conditional, comment out the actual commands that run the bioinformatics analyses (or at least SPAdes and QUAST, which are the time-consuming ones), and insert echo lines that tell you the outcome instead, like so:
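A runnable sketch of that test version, with the expensive commands replaced by echo lines (the script name and reference basename "myref" are hypothetical). Running it twice, with and without a third argument, exercises both branches of the conditional:

```shell
cat > conditional_test.sh <<'EOF'
#! /bin/bash
echo "would trim and filter $1.fastq"
echo "would run spades into directory ${2}spades"
if [ "$3" ]; then
echo "would run quast against reference $3"
else
echo "would run quast with no reference"
fi
EOF

sh conditional_test.sh BC21_BINF6350_Summer2014_13pm BC21 myref
sh conditional_test.sh BC21_BINF6350_Summer2014_13pm BC21
```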


When this runs the output will look like this:


There are a lot of different things that we could test for with a bash script. Here, we tested to see if a variable exists.

  • We could test to see if a file exists before starting a process.
  • We could test to see if standard error returned empty when a process ran (i.e. it completed without error).
  • We could test to see if a variable exceeded a specified value.
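Each of those translates into a small bracket test. Here are sketches of all three; the filenames and the threshold are hypothetical stand-ins:

```shell
# Test whether a file exists before starting a process.
touch sample.fastq                       # stand-in input for this demo
if [ -f sample.fastq ]; then
echo "sample.fastq found, safe to proceed"
fi

# Test whether a step wrote anything to its error log (empty stderr = clean run).
true 2> step.err                         # stand-in for a real pipeline step
if [ ! -s step.err ]; then
echo "step completed without error output"
fi

# Test whether a variable exceeds a specified value.
readcount=175692
if [ "$readcount" -gt 100000 ]; then
echo "enough reads to attempt assembly"
fi
```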

Repeating the same steps on a list of files

Now we're ready to try the big Kahuna of automation — running the same script steps on a list of files. It would actually be totally functional to do this by making a master script that runs your current script over and over, like so (with chloroplast.sh standing in for whatever you named your script):

#! /bin/bash
sh chloroplast.sh basename1 dirname1 reference1
sh chloroplast.sh basename2 dirname2 reference2
sh chloroplast.sh basename3 dirname3 reference3

But there are better ways to make this happen from inside your main script.  Let’s start building this one up from a different base, though. Create a small script called simpleloop and run it in your fastq directory to make sure it gives you a list of your fastq files.

#! /bin/bash
for f in *.fastq; do
echo "File -> $f"
done

Now we need to modify this and convert the $f variable (the filename) into the equivalent of the variables that we used in the previous versions of the script. We need an equivalent of our former script’s $1 and $2. The only thing on the command line in this version will be the name of the reference genome file.

To get rid of the extension in the $f variable, converting it into the equivalent of our file basename, we can use a sed expression. Sed is the stream editor, another incredibly powerful command line tool that’s essentially a whole language in itself. (If you’re beginning to wonder where the bottom is in UNIX, know this: there is no bottom. You will keep finding new tools until you die.)

The sed command that we want is essentially the equivalent of search and replace in vi. We want to search variable $f (the filename) for “.fastq”, and replace it with nothing.

b=`echo $f | sed 's/\.fastq$//'`

Variable b is the equivalent of $1 in our old script.
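A quick check that the sed substitution does what we expect, using one of the chloroplast filenames:

```shell
f=BC21_BINF6350_Summer2014_13pm.fastq
b=`echo $f | sed 's/\.fastq$//'`
echo "Basename -> $b"
# Basename -> BC21_BINF6350_Summer2014_13pm
```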

The next command that we want takes all the text up to, but not including, the first underscore delimiter. This gives us a variable that is the equivalent of variable $2. The UNIX cut command is one way to do this: cut can extract by character position, but it can also split on a delimiter character. Here you want to cut on the underscore and take field 1 from variable $f.

c="$( cut -d '_' -f 1 <<< "$f" )"
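A quick check that cut pulls out the barcode field as expected:

```shell
f=BC21_BINF6350_Summer2014_13pm.fastq
c="$( cut -d '_' -f 1 <<< "$f" )"
echo "Dirname -> $c"
# Dirname -> BC21
```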

So now we can add these commands to our simple loop script and also some echo commands to make sure we’ve got the correct content in our $b and $c variables.

#! /bin/bash
for f in *.fastq; do
echo "File -> $f"
b=`echo $f | sed 's/\.fastq$//'`
echo "Basename -> $b"
c="$( cut -d '_' -f 1 <<< "$f" )"
echo "Dirname -> $c"
done

Try that. Now you can imagine what needs to happen next, right? Inside the for loop, instead of printing out the contents of variables, we need to insert the actions of the original script. Where we had $1 in the original script we now use $b, where we had $2 we use $c, and where we had $3 we use $1. We will run the looping chloroplast script with one command line argument: the reference genome (if available).

Make it happen. First, test it with the computationally costly steps commented out. Then maybe move all but two of your *.fastq files somewhere else temporarily and try it on just a couple of files with all the steps turned on. If it does everything it’s supposed to do, run it on all your chloroplast files.
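For reference, assembled from the pieces above, the looping version comes out something like this sketch (tool options and the adapter placeholder are illustrative, as before):

```sh
#! /bin/bash
# Usage: sh chloroplast_loop.sh [reference basename]
# Runs the trim/assemble/evaluate pipeline on every .fastq file in this directory.
for f in *.fastq; do
b=`echo $f | sed 's/\.fastq$//'`
c="$( cut -d '_' -f 1 <<< "$f" )"
echo "Processing $f ..."
fastx_clipper -Q33 -a ADAPTERSEQUENCE -i $b.fastq -o $c.clipped.fastq >> $c.all.stdout
fastx_trimmer -Q33 -f 9 -l 300 -i $c.clipped.fastq -o $c.trimmed.fastq >> $c.all.stdout
fastq_quality_filter -Q33 -q 17 -p 50 -i $c.trimmed.fastq -o $c.filtered.fastq >> $c.all.stdout
spades.py --iontorrent -s $c.filtered.fastq -o ${c}spades >> $c.all.stdout
if [ "$1" ]; then
quast -o ${c}quast -R $1.fasta -G $1.gff ${c}spades/contigs.fasta >> $c.all.stdout
else
quast -o ${c}quast ${c}spades/contigs.fasta >> $c.all.stdout
fi
echo "Finished $f"
done
```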

The stuff we’ve worked on here, combined with UNIX commands for file processing between steps, should probably take care of about 75% of your shell scripting needs. This script’s not the most elegant, nor is it idiot-proofed against all possible failures, but it will definitely get you started on getting stuff done.

If you got through this super-fast, work through the exercises with shell functions in your book, pp 308-318.
