BINF 6215: Using bpipe
bpipe is part of a relatively recent trend to build pipeline manager tools that work at the command line. Other examples of this trend are Snakemake, Leaf and nestly. You may want to consider using one of these other systems for building command line pipelines, but bpipe is a good place to start. bpipe is extensively documented, and makes it relatively easy for you to get started building shell script components and turning them into a collection of functional, reusable modules. bpipe is a Domain Specific Language (DSL) extension to Groovy, designed to run in the Java Virtual Machine. But you don’t need to know anything beyond shell scripting in order to begin using bpipe, and you don’t need anything but Java installed to run it.
Set up bpipe
For a single-user installation of bpipe, you can download the current release candidate and unpack it in your Applications directory. You could then move the executable to a directory that is in your path, or add an alias for it to your ~/.bash_profile and run “source ~/.bash_profile” to load the alias into your environment.
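As a sketch of such an alias, assuming bpipe was unpacked into a hypothetical ~/Applications/bpipe directory (your directory name will probably include a version number, so adjust the path to match):

```shell
# Hypothetical install path: point this at the bin/bpipe launcher inside
# whatever directory you actually unpacked.
alias bpipe='~/Applications/bpipe/bin/bpipe'
```

Typing alias bpipe afterwards prints the stored definition, which is a quick way to check that it was picked up.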
It’s critical that you use single quotes, not double quotes or backticks, in this alias, because your intention is to be able to use it literally as part of a command line, rather than to have it interpreted.
To test that your bpipe install is working, let’s use one of the examples from the bpipe documentation. Save the hello world script from the documentation in a file called helloworld.pipe.
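For reference, the hello world example in the bpipe documentation looks roughly like the script below. It is written here from the shell with a heredoc, but you can just as well paste the lines between the EOF markers into a text editor. Older versions of the documentation spell the last line Bpipe.run instead of run.

```shell
# Write the two-stage hello world pipeline to helloworld.pipe.
# The quoted 'EOF' stops the shell from interpreting anything inside.
cat > helloworld.pipe <<'EOF'
hello = {
    exec "echo Hello"
}
world = {
    exec "echo World"
}
run { hello + world }
EOF
```

Each name = { ... } block defines one stage, and the + in the run statement chains the stages in order.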
Run it by typing:
bpipe run helloworld.pipe
You should see output on the screen showing two pipeline stages completing successfully.
Convert your assembly pipeline
The second bpipe tutorial shows you how to pass input from one stage of your pipeline to the next, without specifying any aspects of the filename other than the extension. Go do that tutorial on your “hello world” script as shown, and then we’ll do a real bioinformatics example. If you think about the chloroplast assembly pipeline we built yesterday, the first four stages could easily be converted to filtering stages in bpipe. The fastx tools can read from standard input and write to standard output, so making these tools into a pipeline is very straightforward.
First, you run fastx_clipper to remove the A1 adapter. Let’s call that stage “clipa”. Then you run fastx_clipper to remove the TrP1 adapter. That stage is “clipb”. Then you run fastx_trimmer (“trim”) and fastq_quality_filter (“qual”). If you exec the exact commands from your original, unenhanced script (before we added all the fancy variables and loops), your bpipe script will have four stages, each wrapping one command in an exec statement, followed by a run statement that chains them together as clipa + clipb + trim + qual.
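A minimal sketch of that layout follows. The file names, the adapter placeholders, the trimming and quality settings, and the -Q33 flag (which assumes Phred+33 quality scores) are all hypothetical stand-ins, so substitute the exact commands from your own script:

```shell
# Write a four-stage trimming pipeline with hard-coded file names to trim.pipe.
cat > trim.pipe <<'EOF'
// Sketch only: file names, adapter placeholders and numeric settings are
// hypothetical. Paste in the commands from your own script.
clipa = {
    exec "fastx_clipper -Q33 -a A1_ADAPTER_SEQUENCE -i reads.fastq -o reads.clipa.fastq"
}
clipb = {
    exec "fastx_clipper -Q33 -a TRP1_ADAPTER_SEQUENCE -i reads.clipa.fastq -o reads.clipb.fastq"
}
trim = {
    exec "fastx_trimmer -Q33 -f 1 -l 200 -i reads.clipb.fastq -o reads.trim.fastq"
}
qual = {
    exec "fastq_quality_filter -Q33 -q 20 -p 80 -i reads.trim.fastq -o reads.qual.fastq"
}
run { clipa + clipb + trim + qual }
EOF
```

Because every file name is written out by hand, each stage has to name the previous stage's output as its own input.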
Make the script and see if it runs. You can even verify it by comparing your output files to your original command line outputs to make sure they are the same. The Unix command for this is diff.
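One way to wrap that comparison, as a small sketch: same_output is a hypothetical helper name, and it relies only on diff exiting with status 0 when two files match.

```shell
# Print "identical" if the two files match and "different" otherwise.
same_output() {
    if diff "$1" "$2" > /dev/null; then
        echo "identical"
    else
        echo "different"
    fi
}
```

You would call it with your own file names, for example same_output original.qual.fastq reads.qual.fastq (both names are placeholders).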
Pass variables from stage to stage
Following the examples in the tutorial, modify your script to use bpipe-style $input and $output variables. Make sure that you require the files to be *.fastq files. If you want to be creative, you can turn the adapter sequences into variables as well and set their values at the beginning of the script. I am not going to put the answer here, but if you do this right, the pipeline will run to completion on one of your files. The command line is bpipe run yourscript.pipe yourfilename.fastq.
What about when the output is a directory?
From the bpipe documentation, can you figure out what to do when the output is a directory and the input file is inside the output directory? Again, I won’t post the solution, but I will tell you it’s pretty straightforward. Add the SPAdes and Quast steps to your pipeline and try to figure it out.
To get an idea why bpipe might be especially helpful when working through long processes, try the following. Edit your script to execute only the first (trimming) steps in the pipeline. Run it. Then edit it again to add the SPAdes and Quast steps. Then restart the pipeline with “bpipe retry” instead of “bpipe run”.
Ah, that awesome feeling when you leave the most important part out of the tutorial and end up recreating it on the fly. To run this on multiple files, you need to use bpipe’s pipeline math.
The simplest version multiplies the whole chain of stages by a file pattern, so that all pipeline stages run on each fastq file in the input, however many there are. The % character is the wildcard inside bpipe. Far fancier stuff can be done (such as taking everything in your input directory and parceling out the *.sam files to one set of commands and the *.fastq files to a separate set of commands), but run this simple version with bpipe run script.pipe * and it will process all your fastqs.
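The multiplication described above can be sketched as follows. The show stage is a hypothetical toy that only echoes the name of the file it was handed; in your own script the brackets would hold your real chain of stages.

```shell
# Write a toy multi-file pipeline to multi.pipe.
cat > multi.pipe <<'EOF'
// Toy stage standing in for your own chain of stages.
show = {
    exec "echo Processing $input.fastq"
}
// "%.fastq" matches each fastq input; the bracketed stages run once per match.
run { "%.fastq" * [ show ] }
EOF
```

Swapping [ show ] for [ clipa + clipb + trim + qual ] gives the full version, provided those stages already use $input and $output rather than hard-coded file names.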