So far in your UNIX journey, you’ve learned how to:
- Run built-in and custom UNIX commands
- Put a series of commands together into a simple script that follows sequential logic, and construct a simple conditional that uses a test operator
- Introduce loops, user input, and more complex conditionals to automate your workflow
It may not seem like it yet, but you have already learned a lot about common programming tasks. The last thing we’re going to do with bash, before we move on to Python, is figure out how to parse script variables from an input file, as another approach to automation.
The input file that you’ll use comes from the Short Read Archive (SRA). What we’re going to automate is the process of extracting the run identifiers from that table and naming the output files after the more recognizable library names. The file looks like this:
Remember, from the class notes on Tuesday, you want fields 3 and 6 of this table — the Library name and the run code that starts with SRR.
To simply read in the lines in the file and echo them back to your screen, use the code snippet that we practiced in class on Tuesday.
    #! /bin/bash
    while read line
    do
        echo -e "$line \n"
    done
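The loop above reads from standard input, so the script has to be given the table somehow, for example by running it as `./myscript.sh < table.txt` (both names are hypothetical). Another option is to redirect the file into the loop itself. The following is a minimal sketch of that variant, assuming a made-up three-line sample file rather than the real SRA table; the function name `echo_lines` is only for illustration.

```shell
#!/bin/bash
# Hypothetical sample file standing in for the real table.
tmpfile="$(mktemp)"
printf 'header line\nfirst data line\nsecond data line\n' > "$tmpfile"

echo_lines() {
    # $1 = name of the file to read.
    # IFS= and -r keep leading whitespace and backslashes intact.
    while IFS= read -r line
    do
        echo "$line"
    done < "$1"    # the redirect feeds the file to the whole loop
}

echo_lines "$tmpfile"
rm -f "$tmpfile"
```

Redirecting at `done` keeps the file name inside the script, which can be convenient when the script later needs standard input for something else.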
To skip the first line of the input file that has the column headers instead of values, you can set up a while loop with a counter, and then use an if…then…else construct to process every line other than the first line.
    #! /bin/bash
    i=1
    while read line
    do
        if [ "$i" -eq 1 ]
        then
            # the first line holds the column headers, so skip it
            i=$((i + 1))
        else
            libname="$( cut -d ' ' -f 3 <<< "$line" )"
            sraname="$( cut -d ' ' -f 6 <<< "$line" )"
            i=$((i + 1))
        fi
    done
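One detail worth checking when you write a counter in bash: a plain assignment such as `i=$i+1` appends text to the variable instead of adding, so arithmetic expansion with `$(( ))` is needed to get a number. A small sketch of the difference, with `i` and `j` used only for illustration:

```shell
#!/bin/bash
i=1
i=$i+1          # plain assignment: i is now the string "1+1"
echo "$i"       # prints 1+1

j=1
j=$((j + 1))    # arithmetic expansion: j is now the number 2
echo "$j"       # prints 2
```

A string counter can still appear to work when the only test is "is this the first line", which is why this kind of bug is easy to miss.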
To get the variable values that you want, you’ll need to actually process the incoming line. You can use the standard UNIX utility cut (which you practiced in last week’s homework assignment) to get the correct field out of the line for each variable. The -d option sets the field delimiter (a single space here) and -f selects the field number.
    libname="$( cut -d ' ' -f 3 <<< "$line" )"
    sraname="$( cut -d ' ' -f 6 <<< "$line" )"
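As an alternative to cut, `read` can split each line into fields by itself when it is given several variable names. The sketch below is one possible way to do that, not the method from class. It assumes a made-up table whose rows are placeholders, not real SRA records, and the names `sample_table`, `list_pairs`, and `c1` through `rest` are hypothetical.

```shell
#!/bin/bash
# Hypothetical table: a header plus two placeholder rows.
sample_table() {
    cat <<'EOF'
c1 c2 Library c4 c5 Run c7
a b LibA d e SRR0000001 g
a b LibB d e SRR0000002 g
EOF
}

list_pairs() {
    i=1
    # read assigns one whitespace-separated field per variable;
    # everything after the sixth field lands in "rest".
    while read -r c1 c2 libname c4 c5 sraname rest
    do
        if [ "$i" -gt 1 ]    # skip the header line
        then
            echo "$libname $sraname"
        fi
        i=$((i + 1))
    done
}

sample_table | list_pairs
```

Note that `read` treats any run of spaces or tabs as one separator, while `cut -d ' '` counts every single space, so the two approaches can disagree if a row has empty fields.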
Once you’ve verified that you can get the variable values that you want (for example, by echoing them for each line), include the fastq-dump command in your script to actually get the files from the Short Read Archive.
    fastq-dump -Z "$sraname" > "$libname.fastq"
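Because each download can take a while, it may help to print the commands first and check them before running anything. The following is a rough sketch of such a dry run. The function name `fetch_run`, the `DRY_RUN` variable, and the sample accession and library name are all hypothetical, and the output file is assumed to be named after the library.

```shell
#!/bin/bash
# DRY_RUN=1 only prints the command; set DRY_RUN=0 to really call
# fastq-dump (which must be installed and on your PATH).
DRY_RUN="${DRY_RUN:-1}"

fetch_run() {
    # $1 = run accession, $2 = library name
    if [ "$DRY_RUN" = 1 ]
    then
        echo "fastq-dump -Z $1 > $2.fastq"
    else
        fastq-dump -Z "$1" > "$2.fastq"
    fi
}

# Placeholder values, not a real SRA record.
fetch_run SRR0000001 LibA
```

Inside your loop, the call would become `fetch_run "$sraname" "$libname"`.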
What you should turn in: a version of your genome assembly workflow script with this functionality integrated, rather than getting the file list from a filename expansion like we did last time.
Once you’re done with this task, you can use the rest of your lab time to get started on the new variant calling workflow.