BINF 2111: Getting variable values from an input file (Lab)

So far in your UNIX journey, you’ve learned how to:

It may not seem like it yet, but you have already learned a lot about common programming tasks. The last thing we'll do with bash, before we move on to Python, is figure out how to parse script variables from an input file, as another approach to automation.

The input file that you'll use comes from the Short Read Archive (SRA). What we're going to automate is extracting the run identifiers from that table and renaming each output file to the more recognizable library name. The file looks like this:

samplefile

Remember, from the class notes on Tuesday, you want fields 3 and 6 of this table — the Library name and the run code that starts with SRR.

To simply read in the lines in the file and echo them back to your screen, use the code snippet that we practiced in class on Tuesday.

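#!/bin/bash
# Reads from standard input, so redirect the input file into the script.
while IFS= read -r line
do
    # Print the line followed by a blank line.
    printf '%s\n\n' "$line"
done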
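
The loop reads from standard input, so one way to hand it a file is to redirect the file into the loop itself. The sketch below is only an illustration: the temporary file and its two lines are stand-ins made up for this example, not the real SRA table.

```shell
#!/bin/bash
# Build a tiny stand-in input file (hypothetical contents, for illustration only).
tmpfile=$(mktemp)
printf 'first line\nsecond line\n' > "$tmpfile"

# Redirect the file into the loop so that read consumes it one line at a time.
count=0
while IFS= read -r line
do
    echo "$line"
    count=$((count + 1))
done < "$tmpfile"

echo "Read $count lines"
rm -f "$tmpfile"
```

Because the redirect is attached to done, the loop runs in the current shell and the count variable is still set after the loop finishes.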

To skip the first line of the input file that has the column headers instead of values, you can set up a while loop with a counter, and then use an if…then…else construct to process every line other than the first line.

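#!/bin/bash

# Reads the table from standard input; assumes the fields are tab-delimited.
i=1
while IFS= read -r line
do
    if [ "$i" -eq 1 ]
    then
        # First line holds the column headers: skip it, but still count it.
        i=$((i + 1))
    else
        # Field 3 is the library name, field 6 is the SRR run code.
        libname="$( cut -d $'\t' -f 3 <<< "$line" )"
        sraname="$( cut -d $'\t' -f 6 <<< "$line" )"
        i=$((i + 1))
    fi
done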
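
A note on the counter: in bash, a plain assignment such as i=$i+1 stores text rather than doing arithmetic, so arithmetic expansion is the safer way to increment. A small sketch of the difference, using variable names chosen just for this example:

```shell
#!/bin/bash
i=1

# Plain assignment concatenates characters: this stores the text 1+1.
as_text="$i+1"

# Arithmetic expansion evaluates the expression: this stores the number 2.
as_number=$((i + 1))

echo "plain assignment:     $as_text"
echo "arithmetic expansion: $as_number"
```

With the text version, a test like [ $i = 1 ] is true only on the first pass, so the header is still skipped, but the counter never holds a usable number.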

To get the variable values that you want, you'll need to actually process the incoming line. You can use the standard UNIX utility cut (which you practiced in last week's homework assignment) to get the correct field out of the line for each variable. The delimiter passed to cut must be a single character; here it is a tab, written as $'\t' in bash.

    libname="$( cut -d $'\t' -f 3 <<< "$line" )"
    sraname="$( cut -d $'\t' -f 6 <<< "$line" )"
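
To check the field extraction on its own, you can try it on a single line before putting it in the loop. The row below is hypothetical (placeholder columns and a made-up run code) and assumes tab-delimited fields; the real table will have different values.

```shell
#!/bin/bash
# Hypothetical tab-delimited row, for illustration only.
line=$'col1\tcol2\tMyLibrary\tcol4\tcol5\tSRR0000001'

# The here-string (<<<) feeds the variable to cut as if it were a one-line file.
libname="$( cut -d $'\t' -f 3 <<< "$line" )"
sraname="$( cut -d $'\t' -f 6 <<< "$line" )"

echo "library: $libname"
echo "run:     $sraname"
```

If either variable comes back empty or holds the whole line, the delimiter is the first thing to check, since cut returns the full line when the delimiter never appears in it.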

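Once you've verified that you can get the variable values that you want, include the fastq-dump command in your script to actually get the files from the Short Read Archive. The -Z option writes the reads to standard output, so redirect them into a file named after the library.

fastq-dump -Z "$sraname" > "$libname.fastq"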
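
Before downloading anything, it can help to do a dry run that only prints the commands the loop would execute. The following is a minimal sketch under stated assumptions: the table contents are invented for illustration, the fields are assumed to be tab-delimited, and each output file is assumed to be named after the library name in field 3.

```shell
#!/bin/bash
# Hypothetical stand-in for the SRA table: a header row plus two made-up data rows.
table=$(mktemp)
printf 'c1\tc2\tLibrary\tc4\tc5\tRun\n' > "$table"
printf 'x\tx\tlibA\tx\tx\tSRR0000001\n' >> "$table"
printf 'x\tx\tlibB\tx\tx\tSRR0000002\n' >> "$table"

n_commands=0
i=1
while IFS= read -r line
do
    if [ "$i" -gt 1 ]
    then
        libname="$( cut -d $'\t' -f 3 <<< "$line" )"
        sraname="$( cut -d $'\t' -f 6 <<< "$line" )"
        # Dry run: print the command instead of executing it.
        cmd="fastq-dump -Z $sraname > $libname.fastq"
        echo "$cmd"
        n_commands=$((n_commands + 1))
    fi
    i=$((i + 1))
done < "$table"

rm -f "$table"
```

Once the printed commands look right, the echo line can be swapped for the real fastq-dump call.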

What you should turn in: a version of your genome assembly workflow script with this functionality integrated, so that it reads the file list from the input table rather than from a filename expansion like we did last time.

Once you’re done with this task, you can use the rest of your lab time to get started on the new variant calling workflow.
