BINF 6215: Intro to UNIX — automating a repetitive task
Let’s say you’ve got a whole bunch of fastq.gz files that you want to get md5 checksums for so that you can upload them to the Sequence Read Archive (SRA). You could, of course, type “md5 filename” and wait, type “md5 filename2” and wait, and so on until you were done. But in UNIX, you do not have to do that.
This lesson introduces three concepts that are important for UNIX file manipulation: wildcards, standard output, and redirection of output to a file.
Wildcards are special characters that you can use in pattern searches (Chapters 2-3 in Haddock and Dunn) and in UNIX file operations. We’ll deal with the pattern search kind later. Here we’re just going to use a wildcard to do an operation on all files with names that contain a certain pattern.
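Here is a minimal sketch of what a wildcard does, assuming a bash-like shell. The shell replaces the pattern with the list of matching file names before the command runs, so echo simply shows what any command would be handed. The throwaway directory and the notes.txt file are made up for illustration.

```shell
# Work in a throwaway directory (hypothetical setup for this demo).
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch automation1.jpg automation2.jpg notes.txt

echo *.jpg          # automation1.jpg automation2.jpg
echo automation*    # automation1.jpg automation2.jpg
echo *              # automation1.jpg automation2.jpg notes.txt
```

Trying a pattern with echo first is a cheap way to check what it matches before you hand it to a command that takes a long time.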
Standard output is where some programs send their output if you don’t specify an output file to write to. When you use the md5 command on a file, the result is printed at the command line. If I run md5 on the “automation1.jpg” image file in my 6215-PREP directory, this is what prints at the command line:
MD5 (automation1.jpg) = 17ee06433b365ca4f96bf4482c275810
If I am running md5 on a whole list of files, I really don’t want to have to be scrolling up and down my terminal looking for the correct standard output that printed to the command line several commands ago. Instead, I’d like to save all my md5 checksums for all the files I am processing into one place. I can use output redirection to do this. Output redirection is super-easy. I just point an arrow “>” at the file where I want the output to go. I can type:
md5 automation1.jpg > automation-md5s.txt
My output from the command will end up right in that file. If I wanted to append the output of a second md5 run to the same file, I could then type:
md5 automation2.jpg >> automation-md5s.txt
The new output would be added at the end of the file. If I accidentally ask for an md5 checksum on a file that doesn’t exist, the program will print this instead:
md5: automation3.jpg: No such file or directory
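That looks like standard output too (it printed at the command line!) but it’s not. It’s standard error. If you want to send standard error to a file, you can type:
md5 automation3.jpg 2> automation-md5errors.txt
The “2>” redirects only standard error. In bash, “&>” would also work here, but it sends both standard output and standard error to the same file.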
You can also use combinations of these redirects to give you one output file and one error file, or write both output and error to a single file.
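The redirect combinations can be sketched as follows, assuming a bash-like shell. The report function is a hypothetical stand-in for md5 that writes one line to standard output and one to standard error, so the example runs anywhere and always produces both kinds of output.

```shell
# Work in a throwaway directory so no real files are overwritten.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical stand-in for md5: one line to standard output, one to standard error.
report() {
    echo "checksum for $1"
    echo "warning about $1" >&2
}

report file1 > checksums.txt 2> errors.txt     # output and errors in separate files
report file2 >> checksums.txt 2>> errors.txt   # append to both files
report file3 > everything.txt 2>&1             # both streams in a single file

cat checksums.txt     # two "checksum for" lines
cat errors.txt        # two "warning about" lines
cat everything.txt    # one line of each kind

report file4 > checksums.txt 2> errors.txt     # > starts over: earlier lines are gone
cat checksums.txt     # only "checksum for file4"
```

The last command is the thing to watch for: a single “>” empties an existing file before writing, so use “>>” when you mean to keep what is already there.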
But what if you don’t want to type all the commands one by one? That’s also easy in UNIX. Here’s how my files are organized in directories (as shown in the Finder). I have compressed fastq files for sixteen samples, and because I have paired-end reads I have two files for each sample. They are organized in directories named for the strain, and those directories are organized under two directories (C-type and E-type) describing the isolate type.
At the command line, I move into the top-level directory shown above (Vibrio-express) with the cd command.
If I just wanted to process all the files in the Vibrio-express directory, I could use my wildcard and type
md5 *
But I don’t really want to do that, because I don’t want the md5 checksums for these directories; I want the md5 checksums for files that are two layers down in the directories. So the files I really want are all the files located in all the directories located under all the directories in my current directory. I could express that as
*/*/*
But if I wanted to make sure I affected only the compressed fastq files in all the directories located under all the directories in my current directory, I would use the command:
md5 */*/*.fastq.gz
Because of course, being systematic UNIX people means that all our fastq files have the same file extension every single time, so we can address “only files of this kind” easily. (Learn this now, and live it forever.)
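To see how each layer of wildcard matches one layer of directories, here is a small sketch assuming a bash-like shell. The strain directories (strainA, strainB) and file names are hypothetical stand-ins for the real layout, and echo is used in place of md5 so that only the matched names are printed.

```shell
# Build a tiny copy of the two-level layout in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p C-type/strainA E-type/strainB
touch C-type/strainA/sampleA_R1.fastq.gz C-type/strainA/sampleA_R2.fastq.gz
touch E-type/strainB/sampleB_R1.fastq.gz E-type/strainB/notes.txt

echo *                 # C-type E-type
echo */*               # C-type/strainA E-type/strainB
echo */*/*.fastq.gz    # the three .fastq.gz files; notes.txt is left out
```

Each “*” stands for one directory level, and the file extension on the last level filters out everything that is not a compressed fastq file.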
So our full command with the redirect included could be
md5 */*/*.fastq.gz > Vvexpression-md5s.txt &
The & at the end means “give me another command line while you do this in the background”. If you want to check progress while your files are being processed, you can actually “look” into the growing file with the “wc” command:
wc Vvexpression-md5s.txt
The wc (word count) command will return
#lines #words #characters filename
So if you know that you are processing 32 files, and the first number is only 16, it will be a while before the process completes. (md5 writes one line per file, so the line count is the number of files finished so far.)
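If you only care about the line count, the -l option to wc prints just that number. The sketch below assumes a bash-like shell and fakes a half-finished checksum file with 16 placeholder lines so it can run without the real data; the expected total of 32 comes from the example above.

```shell
# Stand-in for a partially written checksum file (placeholder lines, not real checksums).
demo_dir=$(mktemp -d)
cd "$demo_dir"
for i in $(seq 1 16); do
    echo "MD5 (file$i.fastq.gz) = placeholder"
done > Vvexpression-md5s.txt

expected=32
# Reading the file through < makes wc print only the number, without the file name.
finished=$(wc -l < Vvexpression-md5s.txt)
echo "$finished of $expected checksums written so far"
```

Rerunning the last two lines every so often shows the count climbing toward the expected total.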