BINF 6215: command line challenge project
Now that you’ve figured out some command-line software and written a basic bash script, we’re going to work on a better one. Your challenge (which may stretch over today and tomorrow’s work with bpipe) is:
- Start with the Synechocystis PCC6803 transcriptome data set
- Convert the SRA downloads into FASTQ files
- Prepare the sequences: trimming, single-pass digital normalization, etc.
- Map the sequences to the reference genome
- Perform transcriptome comparison and basic differential expression analysis
Your scripting challenges are:
- Don’t do this all in one big “for” loop. Break the script down into sections
- Use at least one function that you put into your ~/.bash_profile, as discussed in the book
- Alert the user when script sections have finished, and prompt them to check the intermediate output and give permission for the script to proceed, especially if the next step is computationally expensive
- Once intermediate outputs are no longer needed, compress or delete them
- Put usage assumptions in comments, for example if you take a variable from the command line or assume a particular directory structure
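As a starting point for the scripting challenges above, here is a minimal sketch of the helper functions such a script might use. The names `confirm`, `checkpoint`, `compress_intermediates`, and `pipeline.sh`, and the `raw/` directory layout, are all hypothetical; adapt them to your own project.

```shell
#!/usr/bin/env bash
# Usage (assumed for illustration): ./pipeline.sh <project_dir>
# Assumes <project_dir>/raw/ holds the downloaded .sra files.

# Ask a yes/no question; succeeds (returns 0) only if the answer starts with y or Y.
confirm() {
    local answer
    read -r -p "$1 [y/N] " answer
    [[ "$answer" == [yY]* ]]
}

# Announce that a stage has finished and ask for permission to continue.
# Returns non-zero if the user declines.
checkpoint() {
    local stage="$1" outdir="$2"
    echo "Finished: ${stage}. Check the intermediate output in ${outdir}."
    confirm "Proceed to the next step?" || { echo "Stopping after ${stage}."; return 1; }
}

# Compress intermediate FASTQ files under a directory once they are no longer needed.
compress_intermediates() {
    local dir="$1"
    find "$dir" -type f -name '*.fastq' -exec gzip -f {} +
}

# Example of how the sections might be chained (stage functions not shown):
#   convert_sra "$1/raw" "$1/fastq" && checkpoint "SRA conversion" "$1/fastq" || exit 1
#   map_reads "$1/fastq" "$1/mapped" && compress_intermediates "$1/fastq"
```

If you move one of these functions into your ~/.bash_profile, keep in mind that a script started with `bash pipeline.sh` does not read that file on its own, so the script would probably need a line such as `source ~/.bash_profile` near the top.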
Tip: You CAN use the Tuxedo pipeline (TopHat, Cufflinks) with bacterial RNA-Seq data, especially if you have an existing annotation. Pre-compiled executables are available for Mac.
Tip: You CAN test parts of your pipeline on the pre-normalized Vibrio vulnificus CMCP6 data that I’ve put out in the Dropbox, while waiting for the earlier stages to run.
Tip: You CAN run time-consuming parts of your pipeline once (like the SRA conversion and digital normalization) and comment them out once successful, so you don’t have to repeat them as you test later stages.
Tip: You are allowed to collaborate with a partner on the challenge project, but your goal should be to get your components together into one working script.
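As an alternative to commenting out the time-consuming stages by hand, a script can skip a stage whose marker file already exists. This is a different technique from the commenting-out approach in the tip, shown only as a sketch; `run_once` and the marker file names are hypothetical.

```shell
# Run a command only if its marker file does not exist yet.
# On success, create the marker so later runs skip the command.
# Usage: run_once <marker_file> <command> [args...]
run_once() {
    local marker="$1"
    shift
    if [[ -e "$marker" ]]; then
        echo "Skipping: ${marker} already exists."
        return 0
    fi
    "$@" && touch "$marker"
}

# Example (convert_sra is a stage function you would write yourself):
#   run_once work/sra_conversion.done convert_sra raw/ fastq/
```

To force a stage to run again, delete its marker file. Because the marker is only created when the command succeeds, a failed stage is retried on the next run.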