Browsed by
Category: BINF 6215

BINF 6215: Trinity and Corset at the command line

This tutorial is my version of the workflow for analyzing the Synechocystis PCC6803 gene expression data using Trinity and Corset. Disclaimer: walking through the workflow shows that there is plenty to be skeptical about in this dataset, or at least in the stat and exp samples, so we should definitely talk about what to look for.

Define the problem

To start off with, let’s just pretend that this Synechocystis data set has no reference genome. How will we determine where the reads belong…

BINF 6215: Basic variant calling in Galaxy

Remember the chloroplast variant calling tutorial? Turns out, you can implement the same thing in Galaxy. (Sort of.) Since the chloroplast files are very small, I’m recommending you do this on Galaxy Main, because some of the tools in this sequence mysteriously did not work in my own Galaxy, and those are not always issues that can be fixed in real time.

Identify the tools

If you think back to the tools involved in our simple variant calling pipeline, the major steps…

BINF 6215: command line variant calling

Define the problem

- Ion Torrent sequence for 12 tomato varietal chloroplasts
- One reference genome (NC_007898)
- Map the reads to the reference
- Identify variants specific to each strain

Identify the tools

Obviously you’ve done some of this work already in the previous example. For the steps after data cleaning, that’s where we’re going to learn something new. If you think back to the CLC tools involved in variant calling, after cleaning there is a mapping step, and then the mapping is…

Galaxy NGS 101: Synechocystis remix

I made a previous version of this tutorial with some actual expression data from our lab. This version uses an already-normalized set of single-end transcriptome data from Synechocystis PCC 6803. (Normalized data here). Trimming and normalization were done using Trimmomatic and khmer (single-pass), and have been discussed in other posts. The dataset size is more manageable on Galaxy Main and on our lab computers than the V. vulnificus data. Based on conversation with Titus Brown, the normalization procedure used is likely…

BINF 6215: Using bpipe

bpipe is part of a relatively recent trend toward pipeline manager tools that work at the command line. Other examples of this trend are Snakemake, Leaf and nestly. You may want to consider using one of these other systems for building command line pipelines, but bpipe is a good place to start. bpipe is extensively documented and makes it relatively easy to get started building shell script components and turning them into a collection of functional, reusable modules. bpipe is…

BINF 6215: command line challenge project

Now that you’ve worked at figuring out some command line software and made a basic bash script, we’re going to work on a better bash script. Your challenge (which may stretch over today’s and tomorrow’s work with bpipe) is:

- Start with the Synechocystis PCC6803 transcriptome data set
- Convert the SRA downloads into FASTQ files
- Prepare the sequences: trimming, single-pass digital normalization, etc.
- Map the sequences to the reference genome
- Perform transcriptome comparison and basic differential expression analysis

Your scripting challenges are:…

BINF 6215: Building a bash shell script

At this point you should have a series of command lines that you have vetted by manually testing the complete workflow on one of your chloroplast sequences. Here’s my solution: I used the FASTX Toolkit for trimming before running SPAdes and QUAST. In my test of this pipeline, the adapter clipping steps reduced my data set from 213112 reads to 209076 reads. (Answers will appear after you’ve attempted this on your own.) I viewed the trial data set with FastQC before…
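
If you want to check a read-count change like the one above for yourself, here is a minimal sketch. It assumes plain four-line FASTQ records with no wrapped sequence lines. The function name count_reads and the tiny demo file are hypothetical, made up for illustration, and are not part of the course solution.

```shell
#!/usr/bin/env bash
# Count reads in a FASTQ file, assuming each record is exactly four lines
# (header, sequence, '+', qualities) with no wrapped sequence lines.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Hypothetical two-read file, created only so the sketch runs on its own.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo_fastq"
echo "demo reads: $(count_reads "$demo_fastq")"

# Read counts reported in the text, before and after adapter clipping.
before=213112
after=209076
removed=$(( before - after ))
pct_removed=$(awk -v b="$before" -v r="$removed" 'BEGIN { printf "%.2f", r / b * 100 }')
echo "removed $removed reads ($pct_removed% of the input)"
```

Running count_reads on your own files before and after clipping gives the two numbers to compare. With the counts quoted here, clipping removed 4036 reads, a little under 2% of the data.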

BINF 6215: Figuring out UNIX command line software

The most common things that you’ll want to do with shell scripts in bioinformatics are 1) data manipulation (which is what we practiced this morning) and 2) driving programs to run automatically, collecting their output, and feeding it into other programs. The little UNIX commands that are part of your operating system are powerful, but usually you’ll end up wanting to run programs that other people have developed. In order to successfully build a script that drives a bioinformatics pipeline,…
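
As a rough illustration of "collecting output and feeding it into other programs", here is a small sketch that uses only standard UNIX commands. The file names and the three-line table of hits are hypothetical, created on the spot so the script runs anywhere. A real pipeline would put a bioinformatics program in place of the first step.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and input table (name, position),
# created here only so the example is self-contained.
workdir=$(mktemp -d)
printf 'chr1\t100\nchr2\t50\nchr1\t300\n' > "$workdir/hits.tsv"

# Step 1: run a program and collect its output in a file.
cut -f1 "$workdir/hits.tsv" > "$workdir/names.txt"

# Step 2: feed that output into other programs through a pipe:
# sort the names, count duplicates, then order by count, highest first.
sort "$workdir/names.txt" | uniq -c | sort -k1,1nr > "$workdir/counts.txt"

cat "$workdir/counts.txt"
```

Writing intermediate results to files, as in step 1, makes each stage easy to inspect. Piping, as in step 2, avoids the temporary files when you don't need them.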

BINF 6215: Galaxy NGS 101

This tutorial draws on some of the online Galaxy tutorials (here) and videos (here), but I have made some of the steps more explicit for you with screenshots.

Galaxy data formats

You can think of Galaxy’s data formats in a few main categories:

- Sequence data (FASTA, FASTQ)
- Alignments (BLAST results, SAM and BAM files)
- Track data (genomic intervals, WIG files for continuous-value tracks)
- Tabular data (many kinds; interval data is a subset)

The key thing is, though, if your…

BINF 6215: Shell scripting 101

Shell scripting is a powerful way to string commands together, make commands repeat themselves on a list of files, and build all manner of other useful combinations of function. Scripting isn’t really “programming” per se; it’s one level of abstraction above. A script is just a file that contains a list of commands that you want to execute, in the order you want them executed. It’s also possible to create simple loops in shell scripts so that commands can be repeated…
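
To make the idea of a simple loop concrete, here is a minimal sketch of a script that repeats one command over a list of files. The function name count_fasta_records and the two sample FASTA files are hypothetical, created inside the script so that it runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/sample_a.fasta"
printf '>seq1\nTTAA\n' > "$workdir/sample_b.fasta"

# Count FASTA records by counting header lines (lines that start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# The loop: run the same command once for every .fasta file in the directory.
for f in "$workdir"/*.fasta; do
    echo "$(basename "$f"): $(count_fasta_records "$f") record(s)"
done
```

Save this as a file, make it executable with chmod +x, and run it. The commands execute top to bottom, and the body of the for loop runs once per matching file.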
