BINF 6215: ChIP-Seq workflow in CLC
On the first day of class, we downloaded a small ChIP-Seq data set and used it to practice with the SRA Toolkit. Now we’re going to go back and analyze it. Your first step should be to work through the “Basic” and “Advanced” ChIP-Seq tutorials and see what kind of information you can extract from a single ChIP-Seq data set. In the ChIP-Seq analysis world, a “control” has a special meaning. It’s not just a comparison between conditions; the control for a ChIP-Seq analysis is a run in which no immunoprecipitation step was performed. We don’t have that data available for the small data set we are working with, so run the basic and advanced analyses without a control.
Peak Shape ChIP-Seq
However, the ChIP-Seq data set that we have includes two conditions: a wild type and a mutant organism. To compare them, we have to use a CLC tool that is currently in beta and available as a plugin, the Peak Shape ChIP-Seq tool. This tool should already be installed in CLC on your computer; if it’s not, you should be able to install it yourself.
The process for using the Peak Shape ChIP-Seq tool is a little different — with it, you can take your peak information and convert it to tracks that can be filtered and compared.
Step 0: Linearize the reference genome
In the beta version of the tool, attempting to use a circular (bacterial) reference genome triggered a bug. So the first thing we’ll do is make the reference genome linear. To do this:
- Open the reference sequence
- Click on the sequence name in the middle of the big circle
- Choose “Make Sequence Linear” from the menu
- Save the sequence as something like NC_000913_linear
- Convert the sequence into tracks so you have annotation track information for this version of the reference genome as well
This step can’t really go into your workflow — it’s not connected to the “input” of your workflow per se. Most of the following steps, however, can be connected together in a workflow. Use the information from the CLC Batch Mode and CLC Tracks manual sections and try to maximize the number of steps you can combine into a workflow.
Step 1: QC your reads
The nominal length of reads in this run is 50 bases; you should not expect to see anything much longer than that, so use that information as a guide when you are filtering.
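To make the length criterion concrete, here is a minimal Python sketch of length-based read filtering done outside CLC. It is only an illustration of the idea, not what CLC's trimming tool does internally. The lower cutoff of 20, the helper names, and the example reads are all assumptions invented for this sketch; only the nominal length of 50 comes from the run.

```python
from typing import List, Tuple

# One FASTQ record as (header, sequence, quality).
FastqRecord = Tuple[str, str, str]

NOMINAL_READ_LENGTH = 50  # nominal read length stated for this run
MIN_READ_LENGTH = 20      # hypothetical lower cutoff, for illustration only


def parse_fastq(text: str) -> List[FastqRecord]:
    """Parse well-formed four-line FASTQ text into records."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [(lines[i], lines[i + 1], lines[i + 3])
            for i in range(0, len(lines) - 3, 4)]


def filter_by_length(records: List[FastqRecord],
                     min_len: int = MIN_READ_LENGTH,
                     max_len: int = NOMINAL_READ_LENGTH) -> List[FastqRecord]:
    """Keep reads whose sequence length lies in [min_len, max_len]."""
    return [rec for rec in records if min_len <= len(rec[1]) <= max_len]


def make_read(name: str, length: int) -> str:
    """Build one made-up FASTQ record of the given length."""
    return f"@{name}\n{'A' * length}\n+\n{'I' * length}\n"


# Three invented reads: one at the nominal length, one too short, one too long.
EXAMPLE_FASTQ = (make_read("read_ok", 50)
                 + make_read("read_short", 12)
                 + make_read("read_long", 75))

kept = filter_by_length(parse_fastq(EXAMPLE_FASTQ))
print([header for header, _seq, _qual in kept])
```

In practice you would pick the cutoffs after looking at the length distribution in your QC report rather than hard-coding them.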
Step 2: Map reads to the reference
You can use the default parameters here. Be sure that you select your linearized reference genome as the reference, and that you output standalone read mappings as well as reads tracks.
Step 3: Run the peak shape analysis
The Peak Shape ChIP-Seq tool is in beta. It doesn’t do nifty stuff like running statistics on your replicate samples. So you’ll want to try to set up your input folders for batching so that the two replicates for each condition get pooled together. You could do this at this stage or at the mapping stage.
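The pooling idea can be sketched in a few lines of Python: group the replicate files by condition so that each condition becomes one batch unit. The file names and the `condition_repN` naming scheme below are hypothetical; adjust the parsing to however your own files are actually named.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pool_by_condition(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group replicate files by the condition prefix before the first '_'.

    Assumes a hypothetical naming scheme such as 'wt_rep1.fastq'.
    """
    pools: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        condition = name.split("_", 1)[0]
        pools[condition].append(name)
    # Sort within each pool so the grouping is reproducible.
    return {condition: sorted(files) for condition, files in pools.items()}


# Invented file names: two replicates for each of the two conditions.
example_files = [
    "wt_rep1.fastq",
    "hns_rep1.fastq",
    "wt_rep2.fastq",
    "hns_rep2.fastq",
]

for condition, files in pool_by_condition(example_files).items():
    print(condition, files)
```

In CLC the equivalent is organizational: one folder per condition, each holding both replicates, so that batching treats each folder as a single unit.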
You’ll want to save your Peaks track. There are things you can do with your Peak Shape Filter as well, so if you want to experiment with downstream analyses, save that too.
Step 4: Convert mappings to tracks
You can turn your read mappings into read tracks with annotations so that they can be viewed in a combined track list along with the peak tracks.
Step 5: Filter and compare tracks
To get a simple list of peaks that differ between our hns mutant and wild type samples, you can use the track filtering tools. This really isn’t a substitute for a statistical analysis, but it’s what CLC can do at this time. Later in the class we’ll look at this data set with more rigorous peak analysis tools. Use the Filter Based on Overlap tool in the Track Tools to make two overlap track lists: one with hns as the query and wt as the reference, and one with wt as the query and hns as the reference.
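The logic behind the two overlap filters can be sketched in Python as follows. This is not CLC's implementation. It assumes peaks are closed intervals on a single linear reference and that the goal is to keep query peaks with no overlapping reference peak; the coordinates are invented for illustration.

```python
from typing import List, Tuple

# A peak as (start, end), both inclusive, on one linear reference.
Peak = Tuple[int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def filter_by_overlap(query: List[Peak], reference: List[Peak],
                      keep_overlapping: bool = False) -> List[Peak]:
    """Keep query peaks that do (or do not) overlap any reference peak."""
    return [q for q in query
            if any(overlaps(q, r) for r in reference) == keep_overlapping]


# Invented peak coordinates for the two conditions.
wt_peaks = [(100, 200), (500, 600), (900, 1000)]
hns_peaks = [(150, 250), (2000, 2100)]

# hns as query, wt as reference: peaks seen only in the mutant.
hns_only = filter_by_overlap(hns_peaks, wt_peaks)
# wt as query, hns as reference: peaks seen only in the wild type.
wt_only = filter_by_overlap(wt_peaks, hns_peaks)

print("hns only:", hns_only)
print("wt only:", wt_only)
```

Running the filter in both directions matters because overlap filtering is asymmetric: each run reports only peaks from the query track, so a single run can never show peaks unique to the reference.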
Step 6: Combine and display tracks
Create track lists for each condition that combine the read mapping track, the peaks track, and some key annotation tracks, along with your differential tracks.
In your notebook, answer the following:
- In which sample are there more peaks?
- Are there changes in both “directions” or are the changes all one way?
- hns is a DNA binding protein. Other than the one mutation under study, the mutant is isogenic with the wild type. What effect did the hns mutation have?
- Which parts of this process can be encoded in a single workflow using CLC?
- Which parts cannot?
- For the parts that cannot be encoded in a workflow, or where you must separate steps, what would the Workflow tool need to allow in order to automate them (what additional functions would it need)?
Resources
- CLC ChIP-Seq tutorial: The Basics
- CLC ChIP-Seq tutorial: Advanced
- CLC Peak Shape ChIP-Seq Analysis Manual
- Kidder et al. on technical issues in ChIP-Seq
- DIME paper (R package for differential identification)