BINF 6215: Download and work with SRA data
Refer back to this post for basic information about how to set up your machine with the right software and access the SRA. The tools you need (Aspera Connect, sratoolkit) should already be installed on your workstation. The data set you want is here.
This is a very small ChIP-Seq dataset generated by the Wadsworth Center. It will download a lot faster than other samples you may have tried.
In your notebook, answer the following questions about the data:
- What is the sequencing platform?
- Are the reads paired or single?
- Which samples are the experiment and which are the control?
- Describe the experiment and control. You may need to use Google Scholar to find out what the mutant is, since there is not a paper associated with this study yet.
Once you have the files downloaded, try to import them into CLC Genomics. Can CLC Genomics accept input files in the SRA format?
If not, you need to convert your files into a format CLC can read.
You can convert these files to *.fastq format using the fastq-dump program in the SRA Toolkit. The simple syntax for what you want to do is just:
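In its simplest form, fastq-dump takes the path to an .sra file as its only argument and writes a .fastq file next to where you run it. Here is a minimal sketch of that, wrapped in a loop so every downloaded file gets converted. The file name SRR000001.sra, the SRA_DIR and OUT_DIR variables, the DRY_RUN switch, and the convert_sra helper are all made up for illustration; substitute the names of the files you actually downloaded.

```shell
#!/bin/sh
# Simplest case, one file at a time (hypothetical file name):
#   fastq-dump SRR000001.sra

# Sketch: convert every .sra file in a directory to FASTQ.
# SRA_DIR and OUT_DIR are assumed paths, not part of the course setup.
SRA_DIR="${SRA_DIR:-.}"
OUT_DIR="${OUT_DIR:-fastq}"

# Convert one file. With DRY_RUN=1 the command is printed, not run.
convert_sra() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "fastq-dump --outdir $OUT_DIR $1"
  else
    mkdir -p "$OUT_DIR"
    fastq-dump --outdir "$OUT_DIR" "$1"
  fi
}

for sra in "$SRA_DIR"/*.sra; do
  [ -e "$sra" ] || continue   # skip if the glob matched nothing
  convert_sra "$sra"
done
```

Running it with DRY_RUN=1 first lets you check which commands would run before you commit to the conversion.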
For the relatively small files we are working with here, the conversion should only take a few minutes. While you’re waiting, examine the fastq-dump help page to see what else fastq-dump can do for you.
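Two options from the help page that are likely to be useful here are --split-files, which writes each read of a pair to its own file, and --outdir, which sets where the output goes. The sketch below shows one way to combine them. SRA_FILE is a hypothetical file name, and whether you need --split-files depends on your answer to the paired-or-single question above.

```shell
#!/bin/sh
# View the full option list first:
#   fastq-dump --help

# SRA_FILE is an assumed file name; replace it with one you downloaded.
SRA_FILE="${SRA_FILE:-SRR000001.sra}"

# --split-files : write each read of a spot to its own file (_1, _2)
# --outdir      : directory to write the FASTQ output into
OPTS="--split-files --outdir fastq"
CMD="fastq-dump $OPTS $SRA_FILE"

if command -v fastq-dump >/dev/null 2>&1 && [ -f "$SRA_FILE" ]; then
  # Unquoted on purpose so the options split into separate words;
  # this assumes the file name contains no spaces.
  $CMD
else
  # Tool or file not available: show the command that would run.
  echo "$CMD"
fi
```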
Once each of your files has been converted, you can load the *.fastq files into CLC using the import tool as usual. I suggest you make a directory for them since we’ll analyze them separately from other sequences you’re downloading and importing.
Note: if you are working on a Mac where you have control of things, you can install the Homebrew package manager and then type the following two commands to get a working installation of the SRA Toolkit on your own machine:
brew tap homebrew/science
brew install sratoolkit