BINF 6203: Accessing the Short Read Archive
This is a brief review of a practical matter. At some point you may need to get hold of sequence data that other researchers have collected and deposited. The main resource you’ll use to find such datasets is the NCBI Sequence Read Archive (SRA).
In order to access data in the SRA, there are a few things you need.
1. Aspera Connect plug-in
This browser plug-in facilitates the transfer of large data sets from servers that use Aspera infrastructure. The transfer protocol is faster than http: or ftp:, and it’s vital to have this plug-in if you want to access the SRA. It’s simple to install: just click and download.
2. SRA Toolkit package
This little software package lets you convert files from the SRA native format to other common formats, like *.fastq, that most analysis software can read. To install it, simply download the archive for your particular platform. If you have superuser access to the machine you’re working on, you can install the package anywhere you like. If you’re constrained to put it somewhere within your home directory, just make a folder and expand the archive there. Precompiled binaries for your platform are included in the /bin directory. If you know how, you can add the /bin directory to your shell path so you don’t have to type out the full path to the program you want to use.
In the example above, I put the SRA toolkit in the applications folder of my home directory, and used the command line shown to run fastq-dump.
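If you would rather script the conversion than type it each time, here is a minimal Python sketch of how the fastq-dump call could be assembled. It is an illustration only: the function names, the install path, and the file name SRR000001.sra are placeholders made up for this example, not something the toolkit provides. The --split-files and --outdir options are fastq-dump options, but check the help output of your installed version before relying on them.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastq_dump_command(sra_file, out_dir, toolkit_bin=None, split_files=True):
    """Return the argument list for a fastq-dump call (nothing is executed here)."""
    # If a toolkit bin directory is given, spell out the full path to the binary;
    # otherwise rely on fastq-dump being found on the shell path.
    exe = str(Path(toolkit_bin) / "fastq-dump") if toolkit_bin else "fastq-dump"
    cmd = [exe]
    if split_files:
        # Ask fastq-dump to write the reads of a pair to separate files.
        cmd.append("--split-files")
    cmd += ["--outdir", str(out_dir), str(sra_file)]
    return cmd


def run_fastq_dump(cmd):
    """Run the command, failing early if the executable cannot be found."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"cannot find {cmd[0]}; check the toolkit path")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical install location and input file, for illustration only.
    command = build_fastq_dump_command(
        "SRR000001.sra",
        "fastq_out",
        toolkit_bin="/home/user/applications/sratoolkit/bin",
    )
    print(" ".join(command))
```

Leaving toolkit_bin unset corresponds to the case where the /bin directory is already on your shell path.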
3. A basic understanding of how information is organized in the SRA
Here’s the deal. The SRA does not have the world’s best user interface. But it will do. Everything’s just all in there in one big aggregate collection. Below you see a list of studies that includes a genome sequencing project, several RNA-Seq projects, some mapping studies aimed at finding the locations of binding sites and nucleosomes, and a metagenomics study. All in one place.
Information is organized in a hierarchy. The top level is “Studies”, each of which is the collection of all samples and runs for a particular published study. The second level is “Samples”, which means distinct biological samples. The final level is “Runs”, because in some cases you might use more than one sequencing run to sequence an individual sample. Each Study has a unique identifier (the DRP**** number at the left), and each individual sample and run within the study will have its own sub-identifier.
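The identifiers follow a regular pattern that you can read at a glance. The first letter records which archive issued the accession (S for NCBI, E for EBI, D for DDBJ), and the third letter records the level: P for a study, S for a sample, X for an experiment, R for a run. The sketch below turns that pattern into a small Python helper. The function name is made up for this example, the accession numbers used are placeholders, and the pattern deliberately covers only these four levels.

```python
import re

# Third letter of the prefix: the level in the hierarchy.
LEVELS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}
# First letter of the prefix: the archive that issued the accession.
ARCHIVES = {"S": "NCBI", "E": "EBI", "D": "DDBJ"}

# Three-letter prefix followed by digits, e.g. DRP000001 or SRR000001.
ACCESSION_RE = re.compile(r"^([SED])R([PSXR])(\d+)$")


def classify_accession(accession):
    """Return (archive, level) for an accession such as 'DRP000001'."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised study/sample/experiment/run accession: {accession!r}")
    archive_letter, level_letter, _digits = match.groups()
    return ARCHIVES[archive_letter], LEVELS[level_letter]


if __name__ == "__main__":
    for acc in ["DRP000001", "SRS000001", "SRR000001"]:
        print(acc, classify_accession(acc))
```

A check like this is handy when a paper lists several accessions and you want to know which ones point at whole studies and which at single runs.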
You can search the database in a few ways.
- By the unique accession number for a study or sample. Most authors will put at least the study accession number in the paper describing the study, so if you know the data set you want, this is the easiest way to find it.
- By Entrez keywords, including species names.
- By sequence, using BLAST. Considering the nature of short read data, this may not be the most efficient way to find what you want.
- You can also get to SRA entries from other linked databases at NCBI, so if you are looking for studies related to a particular species you may find them that way.
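Keyword searches can also be made from a script through NCBI's Entrez E-utilities instead of the web page. The Python sketch below only builds the query URL for the ESearch service against the sra database; it does not contact NCBI. The function name and the example search term are assumptions made for illustration, and you should consult the E-utilities documentation for usage limits before sending requests in bulk.

```python
from urllib.parse import urlencode

# Base address of the Entrez ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(term, retmax=20):
    """Build an ESearch URL that queries the SRA database for the given term."""
    params = {
        "db": "sra",       # search the Sequence Read Archive
        "term": term,      # accession, keyword, or species name
        "retmax": retmax,  # maximum number of record IDs to return
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example term only: restrict the search to one species by name.
    print(build_sra_search_url("Escherichia coli[Organism]"))
```

The same function accepts an accession number as the term, which mirrors the first search method in the list above.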
If you choose an individual entry to examine from the big list of studies, you’ll come to an overview page for that experiment, with a brief description of the experiment and information about how many samples and runs are available:
If you select one of the samples, you will get a more detailed set of information about just the runs contained within that sample. Among other things, you will be able to get detailed information about the sample, the way the library was constructed, and the sequencing platform used. Be careful about drawing conclusions about the paired-ness of your data from the glyph; make sure what you see there is reflected in the detail on the page about each individual run:
Finally, if you click through to the page for each individual run, you’ll see detailed metadata for the run:
And also a download tab where you can connect with Aspera and download the data: