BINF 6203: Lab 1 — Orientation
The purpose of this lab exercise is to get you oriented to using genomic data sources, retrieving and converting files, loading files into CLC Genomics, etc. For this lab, you do not need to produce a lab report. You do have to stick around until you’ve verified with us that you’ve managed to finish these tasks.
What we’re assessing here is partly how comfortable you are just jumping in and using unfamiliar software or using the Mac in unfamiliar ways.
Download RNASeq data from the Sequence Read Archive (SRA). Steps:
- Go to the SRA website.
- Search for “Synechocystis”
- In the left menu, select RNA
Now, get some of the sets of primary reads on which this paper is based. There are 10 conditions, but you only need to pick four: the stationary phase, the exponential phase, and two other conditions of interest. To do this:
- Find the first set of reads from the experiment (labeled primary reads from Synechocystis 6803 stationary phase)
- Click through to the individual page for that set of reads
- Click on the file size label (999.4 Mb) to get a list of downloadable files with the *.sra extension
- Click on the file to download it. Aspera Connect is supposed to be installed on your computer, and a special Aspera download window should open up. Choose a location for the file on your computer and save it there.
- Do the same for the other sets of reads. If you had lots of data to download, you could use the SRA Toolkit to get files as a batch rather than one by one.
Convert the data to FASTQ format using the SRA Toolkit.
- Open a terminal window as shown in class.
- Change into the directory where you stored your *.sra files.
- Type fastq-dump filename, replacing filename with the name of one of your *.sra files, and wait for the conversion to finish. Do the same with the rest of your *.sra files.
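If you have several *.sra files, the conversion step can be scripted rather than typed once per file. The following is a minimal Python sketch, assuming fastq-dump from the SRA Toolkit is on your PATH. The function names and the dry_run parameter are illustrative and are not part of the toolkit.

```python
import subprocess
from pathlib import Path


def build_fastq_dump_commands(directory):
    """Return one fastq-dump command (as an argument list) per *.sra file."""
    sra_files = sorted(Path(directory).glob("*.sra"))
    return [["fastq-dump", sra.name] for sra in sra_files]


def convert_all(directory, dry_run=True):
    """Run fastq-dump for each *.sra file, or only print the commands."""
    commands = build_fastq_dump_commands(directory)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # cwd=directory mirrors "change into the directory" in the steps above.
            subprocess.run(cmd, cwd=directory, check=True)
    return commands
```

Calling convert_all(".", dry_run=True) from the directory holding your files prints the commands without running them; switch to dry_run=False once the list looks right.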
Download the reference genome and its reference annotation from EMBL-EBI (because we want the annotation in *.gtf format later).
- Go to the EBI Bacterial Genomes download page.
- Scroll through and find Synechocystis PCC6803 (the version tagged ASM972v1). Download the FASTA genome file and the GTF file for this genome.
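Once both files are downloaded, it can be worth checking that the sequence names in the GTF annotation match those in the FASTA file, since the annotation is only usable later if they agree. The sketch below is one way to do that, assuming a plain-text (uncompressed) FASTA file and a tab-separated GTF file. The function names are hypothetical.

```python
def fasta_sequence_names(fasta_path):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_sequence_names(gtf_path):
    """Collect the values of column 1 (the sequence name) from a GTF file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t")[0])
    return names


def unmatched_names(fasta_path, gtf_path):
    """Return GTF sequence names that have no matching FASTA record."""
    return gtf_sequence_names(gtf_path) - fasta_sequence_names(fasta_path)
```

An empty set from unmatched_names means every annotated sequence has a matching FASTA record.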
Import the RNASeq data sets into CLC Genomics. Remember (or now you know!) that these data sets are single-end Illumina sequence data and they need to be imported accordingly.
- In your CLC top menu, choose Import / Illumina.
- In the interactive menus that come up, choose the *.fastq file that you want to import, and pick the appropriate menu choices:
- Turn off paired end reads
- Choose Illumina pipeline 1.8 and later
- Choose “open” to open the imported file in the main CLC window
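The "Illumina pipeline 1.8 and later" choice matters because those pipelines write quality scores with an ASCII offset of 33, whereas Illumina pipelines 1.3 through 1.7 used an offset of 64. If you want to check a file yourself, the sketch below looks at the quality lines of the first reads. It assumes four-line FASTQ records, and the function names and the max_reads parameter are illustrative. Any quality character with an ASCII code below 59 can only occur with offset 33; if none is found, the check is inconclusive.

```python
def min_quality_code(fastq_path, max_reads=10000):
    """Return the lowest ASCII code in the quality lines of the first reads."""
    lowest = None
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # the 4th line of each record is the quality string
                qual = line.rstrip("\n")
                if qual:
                    code = min(ord(c) for c in qual)
                    lowest = code if lowest is None else min(lowest, code)
    return lowest


def guess_quality_encoding(fastq_path):
    """Return 'phred33' if the file must use offset 33, else 'undetermined'."""
    lowest = min_quality_code(fastq_path)
    if lowest is not None and lowest < 59:
        return "phred33"
    # High-quality offset-33 data can also land here, so do not assume offset 64.
    return "undetermined"
```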
Map one of your RNASeq data sets to the reference genome in CLC Genomics.
- Use standard import to import your reference genome file.
- Use NGS Core Tools / Map Reads to Reference to map your reads to the reference genome. Select the appropriate files, leave everything else as defaults, and map your reads.
- Use the CLC Genomics Export tools to save your mapped reads as a *.sam file.
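As a quick check on the exported *.sam file, you can count how many alignment records are flagged as mapped. The sketch below is a minimal example that relies only on the SAM format itself: header lines start with "@", the FLAG is the second tab-separated column, and bit 0x4 marks an unmapped read. Whether your export includes unmapped reads at all depends on the export settings, so treat the unmapped count as an assumption to verify. The function name is hypothetical.

```python
def count_mapped(sam_path):
    """Count mapped and unmapped alignment records using FLAG bit 0x4."""
    mapped = 0
    unmapped = 0
    with open(sam_path) as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("@") or not line.strip():
                continue
            flag = int(line.split("\t")[1])
            if flag & 0x4:
                unmapped += 1
            else:
                mapped += 1
    return mapped, unmapped
```

This counts records rather than reads, so a read reported with more than one alignment is counted more than once.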