BINF 6203: Annotation (Part 1 — manipulating CLC annotations)
The purpose of this lab is to annotate a chloroplast genome sequence with interesting information. In the dropbox, I have reloaded all the chloroplasts and renamed them to short names, so I am going to refer to them by their shortened names. I used sample #29, orange 14, file ionXpress_029.fastq, which is coincidentally the Heinz 1706-BG strain, because it had the most reads of any of the samples. When I mapped my reads to the reference chloroplast, the mapping track produced by CLC looked like this:
That is, reads mapped at some depth all along the reference genome, although considerably more for some amplicon regions than for others. I used NGS Core Tools:Extract Consensus Sequence to make a consensus sequence from the mapped reads. My consensus sequence is a single sequence the length of the reference genome with only a few ambiguous characters.
If I want to use this consensus sequence for anything else (for example, input to a genefinding program) I can export it in FASTA format, using the CLC File:Export tools. (Do this!)
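Once the consensus is exported, it can be worth confirming outside CLC that it really does contain only a few ambiguous characters. The following is a minimal sketch, not part of the CLC workflow: it parses FASTA text and counts every character that is not a plain A, C, G or T. The record name and the short sequence in the example are made up for illustration; in practice you would read the text from your exported file.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def ambiguity_report(seq):
    """Count characters that are not a plain A, C, G or T."""
    counts = Counter(seq)
    return {base: n for base, n in counts.items() if base not in "ACGT"}


# Made-up example record; replace with open("your_export.fa").read()
example = ">consensus_demo\nACGTNNACGT\nRYACGT\n"
for name, seq in read_fasta(example).items():
    print(name, len(seq), ambiguity_report(seq))
```

If the report shows long runs of N rather than a handful of scattered codes, that usually points to regions with little or no read coverage, which is worth knowing before you hand the sequence to a genefinder.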
If you look at the reference chloroplast sequence in GenBank, you can learn a few things about what kind of signals you’ll have to look for in your new, raw consensus sequence. There are a lot of single-exon genes in there and just a few multi-exon genes. There are also a lot of tRNA genes. For the purpose of the lab exercise, we’re going to annotate the chloroplast using pretty simple methods. This isn’t too far-fetched: the chloroplast’s origin is in endosymbiosis, in which a bacterium becomes a dependent constituent of a more complex eukaryotic cell. So we’ll treat our chloroplasts as little cyanobacteria for the purpose of getting some experience with annotation tools.
Finding open reading frames (ORFs)
CLC Genomics does not have all the genefinding tools you need built in, but it can do simple ORF finding for you and display track files created by other methods. To identify ORFs on your sequence and then work with them, you’ll have to take the following (rather cumbersome) series of steps:
- Open your saved consensus sequence.
- Select Classical Sequence Analysis:Nucleotide Analysis:Find Open Reading Frames (this finds your open reading frames)
- Select Track Tools:Convert to Tracks (this will change the way that CLC is handling your data, breaking apart the ORF information from the genome and treating each separately)
- Create both Sequence and Annotation tracks
- Click the + at the side of the “Annotation types” window and select the ORF track to be saved as an annotation. Choose to save them, not open them.
- In your track tools, now choose “create track list” and add both tracks. That will show your genome and your ORF track aligned to each other.
- If you actually want to manipulate the sequences in your ORF track, you need to extract the sequences.
- You also need to have saved the Genome track as a separate CLC entity, because you’ll need it as a reference for extracting the annotations.
- In Classical Sequence Analysis:General Sequence Analysis:Extract Annotations, again pick the ORF track to extract, and save it to a sequence list. This will allow you to work with the ORFs in the file as individual sequences.
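To make it concrete what a simple ORF finder is doing, here is a rough sketch of the idea in Python. It is not CLC's implementation, and its choices are assumptions made for illustration: ORFs start at ATG only, end at the first in-frame TAA, TAG or TGA, are searched in all six frames, and the sequence is treated as linear (an ORF spanning the ends of a circular genome, or running off the end without a stop, is not reported). The `min_codons` threshold counts the stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}
# IUPAC-aware complement table, so ambiguity codes survive reverse-complementing
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) 0-based, end-exclusive ORF coordinates on one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=30):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        found.append((n - e, n - s, "-"))  # map back to forward coordinates
    return sorted(found)


# Toy sequence with one tiny ORF (ATG AAA TGA) on the forward strand
toy = "CCATGAAATGACC"
print(find_orfs(toy, min_codons=3))
```

Running this on your exported consensus with a realistic `min_codons` gives a list you can compare against the ORF track CLC produces. Expect some differences, since CLC's tool has its own settings for start codons, minimum length and genetic code.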
At the end of these operations you should see the following track files and sequence lists in your directory.
Follow a similar procedure to extract the CDS (coding sequence) track from your reference genome and convert it into sequences. You can then make a BLAST database out of the CDS track, and see which ORFs have sequence matches in the known CDS, or vice versa.
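If you end up with your ORF-versus-CDS hits as a tab-separated table, a few lines of Python can summarize which ORFs have a convincing match. This sketch assumes the 12-column layout produced by NCBI BLAST+ with `-outfmt 6`; a table exported from CLC may use different columns, so adjust `COLUMNS` to match what you actually have. The ORF names, gene names, numbers and the `max_evalue` cutoff in the example are invented for illustration.

```python
import csv
import io

# Column order of NCBI BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query id: (subject id, percent identity, bitscore)} for top hits."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # too weak to count as a match
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][2]:
            best[query] = (hit["sseqid"], float(hit["pident"]), score)
    return best


# Invented example rows: two hits for ORF_1, one weak hit for ORF_2
example_hits = (
    "ORF_1\trbcL\t99.5\t1434\t7\t0\t1\t1434\t1\t1434\t0.0\t2600\n"
    "ORF_1\tpsbA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t90.1\n"
    "ORF_2\tpsbA\t70.0\t50\t15\t0\t1\t50\t1\t50\t0.5\t30.2\n"
)
for orf, (gene, identity, score) in best_hits(example_hits).items():
    print(orf, gene, identity, score)
```

ORFs that do not appear in the result have no hit passing the cutoff. Swapping which sequence list is the query and which is the database gives you the "vice versa" view: known CDS that none of your ORFs recovered.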