UPDATED: Read Mapping
Mapping short sequence reads to a reference sequence is a common task in genomics. Many different results can be extracted from a mapped sequence, depending on the original experimental design that produced the sequence reads and on the analysis that follows the mapping. For example:
- a genomic consensus for an individual (against the reference genome for that species)
- location of SNPs and other variations in one genome relative to the other
- location of expressed transcripts (coding mRNAs, noncoding RNAs such as miRNAs and lncRNAs)
- quantity of expressed transcripts (for differential expression analysis)
- location of protein binding sites detected by assays such as ChIP-seq
However, all of these analyses ultimately come back to mapping, and their outcomes depend on the accuracy of the mapping process, on how non-uniquely mapping reads are counted, and on the quality of the reference annotation used to interpret the mapped reads.
Commonly-used tools for mapping
Just like for any other bioinformatics task, there are numerous tools that we can use to map reads and then manipulate the results. However, if you look closely at a lot of read mapping and analysis software, much of it is built on one core package — bowtie — which represented a major speedup and step forward in mapping through its application of the Burrows-Wheeler Transform to the mapping process. Bowtie2 can be used directly or in conjunction with other packages that add on features and post-alignment analysis. And certain operations for manipulating the resulting sequence alignments, which are standardized into SAM and BAM formats, are best done using the samtools package. So today we will learn:
- Mapping reads with bowtie2
- Converting a SAM into a BAM (binary, smaller file, same information) using samtools
- Indexing a BAM using samtools
- Loading a BAM and corresponding reference genome into a genome browser
Mapping data formats
SAM is the canonical alignment file format. A SAM record contains the following information:
First there is a header that tells you what’s what. This is an unsorted SAM file, against the reference sequence NC_007898.3, which is 155461 nucleotides long. It was produced by bowtie2 Version 2.2.2 and you can even see the complete command line that produced it.
Then there are individual sequence records. They start with the query sequence identifier, then the bitwise flag (which encodes characteristics of the sequence such as orientation, pairedness, whether it passed filtering, etc), the reference sequence identifier, the leftmost mapping position, the mapping quality, the CIGAR string (which describes how the alignment relates to the reference), the reference name and position of the mate or next read, and the observed length, sequence, and quality string describing the template. Gory details of bitwise and CIGAR encoding can be looked up in the SAM specification linked above. You won’t need to know those unless you’re making software that will parse them.
And now, your mission
Using the chloroplast sequences from SRA, along with the chloroplast reference fasta sequence, map your reads to the reference and try out some basic manipulations.
If you don’t have all of these on your computer already, remember you can get them by running fastq-dump <accession> (that’s the number starting with SRR in the list).
First, you’ll need to build a bowtie2 index of the reference genome. Comparing your reads to a BW index rather than to the whole genome is what allows bowtie to be so fast. The reference chloroplast genome is in the Dropbox or you can get it from NCBI using Genbank ID NC007898.3.
bowtie2-build indexbasename.fasta indexbasename
This should produce several files with extension .bt2 in your working directory.
If you build the bowtie2 index from the reference genome file, this is what the command would look like:
bowtie2-build NC_007898.fasta NC_007898
This command should produce this output, including file sizes:
Then run bowtie2:
bowtie2 -x indexbasename -U reads.fastq -S reads.sam
Like all of our other command line programs, running bowtie2 -help will give you the rundown on what the various options mean. Here, we’re only telling it to run with an index called ‘indexfilebasename’, one file of (-U) unpaired reads, and to write output to a (-S) SAM called reads.sam.
Before you input them into this step, your reads should typically be trimmed, because having incorrect sequence attached to them may keep them from mapping in the right place.
I started with the BC26 chloroplast file from SRA. Get it using fastq-dump:
Move it to a more informative filename:
mv SRR1763773.fastq BC26.fastq
Then you can run bowtie2:
bowtie2 -x NC_007898 -U BC26.trim.fastq -S BC26.sam
You are likely to find on the first try that mapping percentages are quite low on some of these reads — 3 of the 12 fastqs actually have mapping percentages around 30%. The Ion Torrent platform was notorious for low data quality, and the machine we had available in the department seemed to be especially bad. As we continue with the upcoming labs, you will get to experience mapping Illumina data, which is much higher quality, and you’ll see much better mapping percentages with those data sets.
However, using a sliding window trim in trimmomatic improved the mapping for all files, albeit only by a few percent. Some students also found that they were able to use FastX or cutadapt to remove these adapters, resulting in somewhat better outcomes than they got from Trimmomatic. If you want to try this, use the standard Ion Torrent adapter sequences shown below.
> A1 5'
> TRP1 3'