Mapping short sequence reads to a reference sequence is a common task in genomics. Many different results can be extracted from a mapped sequence, depending on the original experimental design that produced the sequence reads and on the analysis that follows the mapping. For example:
- a genomic consensus for an individual (against the reference genome for that species)
- location of SNPs and other variations in one genome relative to the other
- location of expressed transcripts (coding mRNAs, noncoding RNAs such as miRNAs and lncRNAs)
- quantity of expressed transcripts (for differential expression analysis)
- location of protein binding sites detected by assays such as ChIP-seq
However, all of these analyses ultimately come back to mapping, and their outcomes depend on the accuracy of the mapping process, on how non-uniquely mapping reads are counted, and on the quality of the reference annotation used to interpret the mapped reads.
Commonly-used tools for mapping
Just like for any other bioinformatics task, there are numerous tools that we can use to map reads and then manipulate the results. However, if you look closely at a lot of read mapping and analysis software, much of it is built on one core package — bowtie — which represented a major speedup and step forward in mapping through its application of the Burrows-Wheeler Transform to the mapping process. Bowtie2 can be used directly or in conjunction with other packages that add on features and post-alignment analysis. And certain operations for manipulating the resulting sequence alignments, which are standardized into SAM and BAM formats, are best done using the samtools package. So today we will learn:
- Mapping reads with bowtie2
- Converting a SAM into a BAM (binary, smaller file, same information) using samtools
- Indexing a BAM using samtools
- Loading a BAM and corresponding reference genome into a genome browser
- Producing a consensus sequence from a BAM file
Mapping data formats
SAM is the canonical alignment file format. A SAM record contains the following information:
First there is a header that tells you what’s what. This is an unsorted SAM file, against the reference sequence NC_007898.3, which is 155461 nucleotides long. It was produced by bowtie2 Version 2.2.2 and you can even see the complete command line that produced it.
Then there are individual sequence records. They start with the query sequence identifier, then the bitwise flag (which encodes characteristics of the sequence such as orientation, pairedness, whether it passed filtering, etc), the reference sequence identifier, the leftmost mapping position, the mapping quality, the CIGAR string (which describes how the alignment relates to the reference), the reference name and position of the mate or next read, and the observed length, sequence, and quality string describing the template. Gory details of bitwise and CIGAR encoding can be looked up in the SAM specification linked above. You won’t need to know those unless you’re making software that will parse them.
And now, your mission
Using the chloroplast sequences in the Dropbox, along with the chloroplast reference fasta sequence, map your reads to the reference and try out some basic manipulations.
First, you’ll need to build a bowtie2 index of the reference genome. Comparing your reads to a BW index rather than to the whole genome is what allows bowtie to be so fast. Both bowtie2 and bowtie are on the computers in the lab; use the bowtie2 versions of everything and don’t confuse them, because bowtie indices are not compatible with bowtie2:
bowtie2-build indexbasename.fasta indexbasename
This should produce several files with extension .bt2 in your working directory.
Then run bowtie2:
bowtie2 -x indexbasename -U reads.fastq -S reads.sam
Like all of our other command line programs, running bowtie2 -help will give you the rundown on what the various options mean. Here, we’re only telling it to run with an index called ‘indexfilebasename’, one file of (-U) unpaired reads, and to write output to a (-S) SAM called reads.sam.
Of course before you input them into this step, your reads should be trimmed, because having incorrect sequence attached to them may keep them from mapping in the right place.
SAM conversion and sorting
Next, you should convert your SAM file to a BAM file. SAM files are huge, and converting them to BAMs will save you some space. Also, many downstream programs operate on BAM files. To convert a SAM to a BAM, you simply use the view subcommand in samtools:
samtools view -bS input.sam output.bam
A lot of applications (like browser viewing) expect the reads in the BAM file to be sorted, that is, the reads are ordered from “left” to “right” across the reference genome just as you’d read across a line in a book. There are other ways to sort reads, but the default samtools sort is the simple coordinate-based sort. To sort a BAM, you use the sort subcommand:
samtools sort input.bam output-srt.bam
samtools commands are capable of streaming, so if you already know how to use UNIX pipes, you could pipe those two commands together and save yourself creating an intermediate BAM:
samtools view -uS input.sam | samtools sort - output-srt.bam
Find the IGV or IGB browser on your system and try to load the sorted BAM file into it. What does the browser need to open up this file?
Create an index from your BAM with the samtools index subcommand:
samtools index in.bam
This will produce an index for your BAM named in.bai. Now try again to open your file in the genome browser. There’s not a ton to see there, but you should be able to get an idea whether there are any big, gaping holes in your genome coverage on this particular chloroplast data set, for instance.
Creating a consensus sequence
Finally, from your sorted BAM, you can produce a pileup and convert it to a consensus fastq:
samtools mpileup -uf reference.genome.fa infile.bam | bcftools view -cg - | vcfutils.pl vcf2fq > output.fq
The actual steps that are happening here are: making the pileup with samtools, using bcftools to convert the pileup into a vcf file, and then transforming the vcf into a fastq. What does that mean? You’re converting the read information from your BAM into, essentially, a multiple alignment to the reference, and then using the basic variant calling tools to encode information about SNPs and indels relative to the reference, and then squashing that information down into a fastq which is a single sequence with quality scores. We’ll learn about how to look at the variants and more ways to get information out of them in a future lab.
Finally, convert that fastq into a fasta sequence file:
~/Applications/seqtk/seqtk seq -a in.fq > out.fa
You can actually open that FASTA file to see how many sequences your consensus is broken into — they’re essentially contigs derived from a reference-based assembly rather than de novo assembly. Compare this to what you got for the same chloroplast data out of last week’s de novo assembly.