Read mapping and variant calling
Mapping short sequence reads to a reference sequence is a common task in genomics. Many different results can be extracted from a mapped sequence, depending on the original experimental design that produced the sequence reads and on the analysis that follows the mapping. For example:
- a genomic consensus for an individual (against the reference genome for that species)
- location of SNPs and other variations in one genome relative to the other
- location of expressed transcripts (coding mRNAs, noncoding RNAs such as miRNAs and lncRNAs)
- quantity of expressed transcripts (for differential expression analysis)
- location of protein binding sites detected by assays such as ChIP-seq
However, all of these analyses ultimately come back to mapping, and their outcomes depend on the accuracy of the mapping process, on how non-uniquely mapping reads are counted, and on the quality of the reference annotation used to interpret the mapped reads.
Commonly-used tools for mapping
Just like for any other bioinformatics task, there are numerous tools that we can use to map reads and then manipulate the results. However, if you look closely at a lot of read mapping and analysis software, much of it is built on one core package — bowtie — which represented a major speedup and step forward in mapping through its application of the Burrows-Wheeler Transform to the mapping process. Bowtie2 can be used directly or in conjunction with other packages that add on features and post-alignment analysis. And certain operations for manipulating the resulting sequence alignments, which are standardized into SAM and BAM formats, are best done using the samtools package. So today we will learn:
- Mapping reads with bowtie2
- Converting a SAM into a BAM (binary, smaller file, same information) using samtools
- Indexing a BAM using samtools
- Loading a BAM and corresponding reference genome into a genome browser
- Producing a consensus sequence from a BAM file
Mapping data formats
SAM is the canonical alignment file format. A SAM record contains the following information:
First there is a header that tells you what’s what. This is an unsorted SAM file, against the reference sequence NC_007898.3, which is 155461 nucleotides long. It was produced by bowtie2 Version 2.2.2 and you can even see the complete command line that produced it.
Then there are individual sequence records. They start with the query sequence identifier, then the bitwise flag (which encodes characteristics of the sequence such as orientation, pairedness, whether it passed filtering, etc), the reference sequence identifier, the leftmost mapping position, the mapping quality, the CIGAR string (which describes how the alignment relates to the reference), the reference name and position of the mate or next read, and the observed length, sequence, and quality string describing the template. Gory details of bitwise and CIGAR encoding can be looked up in the SAM specification linked above. You won’t need to know those unless you’re making software that will parse them.
And now, your mission
Using the chloroplast sequences from SRA, along with the chloroplast reference fasta sequence, map your reads to the reference and try out some basic manipulations.
If you don’t have all of these on your computer already, remember you can get them by running fastq-dump <accession> (that’s the number starting with SRR in the list).
You have the choice to either install bowtie2, samtools, bcftools and seqtk on your own computer, via homebrew, or to use them on the cluster. Each of these packages are well supported in homebrew and install easily. The alignments are small jobs and should run pretty quickly.
First, you’ll need to build a bowtie2 index of the reference genome. Comparing your reads to a BW index rather than to the whole genome is what allows bowtie to be so fast.
bowtie2-build indexbasename.fasta indexbasename
This should produce several files with extension .bt2 in your working directory.
Then run bowtie2:
bowtie2 -x indexbasename -U reads.fastq -S reads.sam
Like all of our other command line programs, running bowtie2 -help will give you the rundown on what the various options mean. Here, we’re only telling it to run with an index called ‘indexfilebasename’, one file of (-U) unpaired reads, and to write output to a (-S) SAM called reads.sam.
Before you input them into this step, your reads should be trimmed, because having incorrect sequence attached to them may keep them from mapping in the right place.
SAM conversion and sorting
Next, you should convert your SAM file to a BAM file. SAM files are huge, and converting them to BAMs will save you some space. Also, many downstream programs operate on BAM files. To convert a SAM to a BAM, you simply use the view subcommand in samtools:
samtools view -bS input.sam output.bam
A lot of applications (like browser viewing) expect the reads in the BAM file to be sorted, that is, the reads are ordered from “left” to “right” across the reference genome just as you’d read across a line in a book. There are other ways to sort reads, but the default samtools sort is the simple coordinate-based sort. To sort a BAM, you use the sort subcommand:
samtools sort input.bam output-srt.bam
samtools commands are capable of streaming, so if you already know how to use UNIX pipes, you could pipe those two commands together and save yourself creating an intermediate BAM:
samtools view -uS input.sam | samtools sort - -o output-srt.bam
Find the IGV or IGB browser on your system and try to load the sorted BAM file into it. What does the browser need to open up this file?
Create an index from your BAM with the samtools index subcommand:
samtools index in.bam
This will produce an index for your BAM named in.bai. Now try again to open your file in the genome browser. There’s not a ton to see there, but you should be able to get an idea whether there are any big, gaping holes in your genome coverage on this particular chloroplast data set, for instance.
Creating a consensus sequence
Finally, from your sorted BAM, you can produce a pileup and convert it to a consensus fastq:
samtools mpileup -uf reference.genome.fasta infile.bam | bcftools call -c - | bcftools consensus -f NC007898.fasta - > output.fasta
The actual steps that are happening here are: making the pileup with samtools, using bcftools to convert the pileup into a vcf file, and then transforming the vcf into a fastq. What does that mean? You’re converting the read information from your BAM into, essentially, a multiple alignment to the reference, and then using the basic variant calling tools to encode information about SNPs and indels relative to the reference, and then squashing that information down into a fasta with modifications based on the observed reads.
You can actually open that FASTA file to see how many sequences your consensus is broken into — they’re essentially contigs derived from a reference-based assembly rather than de novo assembly.
Automating this for 12 genomes
Only copy from this if you understand what it is and what it does. If you don’t, you can dig back in and read my old blog posts from the BINF 6215 summer pipeline bootcamp. This is my tested bash script to iterate over all of the *.fastq files in the current working directory. This works with the versions that you can install via homebrew. You need all software installed on your machine, and NC_007898.fasta and NC_007898.gff to be present for this to work as advertised.
#! /bin/bash # call this script with one parameter; reference genome optional # 1 - reference genome basename bowtie2-build $1.fasta $1 for f in *.fastq; do echo "File -> $f" b=`echo $f | sed 's/\.fastq$//'` echo "Working on -> $b" bowtie2 -x $1 -U $b.fastq -S $b.sam samtools view -uS $b.sam | samtools sort - -o $b.bam samtools index $b.bam samtools mpileup -uf $1.fasta $b.bam | bcftools call -c - | bcftools consensus -f NC007898.fasta - > $b.fasta done
Note: don’t forget what you learned in previous weeks — you’ll want to add a read QC (trimming) step in here before you run bowtie2.
Examining chloroplast variation
From a biological perspective this data is interesting because it represents organelle genomes from a closely related set of tomato cultivars. While all of these cultivars are the same species, tomatoes as different as the German Cherry and the Cherokee Purple have obviously undergone some divergence. Here, you are trying to see where that divergence is represented in the chloroplast sequence.
Your goal in this exercise is to find out:
- How does the genomic divergence between these cultivars show up in terms of individual sequence variations in the chloroplast genome?
- Are variations between the genomes primarily SNPs, or small insertions or deletions? We’ll go over the VCF file format in class so you can see how to tell.
- How are the individual SNPs and/or indels dispersed in the chloroplast genome? Do they disproportionately affect particular genes or regions?
- Can you see any evidence of heterogeneity in the chloroplast genome sequence in these samples (i.e. more than one variant at the same site, with respect to the reference genome?)
Use the sequencing data for all 12 samples to answer this question.
In your report
Describe the workflow of tools (most of which you have already learned) that you will need to carry out this project, starting from the beginning (assuming you have raw data and need to start from square one). For these data, it may be worth exploring the results of being more strict about quality score cutoffs than you were when the goal was mapping and creating a consensus. But there will also be a trade off with throwing away too much of a data set that is already rather small.
You may need to look at manual pages for some tools that we have already learned in order to tweak how you use them to get the right answers. Be thorough in your description of methods, so that another person could reproduce your work exactly.
Use genome browser plots, tables, or other figures to summarize your findings as appropriate.
One more thing to learn before you start
You’ve already learned most of the tools that you will need to use to complete this project. However, there’s one thing that we haven’t done yet. You need to run samtools mpileup on a bam file, the same thing we did as an intermediate step in the read mapping lab, and use bcftools to produce a variant call file, which you will use directly instead of proceeding to a consensus. This means following the approach in the read mapping lab, but stopping before you produce a consensus fasta with vcfutils.pl.
samtools mpileup -uf reference.genome.fa infile.bam | bcftools call -c filename.out
The bcftools call parameter settings are where you should look if you want to change how many alternate alleles you think are possible (which might be helpful in detecting multiple forms of the chloroplast genome). The current version of bcftools has two calling methods — the multiallelic caller and the standard caller (bedtools call -c). We are looking, of course, at data from a haploid genome and comparing it to a close reference. Thus you might expect that only one variant would exist in the sample and you do not need to use the multiallelic caller. There is some evidence that chloroplast sequences from a single plant may show chimerism (i.e. chloroplasts in different cells will have slightly different sequences, just like human cells at different places in the body can show chimerism due to different somatic mutations). If you used the multiallelic caller you might be able to see such phenomena if they occurred. You can set ploidy and choose which caller to use following the options in the current version of the documentation.
HINTS: If you remember from the bedtools lab a couple of weeks ago, a variant call file in VCF format can be overlapped with a GFF feature file for the same reference genome, and you can make the overlaps show which genes have variants. BCF or VCF files can also be loaded into the genome viewers (IGB, IGV) that you have tried out before.