Read mapping and simple variants
Mapping short sequence reads to a reference sequence is a common task in genomics. Many different results can be extracted from a mapped sequence, depending on the original experimental design that produced the sequence reads and on the analysis that follows the mapping. For example:
- a genomic consensus for an individual (against the reference genome for that species)
- location of SNPs and other variations in one genome relative to the other
- location of expressed transcripts (coding mRNAs, noncoding RNAs such as miRNAs and lncRNAs)
- quantity of expressed transcripts (for differential expression analysis)
- location of protein binding sites detected by assays such as ChIP-seq
However, all of these analyses ultimately come back to mapping, and their outcomes depend on the accuracy of the mapping process, on how non-uniquely mapping reads are counted, and on the quality of the reference annotation used to interpret the mapped reads.
Commonly-used tools for mapping
Just like for any other bioinformatics task, there are numerous tools that we can use to map reads and then manipulate the results. However, if you look closely at a lot of read mapping and analysis software, much of it is built on one core package — bowtie — which represented a major speedup and step forward in mapping through its application of the Burrows-Wheeler Transform to the mapping process. Bowtie2 can be used directly or in conjunction with other packages that add on features and post-alignment analysis. And certain operations for manipulating the resulting sequence alignments, which are standardized into SAM and BAM formats, are best done using the samtools package. So today we will learn:
- Mapping reads with bowtie2
- Converting a SAM into a BAM (binary, smaller file, same information) using samtools
- Indexing a BAM using samtools
- Loading a BAM and corresponding reference genome into a genome browser
Mapping data formats
SAM is the canonical alignment file format. A SAM record contains the following information:
First there is a header that tells you what’s what. This is an unsorted SAM file, against the reference sequence NC_007898.3, which is 155461 nucleotides long. It was produced by bowtie2 Version 2.2.2 and you can even see the complete command line that produced it.
Then there are individual sequence records. They start with the query sequence identifier, then the bitwise flag (which encodes characteristics of the sequence such as orientation, pairedness, whether it passed filtering, etc), the reference sequence identifier, the leftmost mapping position, the mapping quality, the CIGAR string (which describes how the alignment relates to the reference), the reference name and position of the mate or next read, and the observed length, sequence, and quality string describing the template. Gory details of bitwise and CIGAR encoding can be looked up in the SAM specification linked above. You won’t need to know those unless you’re making software that will parse them.
And now, your mission
Using the chloroplast sequences from SRA, along with the chloroplast reference fasta sequence, map your reads to the reference and try out some basic manipulations.
If you don’t have all of these on your computer already, remember you can get them by running fastq-dump <accession> (that’s the number starting with SRR in the list).
Indexing
First, you’ll need to build a bowtie2 index of the reference genome. Comparing your reads to a BW index rather than to the whole genome is what allows bowtie to be so fast. The reference chloroplast genome is in the Dropbox or you can get it from NCBI using Genbank ID NC007898.3.
bowtie2-build indexbasename.fasta indexbasename
This should produce several files with extension .bt2 in your working directory.
If you build the bowtie2 index from the reference genome file, this is what the command would look like:
bowtie2-build NC_007898.fasta NC_007898
This command should produce this output, including file sizes:
Mapping
Then run bowtie2:
bowtie2 -x indexbasename -U reads.fastq -S reads.sam
Like all of our other command line programs, running bowtie2 -help will give you the rundown on what the various options mean. Here, we’re only telling it to run with an index called ‘indexfilebasename’, one file of (-U) unpaired reads, and to write output to a (-S) SAM called reads.sam.
Before you input them into this step, your reads should typically be trimmed, because having incorrect sequence attached to them may keep them from mapping in the right place.
I started with the BC26 chloroplast file from SRA. Get it using fastq-dump:
fastq-dump SRR1763773
Move it to a more informative filename:
mv SRR1763773.fastq BC26.fastq
I did a trial run and got only 30% of reads mapping. Ideally we’d use a trimming program to remove the Ion Torrent adapters from these reads. The FASTX programs would work to do this, or cutadapt. You might even be able to talk Trimmomatic into doing it if you give it the right sequences:
> A1 5'
CCTCTCTATGGGCAGTCGGTGAT
> TRP1 3'
CCATATCATCCCTGCGTGTCTCCCACTCAG)
Then I ran bowtie2:
bowtie2 -x NC_007898 -U BC26.trim.fastq -S BC26.sam
SAM conversion and sorting
Next, you should convert your SAM file to a BAM file. SAM files are huge, and converting them to BAMs will save you some space. Also, many downstream programs operate on BAM files. To convert a SAM to a BAM, you simply use the view subcommand in samtools:
samtools view -b BC26.sam > BC26.bam
A lot of applications (like browser viewing) expect the reads in the BAM file to be sorted, that is, the reads are ordered from “left” to “right” across the reference genome just as you’d read across a line in a book. There are other ways to sort reads, but the default samtools sort is the simple coordinate-based sort. To sort a BAM, you use the sort subcommand:
samtools sort input.bam output-srt.bam
samtools commands are capable of streaming, so if you already know how to use UNIX pipes, you could pipe those two commands together and save yourself creating an intermediate BAM:
samtools view -uS input.sam | samtools sort - -o output-srt.bam
The extra – in that line is using streaming standard input from the first half of the pipe as the input to samtools sort. Find the IGV or IGB browser on your system and try to load the sorted BAM file into it. What does the browser say it needs to open up this file?
BAM indexing
Create an index from your BAM with the samtools index subcommand:
samtools index in.bam
This will produce an index for your BAM named in.bai. Now try again to open your file in the genome browser. There’s not a ton to see there, but you should be able to get an idea whether there are any big, gaping holes in your genome coverage on this particular chloroplast data set, for instance.
samtools index BC26.sort.bam
At the end of this sequence of commands I had the following files for BC26, including intermediate results and the index. The index should be small. It’s cool:
I was able to load this all into IGV with no difficulties. You “Load Genome From File” (in the Genomes menu) NC_007898.fasta first, then “Load From File” (in the File menu) NC_007898.gff (if you want to see the genes) and BC26.sort.bam. This is what it will look like when you zoom in sufficiently.
Now to get the pileup and vcf, these steps worked with the sorted bam. There are currently 2 versions of mpileup, it is in both bcftools and samtools. They are supposed to be backwards compatible but I switched to using the newer bcftools version just in case. So I could get a look at what was being produced, I added a -v in there to produce a VCF rather than a BCF as intermediate output, and the VCF was 112 lines, which seems reasonable based on prior analysis of the data. There’s not a huge # of variants in the chloroplast:
bcftools mpileup -f NC_007898.fasta BC26.sort.bam | bcftools call -c -v > BC26.sort.vcf
You’ll be able to load this VCF file along with your other files into IGV in a 2-step process. First, go to the Tools menu and select igvtools and then Index. Your input file is your VCF. Once indexed, the VCF will display in the browser view. You can index the 12 different VCF files from your different samples to the same reference and display them together.
You have the choice to either install bowtie2, samtools, and bcftools on your own computer, via homebrew, or to use them on the cluster. Each of these packages are well supported in homebrew and install easily. The alignments for this small dataset are small jobs and should run pretty quickly.
Creating a consensus sequence
Finally, from your sorted BAM, you can produce a pileup and convert it to a vcf:
bcftools mpileup -uf reference.genome.fasta infile.bam | bcftools call -c -v - > output.vcf
The actual steps that are happening here are: making the pileup with samtools, and using bcftools to convert the pileup into a vcf file. You’re converting the read information from your BAM into, essentially, a multiple alignment to the reference, and then using the basic variant calling tools to encode information about SNPs and indels relative to the reference.
Automating this for 12 genomes
Only copy from this if you understand what it is and what it does. If you don’t, you can dig back in and read my old blog posts from the BINF 6215 summer pipeline bootcamp. This is my tested bash script to iterate over all of the *.fastq files in the current working directory. This works with the versions that you can install via homebrew. You need all software installed on your local machine, and NC_007898.fasta and NC_007898.gff to be present for this to work as advertised.
#! /bin/bash # call this script with one parameter; reference genome optional # 1 - reference genome basename bowtie2-build $1.fasta $1 for f in *.fastq; do echo "File -> $f" b=`echo $f | sed 's/\.fastq$//'` echo "Working on -> $b" bowtie2 -x $1 -U $b.fastq -S $b.sam samtools view -uS $b.sam | samtools sort - -o $b.bam samtools index $b.bam bcftools mpileup -uf $1.fasta $b.bam | bcftools call -c -v - > $b.vcf done
You could totally convert this into a script that would run on the cluster with a few module loads and an appropriate header.
Examining chloroplast variation
From a biological perspective this data is interesting because it represents organelle genomes from a closely related set of tomato cultivars. While all of these cultivars are the same species, tomato varietals as different as the German Cherry and the Cherokee Purple have obviously undergone some divergence. Here, you are trying to see if that divergence is represented in the chloroplast sequence.
Your goal in this exercise is to find out:
- How does the genomic divergence between these cultivars show up in terms of individual sequence variations in the chloroplast genome?
- Are variations between the genomes primarily SNPs, or small insertions or deletions? We’ll go over the VCF file format in class so you can see how to tell.
- How are the individual SNPs and/or indels dispersed in the chloroplast genome? Do they disproportionately affect particular genes or regions?
- Can you see any evidence of heterogeneity in the chloroplast genome sequence in these samples (i.e. more than one variant at the same site, with respect to the reference genome?)
Use the sequencing data for all 12 samples to answer this question.
In your report
Describe the workflow of tools (most of which you have already learned) that you will need to carry out this project, starting from the beginning (assuming you have raw data and need to start from square one). For these data, it may be worth exploring the results of being more strict about quality score cutoffs than you were when the goal was mapping and creating a consensus. But there will also be a trade off with throwing away too much of a data set that is already rather small.
You may need to look at manual pages for some tools that we have already learned in order to tweak how you use them to get the right answers. Be thorough in your description of methods, so that another person could reproduce your work exactly.
Use genome browser plots, tables, or other figures to summarize your findings as appropriate.
The bcftools call parameter settings are where you should look if you want to change how many alternate alleles you think are possible (which might be helpful in detecting multiple forms of the chloroplast genome). The current version of bcftools has two calling methods — the multiallelic caller and the standard caller (bedtools call -c). We are looking, of course, at data from a haploid genome and comparing it to a close reference. Thus you might expect that only one variant would exist in the sample and you do not need to use the multiallelic caller. There is some evidence that chloroplast sequences from a single plant may show chimerism (i.e. chloroplasts in different cells will have slightly different sequences, just like human cells at different places in the body can show chimerism due to different somatic mutations). If you used the multiallelic caller you might be able to see such phenomena if they occurred. You can set ploidy and choose which caller to use following the options in the current version of the documentation.
HINTS: If you remember from the bedtools lab a couple of weeks ago, a variant call file in VCF format can be overlapped with a GFF feature file for the same reference genome, and you can make the overlaps show which genes have variants.