Troubleshooting the read mapping lab

Troubleshooting the read mapping lab

Hey everyone! A few people are reporting some trouble in making this lab work so I updated all my software and went in. I am just going to walk through what worked with one sample.

I started with the BC26 chloroplast file from SRA. This is fine:

fastq-dump SRR1763773

Moved it to a shorter filename:

mv SRR1763773.fastq BC26.fastq

Built the bowtie2 index from the reference genome file. This is fine:

bowtie2-build NC_007898.fasta NC_007898

This command should produce this output, including file sizes:

I did a trial run and got only 30% of reads mapping. We’ll fix that later. But the mapping and variant commands themselves just needed a couple of minor tweaks.

First I ran bowtie2, no problems there:

bowtie2 -x NC_007898 -U BC26.trim.fastq -S BC26.sam

Then I tried samtools view. Had to make a minor adjustment to remove a deprecated parameter here. The new version is supposed to ignore -S but it threw an error so this is what I did instead:

samtools view -b BC26.sam > BC26.bam

samtools sort and samtools indexed worked as advertised:

samtools sort BC26.bam > BC26.sort.bam
samtools index BC26.sort.bam

At the end of this sequence of commands I had the following files for BC26, including intermediate results and the index. The index should be small. It’s cool:

I was able to load this all into IGV with no difficulties. You “Load Genome From File” (in the Genomes menu) NC_007898.fasta first, then “Load From File” (in the File menu) NC_007898.gff (if you want to see the genes) and BC26.sort.bam. This is what it will look like when you zoom in sufficiently.

Now to get the pileup and vcf, these steps worked with the sorted bam. There are currently 2 versions of mpileup, it is in both bcftools and samtools. They are supposed to be backwards compatible but I switched to using the newer bcftools version just in case. So I could get a look at what was being produced, I added a -v in there to produce a VCF rather than a BCF as intermediate output, and the VCF was 112 lines, which seems reasonable based on prior analysis of the data. There’s not a huge # of variants in the chloroplast:

bcftools mpileup -f NC_007898.fasta BC26.sort.bam | bcftools call -c -v > BC26.sort.vcf

You’ll be able to load this VCF file along with your other files into IGV in a 2-step process. First, go to the Tools menu and select igvtools and then Index. Your input file is your VCF. Once indexed, the VCF will display in the browser view. You can index the 12 different VCF files from your different samples to the same reference and display them together.

The consensus step as written in the lab was giving people trouble. My best diagnosis at this point is that the bcftools consensus caller has high standards for quality and so none of the called SNPs in our messy old data are ending up high enough quality, and the call file is empty. I’m still messing with it and I’ll let you know if there’s a solution. Check back for updates!

Comments are closed.