Calling variants from mixed viral samples

Calling variants from mixed viral samples

This is just an example of some bioinformatics work I’m currently doing, not a homework assignment.

I learned to do something new yesterday and so I thought I’d put up a tutorial about it since this illustrates a few software installation, simple pipelining and scripting tasks that we commonly do in the class.

Where’s the data from? We collect samples of wastewater and sequence the viral fragments out of them. Often the viral genome sequence we get is not complete, but it still can contain useful information about which SARS-CoV-2 variants are present. I have sequences from 46 wastewater samples that were collected right before the time that we found the first omicron variant sequence in North Carolina right here on the UNC Charlotte campus. I wanted to see if we could detect the omicron carrier’s presence in building wastewater, and so we sequenced these wastewater samples using the Oxford Nanopore platform.

The program I want to use to analyze the mixed viral sequence samples is called Freyja. It works on a pretty simple principle, by matching the variants detected in your sample to a collection of patterns that represent the changes to the viral genome that are characterized by the different variants of concern, and “demixing” them from each other to produce strain abundances, using an analysis that also takes into account the depth of coverage on each region of the genome. It sounds like a reasonable approach and it’s from a research group that knows how to make good software, so we’re going to trust it to give us a decent first look at the data.

We’re in conda-first mode, so I tried the conda install first, and it was straightforward and worked exactly as advertised on the github page.

conda create -n freyja-env
conda activate freyja-env
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install freyja

On the computer I was installing it on, my conda was not up to date, and I had somehow gotten my environment in a state where I was not the owner of my conda environments.txt file, so before all that would work, I also had to do this. You would have to do this if you got the error that you did not have permission to write to the environments.txt file. This is a fairly common conda weirdness on Macs that i have to fix from time to time:

sudo conda update -n base -c defaults conda
sudo chown cgibas /Users/cgibas/.conda/environments.txt

This was a case where I did not just install the new software in my base conda environment. I created a special environment for Freyja so that it can have what it needs installed and configured in an internally consistent way, and I installed it inside that environment. If we’re used to moving from our base shell environment on the local computer to a shell on an external computer, switching to a different conda environment is almost like entering a new shell with specific parameters that are different from your base environment while you’re still on the same computer. You’re somewhere else now. You can’t run Freyja from your base environment, you can only run it when you’ve activated your freyja-env environment.

Once I got this set up, I could just run Freyja’s two key commands on a .bam file that I already had (how we get that .bam file is a story for another day, but it’s a midpoint output in the ARTIC bioinformatics pipeline). A .bam file is just a file of next gen sequencing reads that have been positioned relative to a reference genome using a sequence aligner. Freyja’s documentation wants you to run different alignment and primer trimming steps than the ARTIC pipeline uses, but what’s important for now is that we feed in a .bam file that has had primer trimming done on it, not exactly how it was done. I wanted to see if I could just take the pipeline that was optimized for Oxford Nanopore data and use part of it to feed into Freyja rather than reinventing the whole thing.

So as an example, I have a .bam file named 120121-60.primertrimmed.rg.sorted.bam. I run these two steps to call the variants and depths in Frejya and perform the demixing step.

freyja variants 120121-60.primertrimmed.rg.sorted.bam --variants 120121-60.variants --depths 120121-60.depths.out --ref MN908947.3.fa
freyja demix 120121-60.variants.tsv 120121-60.depths.out --output 120121-60.demix.out

I got the following demixed output:

summarized	[('Omicron', 0.9723250000049135), ('Other', 0.0062643999846659386)]
lineages	['B.1.1.529' 'B.1.1.239' 'B.1.250' 'B.1.627' 'B.1.1.182']
abundances	[0.972325   0.00311042 0.00108705 0.00106476 0.00100217]
resid	13.429032467235544

There’s pretty definitely omicron in this sample, in fact it is the only variant of the virus that’s present in any meaningful amount.

Note: the one trick that I had to do to convince the program to run was to edit the first line (the fasta header line) of the reference file, MN908947.3.fa, so that it contained only the following. It did not work in the form downloaded directly from GenBank where that line contained more database cross-references and other text. Before I did this it gave me a whole lot of “sequence not found” errors.


So now I could calculate this for any number of bam files that I like. The problem is that I have 46 files I want to do this to and I don’t really want to type in those command lines over and over and process the files one by one. So I need to write a shell script. I went over how to write a simple iterative loop to run the same commands on a directory full of files here, so we’re just going to follow that example without a whole lot of explanation.

#!/usr/bin/env bash

mkdir demixed

for f in *.bam; do
	b=`echo $f | sed 's/\.primertrimmed\.rg\.sorted\.bam//'`
	freyja variants $b.primertrimmed.rg.sorted.bam --variants $b.variants --depths $b.depths.out --ref MN908947.3.fa
	freyja demix $b.variants.tsv $b.depths.out --output demixed/$b.demix.out

In brief:

Line 1 sets the shell environment

Line 2 makes a directory for the demixed output to go to, because Freyja has some subsequent steps that operate on whole directories

Line 3 sets up the do loop over all your files

Line 4 takes the identifying part of the filename to use as basename for output files

Line 5 runs the freyja variants command on the current file, and names corresponding output

Line 6 runs the freyja demix command on the current file, and puts corresponding named output in the demixed directory

Line 7 ends the do loop

The way I’ve set things up in this hastily constructed little script, all my demixed outputs end up in the demixed directory in my current working directory. Which is good, because we want Freyja to plot them, next.

Here we see the basic commands to aggregate the contents of the demixed directory and plot the demixed outputs created in the previous step:

freyja aggregate demixed/ --output omicron-ww-demixed.tsv
freyja plot omicron-ww-demixed.tsv --output omicron-ww-demixed.png

On the first line, it turned out to be critical to have the ‘/’ following the directory name (demixed/ instead of just demixed). I actually had to open an issue on the Freyja GitHub because the ‘/’ wasn’t shown as part of the command in the examples, but not having it caused some strange behavior when the program was run.

The plot produced at this step looks like this:

Each sample is shown split by which variants it contains. In these samples taken as students returned to campus from Thanksgiving, it appears that omicron has already shown up in the wastewater, even though it did not really start to show up in many positive tests on campus until a week later. This would be consistent with observations in other cities, and it could be possible that we had omicron circulating for a short time without detection at the testing center. We no longer require COVID tests from vaccinated people with no symptoms, even when their residence is positive for COVID-19.

Comments are closed.