Microbial community analysis with QIIME2
This tutorial makes use of the data from the NC Urban Microbiome Project, a collaboration seeded by the Department of Bioinformatics and Genomics and involving participants from our department as well as Civil Engineering, Biology, and Geography and Earth Science.
Our goal in lab this week is to analyze 16S ribosomal RNA sequences from mixed microbial samples using QIIME2. The QIIME2 analysis will tell us what identifiable microbes are present in the samples (usually at the genus level rather than specific species or strains) and also allow us to perform some high level comparisons between samples.
The samples that you have will come from four locations throughout the Charlotte-Mecklenburg wastewater treatment system. All samples were collected on the same day.
- RES (residential sewage) 3x
- UPA (stream background upstream of treatment plant) 3x
- ATE (aeration tank effluent — inside the treatment plant) 3x
- DSA (stream, downstream of treatment plant) 3x
Installation and setup
QIIME2 is on the cluster but you can also do this tutorial on a laptop. Or you can run the computation heavy denoising/clustering step on the cluster (takes about 9 hrs) and do the rest of the fast steps locally.
Installing QIIME2 is a little involved, and has many options. There is a prototype GUI (which does not have all features implemented yet); there’s an Artifact API. There’s also a basic nuts and bolts command line interface, which is what we really want. You can install QIIME2 either using a Docker image (if you know how to do that) or natively, through the conda package manager. (What? Another package manager? Yes. Easy to use on Mac.)
Then to get QIIME 2 (first two steps are only if you already had conda/did not have wget):
conda update conda
conda install wget
wget https://data.qiime2.org/distro/core/qiime2-2019.1-py36-osx-conda.yml conda env create -n qiime2-2019.1 --file qiime2-2019.1-py36-osx-conda.yml
To run QIIME2, you will need to activate the environment:
source activate qiime2-2019.1
qiime2 --help
If after the “source” command, you get an error that says your shell has not been configured to do this, you may also need to type:
conda init bash
Pre-processing of reads
Normally, you’d need to trim these reads. We did this for you in advance, using trimmomatic, with the following parameters:
ILLUMINACLIP:$HOME/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:20 MINLEN:50
Starting QIIME
Starting QIIME is going to put you into a new command line environment. You will know that you are in the QIIME environment because your command line prompt will change to have “(qiime2-2019.1)” in front of it. Download a zip file containing the data here.
Here’s how I set my workspace up. I have a directory called /Users/cgibas/sandbox/qiime-urbanmicro on my laptop. In that directory, I have a directory called 16s that contains the original fastq files from the zip archive. I am doing everything from the qiime-urbanmicro directory and have set up files and paths as if that is the case. Obviously you need to pay attention to file locations and file paths if this exercise is going to work!!!
Creating a manifest file and a QIIME object
The first thing that you will need to do is to import your files into QIIME. This is done using the import tool, and you will need to prepare a manifest file which tells QIIME where the individual files are and which sample they are part of. Below is what the first few lines of my manifest look like. I named the samples UPA1, UPA2, UPA3 etc so that the sample ID would tell me both which location and which replicate I am looking at. There will be an opportunity to attach better metadata to this later. Note: this file (and all QIIME input) is picky and must be exactly right. So don’t make typos. Ask me how I know this. 🙂 You should be able to see the location code (UPA = upstream, RES = residential sewage, ATE = aeration tank effluent, DSA = downstream) in the middle of the filename, followed by the sample number (64, 65 etc).
# paired-end PHRED 33 fastq manifest file
sample-id,absolute-filepath,direction
UPA1,$PWD/16s/28APR2016Run_Sample_UPA_64_TAGACA-CACGTA_L006_R1_001-paired.fastq.gz,forward
UPA1,$PWD/16s/28APR2016Run_Sample_UPA_64_TAGACA-CACGTA_L006_R2_001-paired.fastq.gz,reverse
UPA2,$PWD/16s/28APR2016Run_Sample_UPA_65_TAGACA-CGTACG_L006_R1_001-paired.fastq.gz,forward
UPA2,$PWD/16s/28APR2016Run_Sample_UPA_65_TAGACA-CGTACG_L006_R2_001-paired.fastq.gz,reverse
Once you have built your manifest file, you should be able to import it with the following command. Note: this is one single command line, but here is a nicer way to write it, so that you can see which options were used on a complex command line. You can write it this way in a PBS script for the cluster, too.
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path UMtutorial-manifest.csv \
--output-path UMtutorial-demux-paired-end.qza \
--input-format PairedEndFastqManifestPhred33
Summarize and denoise your data
The QIIME website has a flowchart visualization of what happens in this group of steps in your QIIME analysis. We started at the third column in the flow chart, reading in paired end sequences with quality. The input sequences we gave you were pre-trimmed outside QIIME using Trimmomatic, so we’re skipping over the 4th column and heading straight for denoising. The variant of the denoising step we need is denoise-paired. We’re going to use DADA2 because that’s the best approach available right now. In real life you don’t really need the trimmomatic trim before the denoising step, at least with DADA2, but we’ll roll with the sequences we’ve got for now.
First, you can make a high-level summary of your sequence quality using the qiime demux summarize command.
qiime demux summarize \
--i-data UMtutorial-demux-paired-end.qza \
--o-visualization UMtutorial-demux-paired-end.qzv \
--p-n 100000
This will produce a QIIME visualization of your data quality and tabular summary of the sequences in your dataset, which you can view by dragging and dropping it at the q2view website: https://view.qiime2.org/ . (Any time you produce a *.qzv file in this tutorial, you can view it in q2view.) Examine the summary and see if you think any of the samples might be problematic, and why. Discuss that in your writeup.
Next, use the DADA2 module in QIIME to denoise your data, as we covered in class on Wednesday. This is just the denoising step at this point. Warning — this step is a long step.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs UMtutorial-demux-paired-end.qza \
--o-table UMtable-dada2O \
--o-representative-sequences UMrep-seqs-dada2O \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 118 \
--p-trunc-len-r 118 \
--p-n-threads
--o-denoising-stats UMdenoising-stats.qza
Make a mapping file
Your mapping file will need the unique code for each sample in the first column. The remainder of the metadata is up to you. I would put some information in there that connects the actual original fastq files to the sample. You also need some useful categories for your data. I called my four sample types clean (upstream water), pre (residential influent, PRE treatment), wwtp (the aeration tank location inside the plant) and post (downstream water). You can have as many of these kinds of labels as the data requires, and it will help you to visualize your data in different ways later.
Sequence clustering
The very simplest way to deal with your denoised data is to use a dereplicating step and produce 100% OTUs (i.e. ASVs, with a sequence for every variant in the dataset that survived denoising and error correction). Because the DADA2 denoising includes its own chimera removal step and dereplication, you can proceed with the feature table and representative sequences directly out of DADA2 to downstream analysis. You can read more about this in the QIIME2 pipeline overview document here.
Summarizing your clusters
The next steps are to summarize your feature table and tabulate sequences. The feature table is the table that contains abundances for each feature (OTU or in the 100% case, ASV) and is a central artifact in QIIME, and the sequences are the actual DNA sequences that define each feature.
This is the first time you’ll actually use your mapping file, so it’s best to validate it using the Keemei plugin for Google Sheets. The one I had in the drive from before was a little broken. Here’s what a successful Keemei report looks like:
Fix the file so that QIIME will accept it. Note: it must be tab-separated, not comma separated, even though both will look alike to Keemei when imported into Google Sheets. Then run the following summary steps:
qiime feature-table summarize \
--i-table UMtable-dada2O.qza \
--o-visualization UMtable-dada2O.qzv \
--m-sample-metadata-file UMtutorial-mapping.tsv
qiime feature-table tabulate-seqs \
--i-data UMrep-seqs-dada2O.qza \
--o-visualization UMrep-seqsO.qzv
Generating a phylo tree for diversity analysis
You can now use a different QIIME module to build a phylogenetic tree that the program will refer to when calculating distances for diversity, if you are using a distance metric that takes phylogeny into account.
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences UMrep-seqs.qza \
--o-alignment UMaligned-rep-seqs.qza \
--o-masked-alignment UMmasked-aligned-rep-seqs.qza \
--o-tree UMunrooted-tree.qza \
--o-rooted-tree UMrooted-tree.qza
Calculating alpha diversity
QIIME2 has a variety of alpha and beta diversity metrics and other analyses that you can produce, and it’s set up with some “core” calculations that it will do relatively automatically. The core-metrics-phylogenetic module will analyze Shannon diversity and evenness as well as some qualitative metrics of alpha diversity, as well as Bray-Curtis and weighted (phylogenetic) and unweighted UniFrac analyses. It will automagically produce a bunch of PCoA plots and other stuff you might want. There’s a list in the QIIME2 Moving Pictures tutorial that shows you everything that will be produced and even links examples of what it should look like. You need to pick your sampling depth based on the feature table summary you generated above — open it in q2view, click the interactive sample detail tab, and move the slider until you can pick a value that maxes out the rarefaction depth while not throwing away too many samples. Per the QIIME2 manual: “If you provide --p-sampling-depth 500
, this step will subsample the counts in each sample without replacement so that each sample in the resulting table has a total count of 500. If the total count for any sample(s) are smaller than this value, those samples will be dropped from the diversity analysis.” You decide!
qiime diversity core-metrics-phylogenetic \
 --i-phylogeny UMrooted-tree.qza \
 --i-table UMtable-dada20.qza \
 --p-sampling-depth 500 \
 --m-metadata-file UMtutorial-mapping.tsv \
 --output-dir core-metrics-results
You can produce alpha diversity boxplots — a way of testing the association between alpha diversity and different metadata categories — by using a command like the following, you could do this based on the Shannon diversity, but you could also use the Faith or evenness metrics following a similar format. The metadata column where you should expect to see an obvious effect is ‘location’, the one that grouped the replicate groups into four locations.
qiime diversity alpha-group-significance \
--i-alpha-diversity core-metrics-results/shannon_vector.qza \
--m-metadata-file UMtutorial-mapping.tsv \
--o-visualization core-metrics-results/shannon-group-significance.qzv
Beta-diversity
We can use the diversity module to analyze beta-diversity as well. PCOA plots based on the four different beta-diversity metrics are contained in your core metrics results directory, and will be *emperor.qzv files. When loaded into q2view, they will show the PCOA plot and you can use the right side pulldown menus to color by location. It may be interesting to look at the impact of the different distance metrics on separation of results in the PCOA. You can also test beta-diversity between pairs of sites using a command like this one.
qiime diversity beta-group-significance \
--i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \
--m-metadata-file UMtutorial-mapping.tsv \
--m-metadata-column location \
--o-visualization core-metrics-results/unweighted-unifrac-location-significance.qzv \
--p-pairwise
You could also base this on weighted UniFrac, which is phylogenetically aware, but it may take longer, or the Jaccard distance which is qualitative and based on presence/absence. The above version will use a PERMANOVA test to perform pairwise comparison between all possible pairs of locations in the data set. There are four locations, so this may take some time. The most interesting pair to look at will be upstream (UPA) vs. downstream (DSA). The output is viewable in Q2view where you can look at the various barplots side by side and decide where you think the groups have significant differences.
Once you have completed these analyses, you’ll have a high level view of how the sites differ from each other. Look at the examples in the Moving Pictures tutorial and see if you can figure out how to do the following three things on your own:
Analyze taxonomy using the naive Bayesian classifier and produce a taxonomy bar plot. Look at a high level (e.g. lv. 2 which is phylum) and see if you can identify some groups that differ significantly inside (ATE) and outside (UPA, DSA) the water treatment plant, or between the treated (ATE) and untreated (RES) waste.
Use ANCOM to analyze differential abundance between sample groups. Look at a few taxa from each comparison and see if you can identify any interesting changes!