BINF 6203: 16S rRNA classification with QIIME
This tutorial makes use of the data from the NC Urban Microbiome Project, a collaboration seeded by the Department of Bioinformatics and Genomics and involving participants from our department as well as Civil Engineering, Biology, and Geography and Earth Science. Many thanks to Kevin Lambirth, who created the original version of this tutorial for our lab interns.
This tutorial uses QIIME 1.9 — there is a newer QIIME and we are all in the process of learning it, but it works enough differently that the commands here may not work right out of the box. Stick with QIIME 1.9 for now, and watch this space during summer for a QIIME 2.0 tutorial that you can try out if you want to learn the updates.
Our goal in lab this week is to analyze 16S ribosomal RNA sequences from mixed microbial samples using QIIME. The QIIME analysis will tell us what identifiable microbes are present in the samples (usually at the genus level rather than specific species or strains) and also allow us to perform some high level comparisons between samples.
The samples you have come from four locations throughout the Charlotte-Mecklenburg wastewater treatment system. All samples were collected on the same day.
- RES (residential sewage) 3x
- UPA (stream background upstream of treatment plant) 3x
- ATE (aeration tank effluent — inside the treatment plant) 3x
- DSA (stream, downstream of treatment plant) 3x
Installation and setup
QIIME is on the cluster but you can also do this tutorial on a laptop — the data set is small enough that it’s no trouble to run. Obviously if you’re doing this on the cluster you can skip the install.
Installing QIIME is quite involved. Make sure your OS is updated (to Sierra if you’re on OSX) and then “brew install” the following packages if you don’t have them already:
Then download MacQIIME:
tar xvf MacQIIME_1.9.1-20150604_OS10.7.tgz
cd MacQIIME_1.9.1-20150604_OS10.7
./install.sh
Download the RDP Gold database:
curl -O http://drive5.com/uchime/rdp_gold.fa
And add it to your working directory. You’ll also need to install a bunch of R packages.
With the install.packages("") command, install:
- devtools (if not already installed)
Then, source("http://bioconductor.org/biocLite.R") and install with biocLite:
Finally, devtools install the current dev version of the biom package using the following command:
Pre-processing of reads
Normally, you’d need to trim these reads. We did this for you in advance, using trimmomatic, with the following parameters:
ILLUMINACLIP:$HOME/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:3:20 MINLEN:50
QIIME also needs each read pair to be merged into a single sequence. There are a number of software packages that do this by assembling the overlapping forward and reverse reads of each pair into one longer fragment. QIIME has an internal script for this, but it is known to have some problems, so we recommend merging reads using the program PEAR. Other packages such as seqtk and khmer are also capable of handling paired reads cleanly.
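For intuition, here is a toy sketch of what overlap-based merging does: reverse-complement the reverse read, find the longest exact suffix/prefix overlap with the forward read, and join the two. Real tools like PEAR also weigh base qualities and mismatch statistics; the sequences below are made up.

```python
# Toy sketch of PEAR-style read merging (illustrative only).

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(fwd, rev, min_overlap=5):
    """Merge a read pair by exact-matching the longest suffix/prefix overlap."""
    rc = revcomp(rev)  # reverse read is sequenced from the other strand
    for olap in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        if fwd[-olap:] == rc[:olap]:
            return fwd + rc[olap:]
    return None  # no confident overlap found

merged = merge_pair("ACGTACGTGG", "AAAACCACGT")  # overlap of 6 bases
```

Here the two reads share a 6-base overlap, so the merged fragment is longer than either read alone.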
Files also need to be converted from FASTQ to FASTA — this can be done using FASTX-Toolkit or a number of other tools.
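The FASTQ-to-FASTA step is simple enough to sketch in a few lines: keep each record's header and sequence, drop the "+" separator and quality lines, and swap the "@" header prefix for ">".

```python
# Minimal FASTQ-to-FASTA conversion, in the spirit of FASTX-Toolkit's
# fastq_to_fasta: FASTQ records are 4 lines; FASTA keeps only 2 of them.

def fastq_to_fasta(fastq_lines):
    fasta = []
    for i in range(0, len(fastq_lines), 4):        # one record per 4 lines
        header, seq = fastq_lines[i], fastq_lines[i + 1]
        fasta.append(">" + header.lstrip("@"))     # FASTA headers use ">" not "@"
        fasta.append(seq)
    return fasta
```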
We have done these steps for you for the purposes of the tutorial.
Starting macqiime (just type “macqiime” at the command line) is going to put you into a new command line environment. You will know that you are in the QIIME environment because your command line prompt will change to have “MacQIIME” in front of it.
Note: sometimes cutting and pasting commands at this command line, rather than typing them, will cause errors if there are hidden spaces or characters, so type ’em in. You can see all the options for any of the QIIME scripts by typing scriptname.py --help. This will help you figure out what went wrong if you happen to get an error.
Once you’re in the QIIME environment, run QIIME’s built-in configuration test:
print_qiime_config.py -t
to be sure the base install has everything it needs. If it says “OK”, you can proceed.
Here’s how I set my workspace up. I have a directory called /Users/cgibas/Documents/DATA on my laptop. In that directory, I unzipped the tutorial.zip file and renamed the directory UEGP_16S_Tutorial_Files. Inside that directory, I moved the *.fasta files into a directory simply called fastas. I also saved a copy of the mapping file template as UEGP_Tutorial_Map.txt. Then I edited that file in Excel.
Creating a Sample Mapping File
QIIME needs to know some information about your samples, including their barcodes and primers (if data has not been demultiplexed) and any metadata and description that you want to associate with each sample. There is an example mapping file in your MacQIIME directory that you can use as a template, which can be found in the tutorial subdirectory.
If you view the file in TextWrangler or another text viewer, you’ll see that it has two descriptive commented rows starting with #, followed by 9 rows of delimited data. To edit the file, it’s easiest to just import it into Excel as a tab-delimited file.
The first three columns and the last column have specific purposes. In between them can go any number of columns of metadata, that is, information about the samples that you have collected.
Your mapping file will need some kind of code for each sample in the first column. The second and third columns must be present, but in our case they will be empty: our data is already demultiplexed, so you will not need to detect barcodes or primers. Your metadata should include the information I gave you about where each sample came from. I made optional columns for location and replicate, and a column for a unique identifier. We have found that it works well to use the unique filename for each sample as the unique identifier, even though the filenames are long. Your final column must be a “Description” column, but it can be empty. So I made a file that looks like this (example shows header and first three lines only):
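The original example table is not reproduced here, but a hypothetical mapping file with the required layout (tab-delimited; the sample IDs, metadata values, and filenames below are made up and yours will differ) would look like:

```
#SampleID	BarcodeSequence	LinkerPrimerSequence	Location	Replicate	InputFileName	Description
RES1			Residential	1	RES1_S1_L001_R1.fasta	
UPA1			Upstream	1	UPA1_S2_L001_R1.fasta	
ATE1			AerationTank	1	ATE1_S3_L001_R1.fasta	
```

Note that the barcode and primer columns are present but empty, and the Description column comes last.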
The next step is to validate your mapping file. Still in the QIIME environment, type:
validate_mapping_file.py -m UEGP_Tutorial_Map_v2.txt -o check_id_output_v2/ -p -b -j InputFileName
This tells QIIME to check the mapping file assuming that it does not need to find primer (-p) or barcode (-b) information, and tells it to look for a unique identifier in the InputFileName column. My file shown above validated without errors or warnings.
Merging files into a combined FASTA
QIIME was written at a time when it was standard to receive mixed multiplexed data with barcodes, rather than fully demultiplexed data with barcodes trimmed off. So we have to actually put our nicely cleaned up data back into a combined file. QIIME has a script for that. The options you need are -i followed by the directory where your fasta files are, -m followed by the name of the mapping file, -c followed by the column with the input file names, -n with a number at which you want to start enumerating sequence labels, and -o for the filename of your combined file.
UEGP_16S_tutorial_files $ add_qiime_labels.py --fasta_dir=fastas --mapping_fp=UEGP_Tutorial_Map_v2.txt --filename_column=InputFileName --count_start=1 --output_dir=combined_fasta
This takes about 10 minutes on my laptop but YMMV.
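What add_qiime_labels.py does can be sketched in a few lines of Python (the sample IDs here are hypothetical): every sequence header is prefixed with its SampleID from the mapping file plus a running counter, so QIIME can later tell which sample each read in the combined file came from.

```python
# Toy sketch of the relabeling that add_qiime_labels.py performs.

def add_labels(sample_seqs, count_start=1):
    """sample_seqs: dict mapping SampleID -> list of sequences."""
    combined, n = [], count_start
    for sample_id, seqs in sample_seqs.items():
        for seq in seqs:
            combined.append(">%s_%d" % (sample_id, n))  # e.g. ">RES1_1"
            combined.append(seq)
            n += 1
    return combined
```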
Picking OTUs
An “operational taxonomic unit” is not a formal taxonomic grouping, but rather an operationally defined cluster of sequences that seem to belong together. OTUs can then be identified with some certainty by comparing their representative sequences to reference databases.
QIIME offers three main pipelines for picking OTUs: closed reference, open reference, and de novo. Closed reference OTU picking clusters your sequences against the Greengenes 16S reference database at 97% identity. Any sequences that fail to match the database are discarded. This requires the least time and memory, and if you are not interested in identifying possible novel taxa, it is a good choice.
De novo OTU picking does not use a reference database at all. Instead, it clusters the input sequences against one another, using one of several algorithms that can be chosen by the user (QIIME uses uclust by default). This can take a long time and use a lot of memory, depending on how many samples are processed together.
Open reference OTU picking is a combination of the two: it aligns the provided sequences to the Greengenes database, and any that fail to hit are clustered de novo. This can also take a long time and use a lot of memory, but it may be parallelized to reduce processing time.
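A bare-bones sketch of the de novo step, in the spirit of uclust's greedy clustering: each read either joins the first existing centroid it matches at 97% identity or above, or founds a new cluster. Real tools use alignment-based identity and careful ordering heuristics; this toy version assumes equal-length sequences and just counts mismatches.

```python
# Toy greedy de novo clustering at a 97% identity threshold.

def identity(a, b):
    """Fraction of matching positions (assumes comparable-length sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.97):
    clusters = []                    # each cluster: [centroid, members...]
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)  # joins the first centroid it matches
                break
        else:
            clusters.append([seq])   # seq becomes a new centroid
    return clusters
```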
The best method for your data will depend on your experimental design and hypothesis. Here, we will use open reference OTU picking. The input file will be the labeled, combined FASTA you created in the previous step. This is the command I used to run this step on my laptop; it took about 1.5 hours with this data set. For larger data sets you should run this step on the cluster.
pick_open_reference_otus.py -i ./combined_fasta/combined_seqs.fna -o open_reference_otus -aO 2
Chimera removal with vsearch
During the PCR amplification steps of library preparation, chimeric artifacts may be introduced: hybrid sequences assembled from fragments of two or more distinct templates. These need to be removed in de novo and open reference OTU picking workflows (closed reference OTU picking does not require this step, as chimeras are removed as part of the alignment to the Greengenes database). The most popular algorithm for this is usearch; however, it is not open source, and only the memory-limited 32-bit version of the software is freely licensed. Therefore, here we will use the fast, open source 64-bit alternative vsearch to detect chimeric sequences. Note: this step may be conducted on the entire combined_seqs.fna file prior to OTU picking; however, that greatly increases the time the step takes and is unnecessary, since QIIME has already picked representative sequences for downstream analysis. We will use the rep_set.fna file generated in the OTU picking step as input.
vsearch --uchime_ref open_reference_otus/rep_set.fna --db rdp_gold.fa --chimeras chimeras.fasta --nonchimeras nonchimeras.fasta
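Conceptually, uchime-style reference detection flags a read whose two halves best match two different reference parents. The sketch below captures only that idea, with made-up reference names and exact substring matching standing in for real alignment:

```python
# Toy illustration of the chimera idea: different halves, different parents.

def best_parent(fragment, refs):
    """Return the name of the first reference sequence containing this fragment."""
    for name, seq in refs.items():
        if fragment in seq:
            return name
    return None

def looks_chimeric(read, refs):
    half = len(read) // 2
    left = best_parent(read[:half], refs)
    right = best_parent(read[half:], refs)
    return left is not None and right is not None and left != right
```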
This generates one FASTA file containing the chimeric sequences, and one with all chimeric sequences removed. Now we must remove the chimeras from our OTU table, along with any OTUs that only appear once (singletons). Note: if you are examining 16S sequences from a cultured and controlled experiment rather than environmental samples, you may want to retain the singletons. In that case, simply remove the ‘-n 2’ from the end of the command.
filter_otus_from_otu_table.py -i open_reference_otus/otu_table_mc2.biom -o open_reference_otus/otu_table_mc2_nochimera.biom -e chimeras.fasta -n 2
This generates a new OTU table with all chimeric sequences and singleton OTUs removed. Because we’re doing this with our representative sequences instead of the entire combined FASTA file, it runs much faster!
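The filtering logic itself is simple; reduced to a sketch, it drops OTUs whose IDs were flagged as chimeric, then drops OTUs whose total count across samples is below the minimum (so -n 2 removes singletons):

```python
# Sketch of the table filtering done by filter_otus_from_otu_table.py.

def filter_otu_table(otu_counts, chimera_ids, min_count=2):
    """otu_counts: dict of OTU id -> list of per-sample counts."""
    return {
        otu: counts
        for otu, counts in otu_counts.items()
        if otu not in chimera_ids and sum(counts) >= min_count
    }
```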
Checking for problematic samples
Now that chimeras and singletons are removed, we need to look at a summary of our biom table to check for any samples that may exhibit problems that will confound our downstream results. Mainly, we may want to remove any samples that have drastically lower OTU counts than the others in the group. We can use the command below to generate a summary text file of our OTU table:
biom summarize-table -i open_reference_otus/otu_table_mc2_nochimera.biom -o otu_table_summary.txt
The summary for your data set is shown below. While there is a significant range of counts per sample, all of the samples have a relatively large number of counts and you most likely do not need to remove any samples from this group to maintain the quality of your analysis. If you did have to remove samples you would run a couple of extra scripts at this point.
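The judgment we make by eye on the biom summary can be sketched as a depth check: compute counts per sample and flag any sample far below the rest. The 10%-of-median cutoff here is an arbitrary illustration, not a QIIME rule.

```python
# Flag samples whose total OTU counts are drastically lower than the rest.

def flag_low_depth(sample_counts, fraction=0.1):
    """sample_counts: dict of sample id -> total count from the biom summary."""
    depths = sorted(sample_counts.values())
    median = depths[len(depths) // 2]
    return [s for s, d in sample_counts.items() if d < fraction * median]
```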
Rebuilding the phylogenetic tree
You can now rebuild your phylogenetic tree file with chimeras removed.
filter_tree.py -i open_reference_otus/rep_set.tre -t chimeras.fasta -o open_reference_otus/rep_set_nochimeras.tre
Calculating alpha diversity
Now that we have our filtered biom table, we’re ready to look at some species richness. Alpha diversity is a measure of the species richness and abundance in each sample in the OTU table and mapping file. This must be done on raw, non-normalized OTU tables. The table that you produced above with nochimeras can be used here. The alpha diversity calculation also requires a tree file, which was produced above as rep_set_nochimeras.tre.
There are many alpha diversity metrics you can choose to use in QIIME (see them all by typing alpha_diversity.py --help), but two of the most common measures are Shannon entropy and Chao1 richness. Here, we calculate Shannon, Chao1, phylogenetic distance (PD_whole_tree), and the number of unique observed OTUs per sample.
alpha_diversity.py -i open_reference_otus/otu_table_mc2_nochimera.biom -m PD_whole_tree,shannon,chao1,observed_species -o alpha_diversity.txt -t open_reference_otus/rep_set_nochimeras.tre
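To see what the numbers in alpha_diversity.txt mean, the two most common metrics can be computed by hand. Shannon entropy here uses log base 2, as QIIME does; Chao1 estimates unseen richness from the counts of singleton (F1) and doubleton (F2) OTUs, using the bias-corrected form.

```python
import math

def shannon(counts):
    """Shannon entropy (base 2) of one sample's OTU counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in props)

def chao1(counts):
    """Bias-corrected Chao1 richness estimate."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)   # singletons
    f2 = sum(1 for c in counts if c == 2)   # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

For example, a sample split evenly between two OTUs has a Shannon entropy of exactly 1 bit.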
For each alpha diversity metric you specified, the program produces columns with three kinds of values: the raw per-sample values, normalized values, and the subsampled bins used to normalize the raw values. The command below takes these columns from the alpha diversity file (alpha_diversity.txt) and adds them to your original mapping file so you can continue with the subsequent analysis.
add_alpha_to_mapping_file.py -i alpha_diversity.txt -m UEGP_Tutorial_Map_v2.txt -o UEGP_Map_With_Alpha.txt
Testing for differential abundance
Our first statistical comparison is direct and simple: which OTUs are significantly different between specified categories? The category options available to you will vary depending on what metadata columns you added to your mapping file. This step uses raw, non-normalized counts, much like alpha diversity. Here, we test which OTUs are differentially abundant between two of our sampling locations: upstream A (our clean pre-wastewater stream) and untreated residential wastewater. Under the hood, this uses parts of the DESeq2 algorithm to calculate the differentials.
differential_abundance.py -i open_reference_otus/otu_table_mc2_nochimera.biom -m UEGP_Map_With_Alpha.txt -o diff_otus.txt -a DESeq2_nbinom -d -c Location -x Upstream -y Residential
The output will be two files. One is a text file containing a list of differentially abundant OTUs between the two categories you specified, along with their test statistics, log-fold changes, and taxonomy. The other is a PDF file showing a volcano plot and dispersion curve. You can repeat this for any other pair of categories you wish to compare.
There are a lot of significant differences between these two locations, as we might expect. OTUs in the table are labeled by number, which is not necessarily very helpful with interpretation, but we can do some further work to try to figure out which species are present in the residential wastewater that aren’t in the clean stream.
Normalizing the OTU table
Now that we’ve calculated alpha diversity and differential abundance, we should normalize the table for use with other analyses that do not require raw data, such as beta diversity. We will use CSS normalization instead of DESeq, because DESeq can introduce negative values as a result of its log transformation, and this can present problems for downstream diversity analyses.
normalize_table.py -i open_reference_otus/otu_table_mc2_nochimera.biom -a CSS -o normalized_otu_table.biom
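A much-simplified sketch of what cumulative sum scaling (CSS) does: instead of dividing by the total library size, each sample's counts are divided by the sum of its counts up to a chosen quantile, which damps the influence of a few dominant OTUs. Here the quantile is fixed at the median of the nonzero counts for simplicity; the real metagenomeSeq implementation chooses it adaptively per dataset.

```python
# Simplified CSS normalization for one sample's counts (illustrative only).

def css_normalize(counts, scale=1000):
    nonzero = sorted(c for c in counts if c > 0)
    quantile = nonzero[len(nonzero) // 2]            # median nonzero count
    norm_factor = sum(c for c in counts if c <= quantile)
    return [c / norm_factor * scale for c in counts]
```

Note how a single dominant OTU (the 100 below) is excluded from the scaling factor, so it no longer drags down every other OTU's normalized abundance the way total-sum scaling would.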
No one ever said bioinformatics software was perfect, and here’s an example. When we normalize QIIME data using R packages, they can strip or corrupt the taxonomy formatting of the OTU tables. To avoid errors downstream, we must manually add that information back in with a biom command, using our original uclust-assigned taxonomy text file:
biom add-metadata -i normalized_otu_table.biom -o normalized_otu_table_taxa.biom --observation-metadata-fp open_reference_otus/uclust_assigned_taxonomy/rep_set_tax_assignments.txt --sc-separated taxonomy --observation-header OTUID,taxonomy
If at any point after this you receive the error: “TypeError: iteration over a 0-d array”, your biom table taxonomic data is corrupted and the above step will need to be repeated.
Summarizing taxa
Now that we have a normalized table, we can accurately see what taxa are present in each sample, and the relative amounts of each. The script below will generate summaries of taxa from an input biom file at the phylum (level 2), class (level 3), order (level 4), family (level 5), genus (level 6), and, if available, species (level 7) levels. If you only want one level to be calculated (e.g., phylum only), you may pass the -L parameter and a level number to the script.
The output below is the L6 table for your data. You can see at the beginning of each row a description of the taxonomic assignment and then the quantities at which it is present in each sample.
summarize_taxa.py -i normalized_otu_table_taxa.biom -o ./taxa_summary
You can also tell QIIME to add this classification information to your mapping file with the ‘-m’ parameter. QIIME will create a new mapping file in the taxa directory with the added information for each taxonomic level specified (e.g., mapping_L2.txt, mapping_L3.txt, etc.). This can be useful later for creating visualizations within QIIME or doing statistical tests on taxa from the mapping file, but it is entirely optional. To do this, add -m UEGP_Map_With_Alpha.txt to the command line above.
You can make some basic figures in QIIME, like bar plots, to display abundances, with a similar command:
summarize_taxa_through_plots.py -i normalized_otu_table_taxa.biom -o taxa_plots -m UEGP_Map_With_Alpha.txt
This will produce the commonly used bar plots that are typical in microbiome publications, along with HTML legends to help you interpret them. You can access all the results for your dataset by using a browser to open the bar_charts.html or area_charts.html files in your output directory.
Compute the core microbiome
Different locations and/or treatments may contain unique OTU assignments that are not present in other groups. This next command produces a TSV list of the OTUs shared within a specified group, subsampled at 5% intervals starting at 50% (e.g., at the 50% level, at least 50% of the samples must contain an OTU for it to be included in the list). Below we are computing the core microbiome for all stream samples. The output will include biom files and text files for each shared level.
To do this I had to change a column in my UEGP_Map_With_Alpha file. I replaced the replicate information, which really wasn’t useful to me in the end, with a Type column classifying the samples into stream, plant, and influent locations.
compute_core_microbiome.py -i normalized_otu_table_taxa.biom -o core_microbiome_stream --mapping_fp UEGP_Map_With_Alpha.txt --valid_states "Type:stream"
This analysis produces tables of core OTUs present across samples at different thresholds. Here is what we find common in our stream samples — there is an immense amount of data available here and making sense of it all is more work than we can do in one lab, but you can get an idea, for example, of which OTUs are present in 100% of your stream data, and contrast with other subsets of the dataset.
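In miniature, what compute_core_microbiome.py reports at each threshold is this: an OTU is "core" at fraction f if it is present (count greater than zero) in at least f of the samples in the chosen group.

```python
# Sketch of the core-microbiome criterion at a single threshold.

def core_otus(otu_counts, fraction):
    """otu_counts: dict of OTU id -> per-sample counts for one group."""
    core = []
    for otu, counts in otu_counts.items():
        present = sum(1 for c in counts if c > 0)
        if present >= fraction * len(counts):
            core.append(otu)
    return core
```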
Beta diversity analysis
Previously we calculated alpha diversity, the within-sample richness. Now we need to calculate beta diversity, the dissimilarity between samples within an experimental group, using multiple pairwise comparisons. QIIME offers a quantitative metric that includes phylogenetic information in its distance calculations (weighted UniFrac, using our previously calculated .tre file), a qualitative metric that uses phylogeny but only presence/absence of OTUs (unweighted UniFrac), and a dissimilarity based on abundance counts per sample (Bray-Curtis). All three may be calculated at the same time, and which one you choose to use may depend on your dataset. This command generates a distance matrix file for each method:
beta_diversity.py -i normalized_otu_table_taxa.biom -m bray_curtis,unweighted_unifrac,weighted_unifrac -t open_reference_otus/rep_set_nochimeras.tre -o beta_diversity
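The simplest of the three metrics, Bray-Curtis, is easy to compute by hand: it is 0 for identical abundance profiles and 1 for samples that share no OTUs at all.

```python
# Bray-Curtis dissimilarity between two samples' OTU count vectors.

def bray_curtis(u, v):
    shared = sum(min(a, b) for a, b in zip(u, v))   # abundance shared by both
    return 1 - 2 * shared / (sum(u) + sum(v))
```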
Making PCoA plots
Once you have beta diversity distance matrices, you can generate principal coordinates from each matrix to plot in QIIME. The command below takes the weighted UniFrac distance matrix we just created and calculates the coordinates for the plots:
principal_coordinates.py -i beta_diversity/weighted_unifrac_normalized_otu_table_taxa.txt -o beta_diversity/weighted_unifrac_coords.txt
To plot unweighted unifrac or bray-curtis coordinates, repeat the above script using the corresponding distance matrices.
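Under the hood this is classical PCoA, which can be sketched from scratch: double-center the squared distance matrix, eigendecompose it, and scale the leading eigenvectors by the square roots of their eigenvalues. The toy three-sample distance matrix in the usage note is made up.

```python
import numpy as np

# Classical principal coordinate analysis (PCoA) from a distance matrix.

def pcoa(dist, n_axes=2):
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering         # Gower's double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]             # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # scale axes by sqrt of eigenvalues (clamped at 0 for numerical noise)
    return eigvecs[:, :n_axes] * np.sqrt(np.maximum(eigvals[:n_axes], 0))
```

For three collinear "samples" with pairwise distances 3, 7, and 10, a single axis reproduces the distances exactly; real beta diversity matrices need more axes, and the plots show the first two or three.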
To create PCoA plots from the coordinates you just calculated, you can use the command:
make_2d_plots.py -i beta_diversity/weighted_unifrac_coords.txt -m UEGP_Map_With_Alpha.txt
You will produce one HTML file that you can open in a browser to look through all of the various plots that are produced. Not all of these will be meaningful. There will be numerous plots, labeled by column in the mapping file. One potentially meaningful labeling of principal components is based on the Type column we created above. You can see in this plot that the group of “stream” samples is quite different from the others, as we’d expect. If you added summarized taxa to your mapping file, you will also have plots labeled by particular taxa at a selected taxonomic level. You could go back and add that information if you did not do so previously.
Finding significant differences between groups
One final thing we can look at with QIIME scripts is the significance of differences between groups of samples. Many statistical tests are available to do this in QIIME.
Adonis, ANOSIM, bio-env, Moran’s I, MRPP, PERMANOVA, PERMDISP, and db-RDA are all available. Adonis is a more robust form of PERMANOVA that eliminates the need to run PERMDISP alongside it, and it can handle both numerical and categorical variables; db-RDA is very similar to Adonis. ANOSIM is similar to MRPP, which determines whether groups are significantly different based on categorical variables. Here, we will run an Adonis test on the distance matrices generated in the beta diversity step above (not the coordinates file), comparing samples grouped by the “Type” factor. In the small subset of the UEGP data you have here, the only other possible split is on the Location factor, so you could also try that.
compare_categories.py -i beta_diversity/weighted_unifrac_normalized_otu_table_taxa.txt --method adonis -m UEGP_Map_With_Alpha.txt -c Type -o weighted_adonis_type
Our results from this analysis show (no surprise, really) that the different types of sample (stream vs. influent vs. inside the wastewater treatment plant at the aeration stage) are significantly different, with an R2 of 0.86, meaning sample type explains 86% of the variation in the between-sample distances. Other factors that we looked at in the main study, including collection timepoints, gave very different results.
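The R2 that Adonis reports can be computed by hand on a toy distance matrix: the total sum of squares of the distances is partitioned into within-group and between-group parts, and R2 is the between-group share. (Adonis adds a permutation test on top of this to produce a p-value; the four-sample matrix below is made up.)

```python
from itertools import combinations

# The Adonis/PERMANOVA R-squared: between-group share of total sum of squares.

def adonis_r2(dist, groups):
    n = len(groups)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in set(groups):
        members = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    return (ss_total - ss_within) / ss_total
```

Two tight groups that sit far apart give an R2 near 1; putting every sample in one group gives exactly 0.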