Whole genome shotgun metagenomics with MetaPhLan

Whole genome shotgun metagenomics with MetaPhLan

Like last week’s tutorial, this tutorial uses Urban Environmental Genomics Project data. The original version of the tutorial was developed by Anju Lulla for our student interns.

Preparation and software installation

You can use metaphlan on the cluster and that’s probably the best idea. If you have a reasonably powerful laptop you can run it there on these small data sets.

You can download Metaphlan by cloning it (hg/Mercurial): hg clone https://bitbucket.org/biobakery/metaphlan2. Note: this will put metaphlan2’s directory as a subdirectory of your current working directory. You will need to chmod a+x metaphlan2.py. You will need to put that directory in your bash path unless you want to type the full path to metaphlan2 every time.

Metaphlan2 has some dependencies that you will need to install on your own, if you choose to clone it.  Use brew install to install (or upgrade) bowtie2. Use pip install to install or upgrade matplotlib and scipy.

After Metaphlan2 is installed, you can be sure it will run by calling:

metaphlan2.py -–h 

Sample Files

We’ve run QC steps for you and pre-sampled the FASTQ files down to 20% of their original size using seqtk. The subsampled files can be found in the google drive sampled_tp1_fastq.  Just like last time, there are 12 samples represented in the files for this exercise. Each sample is represented by a pair of FASTQ files, one for the forward and one for the reverse reads. Your samples are from:

  • UPA (3x)
  • RES (3x)
  • ATE (3x)
  • DSA (3x)

Running Metaphlan2 one file at a time:

The command for running metaphlan2 one file at a time is shown below:

~/path/to/metaphlan2.py --input_type fastq 19APR2016HiSeq_Run_Sample_UPA_66D2_UNCC_Gibas_GCCAATGT_L007_R1_001-sampled.fastq,19APR2016HiSeq_Run_Sample_UPA_66D2_UNCC_Gibas_GCCAATGT_L007_R2_001-sampled.fastq --bowtie2out UPA_66D2.bz2 > profiled_UPA_66D2.txt

The processing time for this step was about 15 minutes on my laptop. If you have a multi-core machine you can reduce this time by adding the -–nproc flag followed by the number of processors you want metaphlan2 to use.

This will produce two output files one with extension .bz2 containing the mapping information and a .txt file that contains the relative abundance values at each taxonomic level.

Repeat this step for all twelve samples. (Of course we can do this with a script, and metaphlan2 is also available on the clusters.)

Once you have all twelve .txt files, merge the twelve files using the following command. The merge utility program is found in the utils directory under your metaphlan directory.:

~/path/to/metaphlan2/utils/merge_metaphlan_tables.py *.txt > merged_abundance_table.txt

This table contains information of all taxonomic levels, we are only interested in species level identification, so we can use the UNIX command below to filter out just what we want:

grep -E “(s__)|(^ID)” merged_abundance_table.txt | grep -v “t__” | sed ‘s/^.*s__//g’ > merged_abundance_table_species.txt

What does this command mean? If you open your original abundance table you will see that it looks something like this:

Each line contains the classification of a cluster of reads, at the kingdom, phylum, class, order, family, genus, species level. To get only the species level classifications, you can use grep to find lines containing the “s__” pattern. That’s the first part of the command before the first pipe “|” character.

What you’ll find is that there are a few lines that also contain a t__ classification (strain level) and you’ll need to filter those out. The grep -v portion of the command (between the first and second pipes) will do this. Finally, the portion of the command after the last pipe (sed ‘s/^.*s__//g’) uses sed, the stream editor, to remove everything in the description line other than the actual name of the bacterial strain represented.

Making a Heatmap:

Heatmaps are a typical way of visualizing taxonomic clusters to get an idea about presence and absence of species across multiple samples. To make the heatmap we can use the hclust script/package, developed by the Nicola Segata lab. Download the script using the following command:

hg clone https://bitbucket.org/nsegata/hclust

We can use this species level abundance table to generate a heatmap similar to the one shown below, with the following command:

~/Applications/metaphlan2/hclust/hclust.py --in merged_UEGP_abundance_species.txt --out merged_UEGP_abundance_species_heatmap.png -c bbcry --top 25 --minv 0.1 -s log --flabel 1 --slabel 1

Note: something’s off with a few of my samples, but this should give you the idea. The heatmap gives you an idea of which species are abundant in each location.

Running Metaphlan2 on all files at the same time:

The commands shown below can be incorporated into a script and this will enable you to run the job on the cluster by sampling submitting the script using qsub.

The strategy for doing this is a bit different. I concatenated the paired files first and then used the merged.fastq files as an input for metaphlan2

cd sampled_tp1_fastq

The command to merge the R1 and R2 files:

for file1 in ./*R1*; do file2=${file1/R1/R2}; zcat ${file1} ${file2} > ${file1%R1*}merged.fastq; done

This command will produce a *merged.fastq file which will be used as an input for metaphlan2 in the command shown below:

for f in *merged.fastq; do metaphlan2.py $f --input_type multifastq --nproc 12 > {f%merged.fastq}_profile.txt; done

Here, I am running this on the cluster so I have used an nproc value of 12 to speed up the process.

The next two commands are the same as before:

merge_metaphlan_tables.py *.txt > merged_abundance_table.txt

grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt

Abundance data from MetaPhlan can also be treated the same as 16S data — you can import it into other packages that will compute various types of diversity metrics, MDS plots, etc. For this report though, since it is so close to end of term, just focus on the following: 1) what are the biological properties of some of the top taxa in each location? You can use Micropedia or other online sources to get some basic information about them. 2) do the samples cluster as you expect them to (and explain why or why not)?

Comments are closed.