BINF 6203: Whole genome shotgun metagenomics with MetaPhLan

Like last week’s tutorial, this tutorial uses Urban Environmental Genomics Project data. The original version of the tutorial was developed by Anju Lulla for our student interns.

Preparation and software installation

MetaPhlAn can be installed through homebrew but both Anju and I found that there were some problems with that package. The alternative is to clone it: hg clone Note: this will put metaphlan2’s directory as a subdirectory of your current working directory. You will need to chmod a+x You will need to put that directory in your bash path unless you want to type the full path to metaphlan2 every time.

Metaphlan2 has some dependencies that you will need to install on your own, if you choose to clone it.  Use brew install to install (or upgrade) bowtie2. Use pip install to install or upgrade matplotlib and scipy.

After Metaphlan2 is installed, you can be sure it will run by calling: -–h 

Continue reading “BINF 6203: Whole genome shotgun metagenomics with MetaPhLan”

BINF 6203: 16S rRNA classification with QIIME

This tutorial makes use of the data from the Urban Environmental Genomics Project, a collaboration seeded by the Department of Bioinformatics and Genomics and involving participants from our department as well as Civil Engineering, Biology, and Geography and Earth Science. Many thanks to Kevin Lambirth, who created the original version of this tutorial for our lab interns.

Our goal in lab this week is to analyze 16S ribosomal RNA sequences from mixed microbial samples using QIIME. The QIIME analysis will tell us what identifiable microbes are present in the samples (usually at the genus level rather than specific species or strains) and also allow us to perform some high level comparisons between samples.

The samples that you have will come from four locations throughout the Charlotte-Mecklenburg wastewater treatment system. All samples were collected on the same day.

  • RES (residential sewage) 3x
  • UPA (stream background upstream of treatment plant) 3x
  • ATE (aeration tank effluent — inside the treatment plant) 3x
  • DSA (stream, downstream of treatment plant) 3x

Continue reading “BINF 6203: 16S rRNA classification with QIIME”

Differential expression analysis with DE-Seq

With many thanks to Anju Lulla — this is a modification of a protocol she used for the paper we are working on with our collaborators.

To start off this lab, you should have an output file from featurecounts with five columns.  The gene ID column followed by counts from both samples in group A, then counts from both samples in group B.

There’s two ways to run this analysis. You can use R at the command line, or you can use R through the RStudio package. Either can be installed with homebrew.  Install R first if you don’t already have it (brew install R). Then if you want to use RStudio, you can brew install Caskroom/cask/RStudio.

I’m going to do it the RStudio way and walk you through the steps, but you can do it in an R script or just at the command line. The png plots (volcano plot and p-value histogram) may only work in a non-interactive mode (i.e. not in RStudio).

You may want to clear your workspace:


If you don’t have it already, or if you just did a fresh upgrade, you need to get DESeq2 (and before that, Bioconductor):


Then you should load the packages you need using the library command:

(in this view I’ve also got all the Bioconductor base functions and generic functions selected under the Package tab at bottom right, so they are loaded automatically).

Now you can import a dataset as a table (edit to your correct path, obviously).

countdata<- read.table("/Users/cgibas/Desktop/CMCP6countsnew.txt", header = TRUE, row.names = 1)

You can view the file by typing “countdata” at the command line, or see it in your data pane in RStudio. This is what the result should look like:

Convert your data table to a matrix using the matrix command (type into the lower left pane or include this in your script:

countdata<- as.matrix(countdata)

Then define the experimental relationships of your columns and define your data frame:

colData<- data.frame(row.names = colnames(countdata), condition)

Again, you can see what each of these objects contains by typing their name:


Now make a DeSeq data set from your count matrix and data frame information:

dds<- DESeqDataSetFromMatrix(countData = countdata, colData = colData, design = ~condition)

And then order and select some data (these are top 25 rows sorted by the mean of un-normalized counts):

select <- order(rowMeans(counts(dds,normalized=FALSE)),decreasing=TRUE)[1:25]

Apply a log transform, and then your count data can be visualized as a heatmap or a PCA plot:

rld<- rlogTransformation(dds, fitType="mean")

pheatmap(assay(rld)[select,], cluster_rows=FALSE, show_rownames=TRUE,
 cluster_cols=FALSE, annotation_col=df)


plotPCA(rld, intgroup = "condition")


Then, you will apply DESeq to test for statistically significant differential expression.

dds <- DESeq(dds)

And then get your results and order them by p-value:

res <- results(dds)
res <- res[order(res$padj), ]

Which will produce a data frame that looks like this:

Then you can merge your results data frame with your counts data frame to create a table that contains both counts and statistical results:

resdata <- merge(,, normalized=TRUE)), by=”row.names”, sort=FALSE)
names(resdata)[1] <- “Gene”

You can quickly calculate the number of DE (true) and non-DE (false) genes:

deCount <- table(res$padj<0.05)

Write a CSV file that contains all your results, to be read into other programs (even Excel!):

write.csv(resdata, file="difExp_results.csv")

Display the distribution of p-values

png("DE_pvals.png", 1000, 1000, pointsize=20)
hist(res$pvalue, breaks=50, col="grey")

Produce a volcano plot of significance vs. effect size:

volcanoplot with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main=main, ...))
with(subset(res, padjlfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="orange", ...))
with(subset(res, padjlfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="green", ...))
if (labelsig) {
with(subset(res, padjlfcthresh), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=textcx, ...))
legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green"))
png("volcano.png", 1200, 1000, pointsize=20)
volcanoplot(resdata, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-2.3, 2), ylim=c(0,5))

and display a heatmap of significantly DE genes (as opposed to the heatmap of raw counts you produced above:

topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <-
pheatmap(mat, annotation_col=df)

BINF 6203: Expression quantitation with featurecounts

In this lab and the next, we are going to use two different methods to calculate differential expression for the same RNASeq dataset. Note: there is no standalone lab writeup due for the featurecounts part. Instead you’ll do one writeup for this part plus the differential expression analysis with DeSeq.

In a nutshell, we have measured gene expression under two conditions (two replicates each) and we want to find out which genes are the most significantly differentially expressed between the two conditions. We are concerned with two things — magnitude of differential expression, and significance of differential expression. We want to see only genes with both high significance (p value definitely less than 0.05) and large effect size (2-fold change is a common threshold).

Transcript data is handled differently depending on whether or not you have a reference genome to which to map the samples. If you do, you use a pipeline like:

bowtie2 (mapping) –> featurecounts (what it says) –> edgeR (normalization and differential analysis)

If you have no reference genome, you use a pipeline like:

Trinity (transcript assembly) — corset (clustering and counting) –> edgeR (normalization and differential analysis).

Continue reading “BINF 6203: Expression quantitation with featurecounts”

BINF 6203: Simple variant calling

This lab takes us back to the tomato chloroplast genome data again. You’ve QCed the chloroplast data, assembled it, mapped it, used it to learn genome track math with bedtools, and produced a consensus sequence from mapped reads.

From a biological perspective this data is interesting because it represents organelle genomes from a closely related set of tomato cultivars. While all of these cultivars are the same species, tomatoes as different as the German Cherry and the Cherokee Purple have obviously undergone some divergence. Here, you are trying to see where that divergence is represented in the chloroplast sequence.

Your goal in this exercise is to find out:

  • How does the genomic divergence between these cultivars show up in terms of individual sequence variations in the chloroplast genome?
  • Are variations between the genomes primarily SNPs, or small insertions or deletions? We’ll go over the VCF file format in class so you can see how to tell.
  • How are the individual SNPs and/or indels dispersed in the chloroplast genome? Do they disproportionately affect particular genes or regions?
  • Can you see any evidence of heterogeneity in the chloroplast genome sequence in these samples (i.e. more than one variant at the same site, with respect to the reference genome?)

Use the sequencing data for all 12 samples to answer this question.

In your report

Describe the workflow of tools (most of which you have already learned) that you will need to carry out this project, starting from the beginning (assuming you have raw data and need to start from square one). For these data, it may be worth exploring the results of being more strict about quality score cutoffs than you were when the goal was mapping and creating a consensus. But there will also be a trade off with throwing away too much of a data set that is already rather small.

You may need to look at manual pages for some tools that we have already learned in order to tweak how you use them to get the right answers. Be thorough in your description of methods, so that another person could reproduce your work exactly.

Use genome browser plots, tables, or other figures to summarize your findings as appropriate.

One new thing to learn before you start

You’ve already learned most of the tools that you will need to use to complete this project. However, there’s one thing that we haven’t done yet. You need to run samtools mpileup on a bam file, the same thing we did as an intermediate step in the read mapping lab, and use bcftools to produce a variant call file, which you will use directly instead of proceeding to a consensus. This means following the approach in the read mapping lab, but stopping before you produce a consensus fasta with

samtools mpileup -uf reference.genome.fa infile.bam | bcftools call -c filename.out

The bcftools call parameter settings are where you should look if you want to change how many alternate alleles you think are possible (which might be helpful in detecting multiple forms of the chloroplast genome). The current version of bcftools has two calling methods — the multiallelic caller and the standard caller (bedtools call -c). We are looking, of course, at data from a haploid genome and comparing it to a close reference. Thus you might expect that only one variant would exist in the sample and you do not need to use the multiallelic caller. There is some evidence that chloroplast sequences from a single plant may show chimerism (i.e. chloroplasts in different cells will have slightly different sequences, just like human cells at different places in the body can show chimerism due to different somatic mutations). If you used the multiallelic caller you might be able to see such phenomena if they occurred. You can set ploidy and choose which caller to use following the options in the current version of the documentation.

HINTS: If you remember from the bedtools lab a couple of weeks ago, a variant call file in VCF format can be overlapped with a GFF feature file for the same reference genome, and you can make the overlaps show which genes have variants. BCF or VCF files can also be loaded into the genome viewers (IGB, IGV) that you have tried out before.

BINF 6203: Genome track math

You can think of genomic data formats in a few main categories:

  • Sequence data (FASTA, FASTQ)
  • Alignments (BLAST results, SAM and BAM files)
  • Track data (genomic intervals — coordinate range tracks, WIG files — continuous value tracks)
  • Tabular data (many kinds — interval data is a subset)

Genome browsers are designed to get different types of data together using a common reference system of genomic coordinates and display them together. Genome browsers are somewhat less good at integrating data so that you can query it, but there are command line tools and other types of systems that can help with that.

The purpose of today’s lab is to get you familiar with track types and experienced with executing a few common track math operations in the UCSC browser and with bedtools.

Continue reading “BINF 6203: Genome track math”

BINF 6203: Read Mapping

Mapping short sequence reads to a reference sequence is a common task in genomics. Many different results can be extracted from a mapped sequence, depending on the original experimental design that produced the sequence reads and on the analysis that follows the mapping. For example:

  • a genomic consensus for an individual (against the reference genome for that species)
  • location of SNPs and other variations in one genome relative to the other
  • location of expressed transcripts (coding mRNAs, noncoding RNAs such as miRNAs and lncRNAs)
  • quantity of expressed transcripts (for differential expression analysis)
  • location of protein binding sites detected by assays such as ChIP-seq

However, all of these analyses ultimately come back to mapping, and their outcomes depend on the accuracy of the mapping process, on how non-uniquely mapping reads are counted, and on the quality of the reference annotation used to interpret the mapped reads.

Continue reading “BINF 6203: Read Mapping”

BINF 6203: Genome Comparison with Mauve

There are many ways to compare genomes, and these comparisons provide different kinds of information about evolutionary history and shared function.

  • First, how do you decide which genes are “the same” across multiple genomes, i.e. orthologs, genes having a common ancestor and related by speciation.
  • Second, which parts of the nucleotide sequence of multiple genomes align, irrespective of gene boundaries and orthology relationships.
  • Finally, how do we decide what genes “are” and what they do to help interpret the meaning of genomic similarities and differences.

Continue reading “BINF 6203: Genome Comparison with Mauve”

BINF 6203: Bacterial genome annotation with prokka

In previous iterations of this course, we’ve used the RAST/SEED servers to perform annotation. RAST is a highly regarded genome annotation pipeline, and you will easily be able to justify using it to reviewers if you are ever writing a bacterial genome paper. However, it can be cumbersome to use the myRAST interface, the command line tools are not particularly well documented, and it can take a day or more to run. So this year we’re going to try a newer method, Torsten Seemann’s Prokka software. If you are interested in learning how to use RAST, the previous year’s tutorial is still available here.

Why use Prokka? First, because in a benchmark test it has been shown to be as or more accurate at reproducing known annotation than RAST or xBASE2 in most annotation categories. Second, because it’s fast and you can run it on a standard laptop in less than ten minutes.

Continue reading “BINF 6203: Bacterial genome annotation with prokka”

BINF 6203: Sequence assembly with SPAdes

For today’s class, I have prepared a shared dropbox file that contains the following items:

  • sequence reads from one of the better chloroplast genome samples generated with our Ion Torrent instrument
  • ERR008613 (a set of paired end Illumina sequence reads from ends of 200bp E. coli fragments — sampled using seqtk)
  • ERR022075 (a set of paired end Illumina sequence reads from ends of 600bp E. coli fragments — sampled using seqtk)
  • sets of PacBio CCS and CLR reads for E. coli

In the first part of the exercise, you will assemble the Ion Torrent reads alone to see what kind of assembly you get for the chloroplast genome, and to learn the basics of SPAdes.

In the second part, you will progressively introduce long-fragment and long-read data to your E. coli assembly, to see how the addition of these data types changes the assembly results.

Continue reading “BINF 6203: Sequence assembly with SPAdes”