Gene expression analysis with DE-Seq

Gene expression analysis with DE-Seq

With many thanks to Anju Lulla — this is a modification of a protocol she used for the paper we are working on with our collaborators.

To start off this lab, you should have an output file from featurecounts with five columns.  The gene ID column followed by counts from both samples in group A, then counts from both samples in group B.

There’s two ways to run this analysis. You can use R at the command line, or you can use R through the RStudio package. Either can be installed with homebrew.  Install R first if you don’t already have it (brew install R). Then if you want to use RStudio, you can brew install Caskroom/cask/RStudio.

I’m going to do it the RStudio way and walk you through the steps, but you can do it in an R script or just at the command line. The png plots (volcano plot and p-value histogram) may only work in a non-interactive mode (i.e. not in RStudio).

You may want to clear your workspace:

rm(list=ls())

If you don’t have it already, or if you just did a fresh upgrade, you need to get DESeq2 (and before that, Bioconductor):

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Then you should load the packages you need using the library command:

(in this view I’ve also got all the Bioconductor base functions and generic functions selected under the Package tab at bottom right, so they are loaded automatically).

Now you can import a dataset as a table (edit to your correct path, obviously).

countdata<- read.table("/Users/cgibas/Desktop/CMCP6countsnew.txt", header = TRUE, row.names = 1)

You can view the file by typing “countdata” at the command line, or see it in your data pane in RStudio. This is what the result should look like:

Convert your data table to a matrix using the matrix command (type into the lower left pane or include this in your script:

countdata<- as.matrix(countdata)

Then define the experimental relationships of your columns and define your data frame:

condition<-factor(c(rep("A",2),rep("B",2)))
colData<- data.frame(row.names = colnames(countdata), condition)

Again, you can see what each of these objects contains by typing their name:

colData

Now make a DeSeq data set from your count matrix and data frame information:

dds<- DESeqDataSetFromMatrix(countData = countdata, colData = colData, design = ~condition)

And then order and select some data (these are top 25 rows sorted by the mean of un-normalized counts):

select <- order(rowMeans(counts(dds,normalized=FALSE)),decreasing=TRUE)[1:25]

Apply a log transform, and then your count data can be visualized as a heatmap or a PCA plot:

rld<- rlogTransformation(dds, fitType="mean")

pheatmap(assay(rld)[select,], cluster_rows=FALSE, show_rownames=TRUE,
 cluster_cols=FALSE, annotation_col=df)

 

plotPCA(rld, intgroup = "condition")

 

Then, you will apply DESeq to test for statistically significant differential expression.

dds <- DESeq(dds)

And then get your results and order them by p-value:

res <- results(dds)
res <- res[order(res$padj), ]

Which will produce a data frame that looks like this:

Then you can merge your results data frame with your counts data frame to create a table that contains both counts and statistical results:

resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by=”row.names”, sort=FALSE)
names(resdata)[1] <- “Gene”

You can quickly calculate the number of DE (true) and non-DE (false) genes:

deCount <- table(res$padj<0.05)

Write a CSV file that contains all your results, to be read into other programs (even Excel!):

write.csv(resdata, file="difExp_results.csv")

Display the distribution of p-values

png("DE_pvals.png", 1000, 1000, pointsize=20)
hist(res$pvalue, breaks=50, col="grey")

Produce a volcano plot of significance vs. effect size:

volcanoplot with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main=main, ...))
with(subset(res, padjlfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="orange", ...))
with(subset(res, padjlfcthresh), points(log2FoldChange, -log10(pvalue), pch=20, col="green", ...))
if (labelsig) {
require(calibrate)
with(subset(res, padjlfcthresh), textxy(log2FoldChange, -log10(pvalue), labs=Gene, cex=textcx, ...))
}
legend(legendpos, xjust=1, yjust=1, legend=c(paste("FDR<",sigthresh,sep=""), paste("|LogFC|>",lfcthresh,sep=""), "both"), pch=20, col=c("red","orange","green"))
}
png("volcano.png", 1200, 1000, pointsize=20)
volcanoplot(resdata, lfcthresh=1, sigthresh=0.05, textcx=.8, xlim=c(-2.3, 2), ylim=c(0,5))

and display a heatmap of significantly DE genes (as opposed to the heatmap of raw counts you produced above:

library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)),decreasing=TRUE),20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld))
pheatmap(mat, annotation_col=df)

Comments are closed.