Hugo Varet, Julie Aubert and Jacques van Helden
2017-01-08
For people using their own laptop, install some R packages:
source("http://bioconductor.org/biocLite.R")
if (!require(DESeq2)) {
biocLite("DESeq2", ask=FALSE)
}
if (!require(edgeR)) {
biocLite("edgeR", ask=FALSE)
}
if (!require(gplots)) {
biocLite("gplots", ask=FALSE)
}
Study of 48 WT yeast samples vs 48 Snf2 (KO) samples:
Gierliński,M., Cole,C., Schofield,P., Schurch,N.J., Sherstnev,A., Singh,V., Wrobel,N., Gharbi,K., Simpson,G., Owen-Hughes,T., et al. (2015) Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics, 10.1093/bioinformatics/btv425. pdf
Schurch,N.J., Schofield,P., Gierliński,M., Cole,C., Sherstnev,A., Singh,V., Wrobel,N., Gharbi,K., Simpson,G.G., Owen-Hughes,T., et al. (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 10.1261/rna.053959.115.
RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing 7126 genes.
R enables to download data directly from the Web. Load the counts table containing one row per gene and one column per sample.
# Load the files content in an R data.frame
path.counts <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt"
counts <- read.table(file=path.counts, sep="\t", header=TRUE, row.names=1)
path.expDesign <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/expDesign.txt"
expDesign <- read.table(file=path.expDesign, sep="\t", header=TRUE)
# look at the beginning of the counts and design table:
print(counts[1:4,1:4])
WT1 WT2 WT3 WT4
15s_rrna 2 12 31 8
21s_rrna 20 76 101 99
hra1 3 2 2 2
icr1 75 123 107 157
print(expDesign[1:4,])
label strain
1 WT1 WT
2 WT2 WT
3 WT3 WT
4 WT4 WT
# dimension of each table
dim(counts)
[1] 7126 96
dim(expDesign)
[1] 96 2
View(counts)
View(expDesign)
Be careful to the reference level in the factor variables:
print(expDesign$strain)
[1] WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
[75] Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
Levels: Snf WT
expDesign$strain <- relevel(expDesign$strain, "WT")
print(expDesign$strain)
[1] WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT WT Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
[75] Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
Levels: WT Snf
Note the change: the first level is not WT. This genotype will thus be considered as reference for the differential analysis.
barplot(colSums(counts)/1000000, main="Total number of reads per sample (million)")
prop.null <- apply(counts, 2, function(x) 100*mean(x==0))
print(head(prop.null))
WT1 WT2 WT3 WT4 WT5 WT6
9.219759 7.269155 7.367387 7.072692 7.367387 6.048274
barplot(prop.null, main="Percentage of null counts per sample", las=1)
# Load the DESeq2 R package
library(DESeq2)
# Build the DESeq2 main object with the count data + experimental design
dds0 <- DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ strain)
# Print a summary of the data content
print(dds0)
class: DESeqDataSet
dim: 7126 96
metadata(1): version
assays(1): counts
rownames(7126): 15s_rrna 21s_rrna ... ty(gua)o ty(gua)q
rowData names(0):
colnames(96): WT1 WT2 ... Snf47 Snf48
colData names(2): label strain
# Detect differentially expressed genes
dds0 <- DESeq(dds0)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 10 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
res0 <- results(dds0)
print(res0)
log2 fold change (MAP): strain Snf vs WT
Wald test p-value: strain Snf vs WT
DataFrame with 7126 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 18.336648 0.14533034 0.29148181 0.4985914 6.180672e-01 6.737305e-01
21s_rrna 107.325458 -0.08431727 0.25187200 -0.3347624 7.378043e-01 7.824543e-01
hra1 2.526211 -0.74324768 0.20824687 -3.5690701 3.582505e-04 6.503087e-04
icr1 141.574248 0.21494348 0.03695485 5.8163804 6.013555e-09 1.626683e-08
lsr1 207.526479 -0.13222494 0.15297933 -0.8643321 3.874055e-01 4.497744e-01
... ... ... ... ... ... ...
ty(gua)j2 0.1433690 -0.09418330 0.4293068 -0.2193846 0.8263505 0.8592897
ty(gua)m1 0.3670378 -0.24446546 0.3635025 -0.6725277 0.5012478 0.5634232
ty(gua)m2 0.1079545 -0.14671918 0.4051056 -0.3621752 0.7172211 0.7635234
ty(gua)o 0.1136899 -0.05217501 0.4147428 -0.1258009 0.8998895 0.9210194
ty(gua)q 0.0000000 NA NA NA NA NA
print(summary(res0))
out of 6887 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2613, 38%
LFC < 0 (down) : 2522, 37%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
NULL
print(mcols(res0))
DataFrame with 6 rows and 2 columns
type description
<character> <character>
1 intermediate mean of normalized counts for all samples
2 results log2 fold change (MAP): strain Snf vs WT
3 results standard error: strain Snf vs WT
4 results Wald statistic: strain Snf vs WT
5 results Wald test p-value: strain Snf vs WT
6 results BH adjusted p-values
Question: how robust is the analysis to the number of samples?
The original study contained 48 replicates per genotype, what happens if we select a smaller number?
Test: each attendee of this course could select a given number (e.g. 3,4,5,10,15,20) and adapt the code on the next slide to run the analysis with that number of replicates per genotype. We then compare the results (number of genes, significance, …).
nb.replicates <- 4 ## Each attendee chooses a number (3,4,5,10,15 or 20)
## Random sampling of the WT replicates (columns 1 to 48)
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)
## Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)
print(c(samples.WT, samples.Snf2))
[1] 9 30 23 1 68 60 74 64
dds <- DESeqDataSetFromMatrix(
countData = counts[,c(samples.WT, samples.Snf2)],
colData = expDesign[c(samples.WT, samples.Snf2),],
design = ~ strain)
print(dds)
class: DESeqDataSet
dim: 7126 8
metadata(1): version
assays(1): counts
rownames(7126): 15s_rrna 21s_rrna ... ty(gua)o ty(gua)q
rowData names(0):
colnames(8): WT9 WT30 ... Snf26 Snf16
colData names(2): label strain
We now perform a differential analysis with DESeq2 step by step with some quality controls.
dds <- estimateSizeFactors(dds)
print(sizeFactors(dds))
WT9 WT30 WT23 WT1 Snf20 Snf12 Snf26 Snf16
0.8876435 1.2624169 0.8726911 0.5842733 1.1247511 0.9706899 1.3155944 1.2363773
# effect of the normalization
par(mfrow=c(1,2))
boxplot(log2(counts(dds, normalized=FALSE)+1), main="Raw counts", col=rep(c("lightblue","orange"), each=nb.replicates))
boxplot(log2(counts(dds, normalized=TRUE)+1), main="Normalized counts", col=rep(c("lightblue","orange"), each=nb.replicates))
mean.counts <- rowMeans(counts(dds))
variance.counts <- apply(counts(dds), 1, var)
plot(x=mean.counts, y=variance.counts, pch=16, cex=0.3, main="Mean-variance relationship", log="xy")
abline(a=0, b=1)
We observe over-dispersion in the data: the Poisson distribution is not adapted and we prefer the Negative-Binommial distribution.
# dispersions estimation
dds <- estimateDispersions(dds)
# make the data homoscedastic
rld <- rlog(dds) # alternative to the "Variance Stabilizing Transformation"
plotPCA(rld, intgroup="strain") # function from the DESeq2 package
dds <- nbinomWaldTest(dds)
res.DESeq2 <- results(dds, alpha=0.05, pAdjustMethod="BH")
print(head(res.DESeq2))
log2 fold change (MAP): strain Snf vs WT
Wald test p-value: strain Snf vs WT
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 7.185776 -0.75416098 0.4018401 -1.8767688 0.06054979 0.13454143
21s_rrna 51.015976 -0.82018629 0.3301773 -2.4840785 0.01298872 0.03845082
hra1 2.645990 -0.85128415 0.4361491 -1.9518190 0.05095969 0.11715733
icr1 144.123121 0.16208004 0.1394587 1.1622085 0.24515079 0.38638024
lsr1 159.760490 -0.42144241 0.2540594 -1.6588341 0.09714922 0.19277177
nme1 29.166050 0.09574963 0.2798070 0.3421988 0.73220128 0.82741283
summary(res.DESeq2, alpha=0.05)
out of 6831 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 1053, 15%
LFC < 0 (down) : 1276, 19%
outliers [1] : 0, 0%
low counts [2] : 265, 3.9%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
hist(res.DESeq2$pvalue, col="lightblue", main="Histogram of raw P-values (DESeq2)", breaks=20, xlab="P-value")
plotMA(res.DESeq2, alpha=0.05) # function from the DESeq2 package
Here we need to build the plot using R base functions:
## Draw a volcano plot
plot(x=res.DESeq2$log2FoldChange, y=-log10(res.DESeq2$padj),
xlab="log2(Fold-Change)", ylab="-log10(adjusted P-value",
col=ifelse(res.DESeq2$padj<=0.05, "red", "black"), main="Volcano plot")
grid() ## Add a grid to the plot
abline(v=0) ## plot the X axis
# load the edgeR R package
library(edgeR)
# create the edgeR main object: dge
dge <- DGEList(counts=counts[,c(samples.WT, samples.Snf2)], remove.zeros=FALSE)
dge$design <- model.matrix(~ strain, data=expDesign[c(samples.WT, samples.Snf2),])
print(dge)
An object of class "DGEList"
$counts
WT9 WT30 WT23 WT1 Snf20 Snf12 Snf26 Snf16
15s_rrna 7 13 15 2 5 4 8 5
21s_rrna 57 61 109 20 29 23 74 38
hra1 4 3 4 3 1 2 0 2
icr1 119 186 115 75 157 162 215 175
lsr1 211 177 226 60 144 178 153 137
7121 more rows ...
$samples
group lib.size norm.factors
WT9 1 8855415 1
WT30 1 11556151 1
WT23 1 7640884 1
WT1 1 5964116 1
Snf20 1 9107888 1
Snf12 1 7401794 1
Snf26 1 9812933 1
Snf16 1 9073818 1
$design
(Intercept) strainSnf
9 1 0
30 1 0
23 1 0
1 1 0
68 1 1
60 1 1
74 1 1
64 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$strain
[1] "contr.treatment"
# normalization
dge <- calcNormFactors(dge)
print(dge$samples$norm.factors)
[1] 0.8617310 0.9259459 0.9708987 0.8406503 1.0499755 1.1220594 1.1328310 1.1505176
# dispersions
dge <- estimateGLMCommonDisp(dge, dge$design)
dge <- estimateGLMTrendedDisp(dge, dge$design)
dge <- estimateGLMTagwiseDisp(dge, dge$design)
fit <- glmFit(dge, dge$design)
print(dge$design)
(Intercept) strainSnf
9 1 0
30 1 0
23 1 0
1 1 0
68 1 1
60 1 1
74 1 1
64 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$strain
[1] "contr.treatment"
lrt <- glmLRT(fit)
res.edgeR <- topTags(lrt,n=nrow(dge$counts),adjust.method="BH")$table
print(head(res.edgeR))
logFC logCPM LR PValue FDR
yor290c -8.191709 6.616968 1472.9374 2.964394e-322 2.112427e-318
ygr234w -4.212186 8.492503 765.4342 1.768090e-168 6.299706e-165
yhr215w -4.319150 7.242903 739.9788 6.058869e-163 1.439183e-159
yar071w -4.132219 7.587590 667.7889 3.023081e-147 5.385619e-144
yer011w -4.311996 7.617404 649.3290 3.126342e-143 4.455662e-140
yml123c -4.749141 9.302036 619.4639 9.780380e-137 1.161583e-133
hist(res.edgeR$PValue, col="lightblue", main="Histogram of raw P-values (edgeR)", breaks=20, xlab="P-value")
plot(x=sizeFactors(dds), y=dge$samples$norm.factors, xlab="DESeq2 size factors", ylab="edgeR normalization factors", pch=16)
The normalization/size factors computed by DESeq2 and edgeR are not comparable as they are used in a different manner in the statistical/mathematical models.
print(head(res.DESeq2))
log2 fold change (MAP): strain Snf vs WT
Wald test p-value: strain Snf vs WT
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna 7.185776 -0.75416098 0.4018401 -1.8767688 0.06054979 0.13454143
21s_rrna 51.015976 -0.82018629 0.3301773 -2.4840785 0.01298872 0.03845082
hra1 2.645990 -0.85128415 0.4361491 -1.9518190 0.05095969 0.11715733
icr1 144.123121 0.16208004 0.1394587 1.1622085 0.24515079 0.38638024
lsr1 159.760490 -0.42144241 0.2540594 -1.6588341 0.09714922 0.19277177
nme1 29.166050 0.09574963 0.2798070 0.3421988 0.73220128 0.82741283
print(head(res.edgeR))
logFC logCPM LR PValue FDR
yor290c -8.191709 6.616968 1472.9374 2.964394e-322 2.112427e-318
ygr234w -4.212186 8.492503 765.4342 1.768090e-168 6.299706e-165
yhr215w -4.319150 7.242903 739.9788 6.058869e-163 1.439183e-159
yar071w -4.132219 7.587590 667.7889 3.023081e-147 5.385619e-144
yer011w -4.311996 7.617404 649.3290 3.126342e-143 4.455662e-140
yml123c -4.749141 9.302036 619.4639 9.780380e-137 1.161583e-133
res.edgeR <- res.edgeR[order(rownames(res.edgeR)),]
res.DESeq2 <- res.DESeq2[order(rownames(res.DESeq2)),]
plot(x=res.edgeR$logFC, y=res.DESeq2$log2FoldChange,
pch=16, cex=0.4, xlim=c(-2,2), ylim=c(-2,2),
xlab="edgeR log2(Fold-Change)", ylab="DESeq2 log2(Fold-Change)")
abline(a=0, b=1, col="red", lwd=4) # draw the y=x curve (y=a+b*x with a=0 and b=1)
abline(h=0, v=0) # horizontal and vertical line
plot(x=res.edgeR$PValue, y=res.DESeq2$pvalue,
pch=16, cex=0.4, xlab="edgeR raw P-value", ylab="DESeq2 raw P-value")
abline(a=0, b=1, col="red", lwd=4) # draw the y=x curve (y=a+b*x with a=0 and b=1)
# remember the number of replicates
print(nb.replicates)
[1] 4
# DESeq2
sum(res.DESeq2$padj <= 0.05, na.rm=TRUE)
[1] 2329
# edgeR
sum(res.edgeR$FDR <= 0.05, na.rm=TRUE)
[1] 2180
What’s the behaviour of the number of differentially expressed genes according to the number of samples?
library(gplots)
venn(list(DESeq2=rownames(res.DESeq2[which(res.DESeq2$padj <= 0.05),]),
edgeR=rownames(res.edgeR[which(res.edgeR$FDR <= 0.05),])))
Supplementary exercise: do the same plot for up- and down-regulated genes separately.
DESeq2.genes <- rownames(res.DESeq2[which(res.DESeq2$padj <= 0.05),])
edgeR.genes <- rownames(res.edgeR[which(res.edgeR$FDR <= 0.05),])
# select a DESeq2 specific gene
spe.DESeq2 <- setdiff(DESeq2.genes, edgeR.genes)
summary(res.edgeR[spe.DESeq2,"FDR"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05015 0.05863 0.07199 0.08185 0.09347 0.30840
# select a edgeR specific gene
spe.edgeR <- setdiff(edgeR.genes, DESeq2.genes)
summary(res.DESeq2[spe.edgeR,"padj"])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.05001 0.05558 0.06183 0.07474 0.08369 0.20590 3
# plotCounts is a function from the DESeq2 R package
plotCounts(dds, gene="ycr017c", intgroup="strain", normalized=TRUE)
Question: how reliable are the genes declared “positive”, i.e. differentially expressed?
In principle, a good DEG detection program should return a negative answer when the dataset contains no differentially expressed gene.
Test: run DESeq2 and edgeR on a tricky dataset, built by selecting two subsets of replicates in the same genotype. Since the two sample sets are obtained in the same conditions, the analysis should return no DEG gene.
This is a negative control: we build an experiment whose result is expected to be negative.
Here we perform a differential analysis in which we compare N WT samples vs N other WT samples.
nb.replicates <- 10
samples.WT <- sample(1:48, size=2*nb.replicates, replace=FALSE)
print(samples.WT)
[1] 30 15 41 22 10 20 23 28 13 19 47 6 7 1 43 8 32 33 2 24
counts.H0 <- counts[,samples.WT]
expDesign.H0 <- expDesign[samples.WT,]
# add a fictive condition factor
expDesign.H0$condition <- factor(rep(c("A","B"), each=nb.replicates))
print(expDesign.H0)
label strain condition
30 WT30 WT A
15 WT15 WT A
41 WT41 WT A
22 WT22 WT A
10 WT10 WT A
20 WT20 WT A
23 WT23 WT A
28 WT28 WT A
13 WT13 WT A
19 WT19 WT A
47 WT47 WT B
6 WT6 WT B
7 WT7 WT B
1 WT1 WT B
43 WT43 WT B
8 WT8 WT B
32 WT32 WT B
33 WT33 WT B
2 WT2 WT B
24 WT24 WT B
dds.H0 <- DESeqDataSetFromMatrix(countData = counts.H0, colData = expDesign.H0, design = ~ condition)
dds.H0 <- DESeq(dds.H0)
res.H0 <- results(dds.H0)
summary(res.H0)
out of 6870 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 0, 0%
LFC < 0 (down) : 0, 0%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Here are the details of the R packages used to generate this document:
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] gplots_3.0.1 edgeR_3.14.0 limma_3.28.21 DESeq2_1.12.4 SummarizedExperiment_1.2.3 Biobase_2.32.0 GenomicRanges_1.24.3 GenomeInfoDb_1.8.7 IRanges_2.6.1 S4Vectors_0.10.3
[11] BiocGenerics_0.18.0
loaded via a namespace (and not attached):
[1] genefilter_1.54.2 gtools_3.5.0 locfit_1.5-9.1 splines_3.3.2 lattice_0.20-34 colorspace_1.3-2 htmltools_0.3.5 yaml_2.1.14 survival_2.40-1 XML_3.98-1.5 foreign_0.8-67 DBI_0.5-1 BiocParallel_1.6.6 RColorBrewer_1.1-2
[15] plyr_1.8.4 stringr_1.1.0 zlibbioc_1.18.0 munsell_0.4.3 gtable_0.2.0 caTools_1.17.1 evaluate_0.10 memoise_1.0.0 labeling_0.3 latticeExtra_0.6-28 knitr_1.15.1 geneplotter_1.50.0 AnnotationDbi_1.34.4 htmlTable_1.7
[29] Rcpp_0.12.8 KernSmooth_2.23-15 acepack_1.4.1 xtable_1.8-2 openssl_0.9.5 scales_0.4.1 backports_1.0.4 gdata_2.17.0 base64_2.0 Hmisc_4.0-1 annotate_1.50.1 XVector_0.12.1 gridExtra_2.2.1 ggplot2_2.2.0
[43] digest_0.6.10 stringi_1.1.2 grid_3.3.2 rprojroot_1.1 tools_3.3.2 bitops_1.0-6 magrittr_1.5 lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.2 RSQLite_1.1-1 Formula_1.2-1 cluster_2.0.5 Matrix_1.2-7.1
[57] data.table_1.10.0 assertthat_0.1 rmarkdown_1.3 rpart_4.1-10 nnet_7.3-12