Practical: exploring RNA-Seq counts

Hugo Varet, Julie Aubert and Jacques van Helden

2017-01-08

Requirements

For people using their own laptop, install some R packages:

source("http://bioconductor.org/biocLite.R")
if (!require(DESeq2)) {
  biocLite("DESeq2", ask=FALSE)
}
if (!require(edgeR)) {
  biocLite("edgeR", ask=FALSE)
}
if (!require(gplots)) {
  biocLite("gplots", ask=FALSE)
}

Context

Loading a data table

R enables to download data directly from the Web. Load the counts table containing one row per gene and one column per sample.

# Load the files content in an R data.frame
path.counts <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/counts.txt"
counts <- read.table(file=path.counts, sep="\t", header=TRUE, row.names=1)

path.expDesign <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/expDesign.txt"
expDesign <- read.table(file=path.expDesign, sep="\t", header=TRUE)

Checking the content of the count tables

# look at the beginning of the counts and design table:
print(counts[1:4,1:4])
         WT1 WT2 WT3 WT4
15s_rrna   2  12  31   8
21s_rrna  20  76 101  99
hra1       3   2   2   2
icr1      75 123 107 157
print(expDesign[1:4,])
  label strain
1   WT1     WT
2   WT2     WT
3   WT3     WT
4   WT4     WT
# dimension of each table
dim(counts)
[1] 7126   96
dim(expDesign)
[1] 96  2
View(counts)
View(expDesign)

Factors and levels in R

Be careful to the reference level in the factor variables:

print(expDesign$strain)
 [1] WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
[75] Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
Levels: Snf WT
expDesign$strain <- relevel(expDesign$strain, "WT")
print(expDesign$strain)
 [1] WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  WT  Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
[75] Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf Snf
Levels: WT Snf

Note the change: the first level is not WT. This genotype will thus be considered as reference for the differential analysis.

Basic description of the data: number of reads per sample

barplot(colSums(counts)/1000000, main="Total number of reads per sample (million)")

Basic description of the data: percentage of null counts per sample

prop.null <- apply(counts, 2, function(x) 100*mean(x==0))
print(head(prop.null))
     WT1      WT2      WT3      WT4      WT5      WT6 
9.219759 7.269155 7.367387 7.072692 7.367387 6.048274 
barplot(prop.null, main="Percentage of null counts per sample", las=1)

Differential analysis with DESeq2

# Load the DESeq2 R package
library(DESeq2)

# Build the DESeq2 main object with the count data + experimental design
dds0 <- DESeqDataSetFromMatrix(countData = counts, colData = expDesign, design = ~ strain)

# Print a summary of the data content
print(dds0)
class: DESeqDataSet 
dim: 7126 96 
metadata(1): version
assays(1): counts
rownames(7126): 15s_rrna 21s_rrna ... ty(gua)o ty(gua)q
rowData names(0):
colnames(96): WT1 WT2 ... Snf47 Snf48
colData names(2): label strain

Get the results using two command lines

# Detect differentially expressed genes
dds0 <- DESeq(dds0) 
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 10 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
res0 <- results(dds0)
print(res0)
log2 fold change (MAP): strain Snf vs WT 
Wald test p-value: strain Snf vs WT 
DataFrame with 7126 rows and 6 columns
            baseMean log2FoldChange      lfcSE       stat       pvalue         padj
           <numeric>      <numeric>  <numeric>  <numeric>    <numeric>    <numeric>
15s_rrna   18.336648     0.14533034 0.29148181  0.4985914 6.180672e-01 6.737305e-01
21s_rrna  107.325458    -0.08431727 0.25187200 -0.3347624 7.378043e-01 7.824543e-01
hra1        2.526211    -0.74324768 0.20824687 -3.5690701 3.582505e-04 6.503087e-04
icr1      141.574248     0.21494348 0.03695485  5.8163804 6.013555e-09 1.626683e-08
lsr1      207.526479    -0.13222494 0.15297933 -0.8643321 3.874055e-01 4.497744e-01
...              ...            ...        ...        ...          ...          ...
ty(gua)j2  0.1433690    -0.09418330  0.4293068 -0.2193846    0.8263505    0.8592897
ty(gua)m1  0.3670378    -0.24446546  0.3635025 -0.6725277    0.5012478    0.5634232
ty(gua)m2  0.1079545    -0.14671918  0.4051056 -0.3621752    0.7172211    0.7635234
ty(gua)o   0.1136899    -0.05217501  0.4147428 -0.1258009    0.8998895    0.9210194
ty(gua)q   0.0000000             NA         NA         NA           NA           NA
print(summary(res0))

out of 6887 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 2613, 38% 
LFC < 0 (down)   : 2522, 37% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

NULL
print(mcols(res0))
DataFrame with 6 rows and 2 columns
          type                               description
   <character>                               <character>
1 intermediate mean of normalized counts for all samples
2      results  log2 fold change (MAP): strain Snf vs WT
3      results          standard error: strain Snf vs WT
4      results          Wald statistic: strain Snf vs WT
5      results       Wald test p-value: strain Snf vs WT
6      results                      BH adjusted p-values

How many replicates are required?

Question: how robust is the analysis to the number of samples?

The original study contained 48 replicates per genotype, what happens if we select a smaller number?

Test: each attendee of this course could select a given number (e.g. 3,4,5,10,15,20) and adapt the code on the next slide to run the analysis with that number of replicates per genotype. We then compare the results (number of genes, significance, …).

Sub-sampling: analysis using a few replicates

nb.replicates <- 4 ## Each attendee chooses a number (3,4,5,10,15 or 20)
## Random sampling of the WT replicates (columns 1 to 48)
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)
## Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)
print(c(samples.WT, samples.Snf2))
[1]  9 30 23  1 68 60 74 64
dds <- DESeqDataSetFromMatrix(
  countData = counts[,c(samples.WT, samples.Snf2)], 
  colData = expDesign[c(samples.WT, samples.Snf2),], 
  design = ~ strain)
print(dds)
class: DESeqDataSet 
dim: 7126 8 
metadata(1): version
assays(1): counts
rownames(7126): 15s_rrna 21s_rrna ... ty(gua)o ty(gua)q
rowData names(0):
colnames(8): WT9 WT30 ... Snf26 Snf16
colData names(2): label strain

We now perform a differential analysis with DESeq2 step by step with some quality controls.

Normalization

dds <- estimateSizeFactors(dds)
print(sizeFactors(dds))
      WT9      WT30      WT23       WT1     Snf20     Snf12     Snf26     Snf16 
0.8876435 1.2624169 0.8726911 0.5842733 1.1247511 0.9706899 1.3155944 1.2363773 
# effect of the normalization
par(mfrow=c(1,2))
boxplot(log2(counts(dds, normalized=FALSE)+1), main="Raw counts", col=rep(c("lightblue","orange"), each=nb.replicates))
boxplot(log2(counts(dds, normalized=TRUE)+1), main="Normalized counts", col=rep(c("lightblue","orange"), each=nb.replicates))

Mean-variance relationship

mean.counts <- rowMeans(counts(dds))
variance.counts <- apply(counts(dds), 1, var)
plot(x=mean.counts, y=variance.counts, pch=16, cex=0.3, main="Mean-variance relationship", log="xy")
abline(a=0, b=1)

We observe over-dispersion in the data: the Poisson distribution is not adapted and we prefer the Negative-Binommial distribution.

Principal Component Analysis (PCA)

# dispersions estimation
dds <- estimateDispersions(dds)
# make the data homoscedastic
rld <- rlog(dds) # alternative to the "Variance Stabilizing Transformation"
plotPCA(rld, intgroup="strain") # function from the DESeq2 package

Statistical test for each gene

dds <- nbinomWaldTest(dds)
res.DESeq2 <- results(dds, alpha=0.05, pAdjustMethod="BH")
print(head(res.DESeq2))
log2 fold change (MAP): strain Snf vs WT 
Wald test p-value: strain Snf vs WT 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE       stat     pvalue       padj
          <numeric>      <numeric> <numeric>  <numeric>  <numeric>  <numeric>
15s_rrna   7.185776    -0.75416098 0.4018401 -1.8767688 0.06054979 0.13454143
21s_rrna  51.015976    -0.82018629 0.3301773 -2.4840785 0.01298872 0.03845082
hra1       2.645990    -0.85128415 0.4361491 -1.9518190 0.05095969 0.11715733
icr1     144.123121     0.16208004 0.1394587  1.1622085 0.24515079 0.38638024
lsr1     159.760490    -0.42144241 0.2540594 -1.6588341 0.09714922 0.19277177
nme1      29.166050     0.09574963 0.2798070  0.3421988 0.73220128 0.82741283
summary(res.DESeq2, alpha=0.05)

out of 6831 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 1053, 15% 
LFC < 0 (down)   : 1276, 19% 
outliers [1]     : 0, 0% 
low counts [2]   : 265, 3.9% 
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Histogram of raw P-values

hist(res.DESeq2$pvalue, col="lightblue", main="Histogram of raw P-values (DESeq2)", breaks=20, xlab="P-value")

MA-plot

plotMA(res.DESeq2, alpha=0.05) # function from the DESeq2 package

Volcano-plot

Here we need to build the plot using R base functions:

## Draw a volcano plot
plot(x=res.DESeq2$log2FoldChange, y=-log10(res.DESeq2$padj), 
     xlab="log2(Fold-Change)", ylab="-log10(adjusted P-value",
     col=ifelse(res.DESeq2$padj<=0.05, "red", "black"), main="Volcano plot")
grid() ## Add a grid to the plot
abline(v=0) ## plot the X axis

Differential analysis using edgeR with a few replicates

# load the edgeR R package
library(edgeR)
# create the edgeR main object: dge
dge <- DGEList(counts=counts[,c(samples.WT, samples.Snf2)], remove.zeros=FALSE)
dge$design <- model.matrix(~ strain, data=expDesign[c(samples.WT, samples.Snf2),])
print(dge)
An object of class "DGEList"
$counts
         WT9 WT30 WT23 WT1 Snf20 Snf12 Snf26 Snf16
15s_rrna   7   13   15   2     5     4     8     5
21s_rrna  57   61  109  20    29    23    74    38
hra1       4    3    4   3     1     2     0     2
icr1     119  186  115  75   157   162   215   175
lsr1     211  177  226  60   144   178   153   137
7121 more rows ...

$samples
      group lib.size norm.factors
WT9       1  8855415            1
WT30      1 11556151            1
WT23      1  7640884            1
WT1       1  5964116            1
Snf20     1  9107888            1
Snf12     1  7401794            1
Snf26     1  9812933            1
Snf16     1  9073818            1

$design
   (Intercept) strainSnf
9            1         0
30           1         0
23           1         0
1            1         0
68           1         1
60           1         1
74           1         1
64           1         1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$strain
[1] "contr.treatment"

Normalization & dispersions estimation with edgeR

# normalization
dge <- calcNormFactors(dge)
print(dge$samples$norm.factors)
[1] 0.8617310 0.9259459 0.9708987 0.8406503 1.0499755 1.1220594 1.1328310 1.1505176
# dispersions
dge <- estimateGLMCommonDisp(dge, dge$design)
dge <- estimateGLMTrendedDisp(dge, dge$design)
dge <- estimateGLMTagwiseDisp(dge, dge$design)

Modeling and testing with edgeR

fit <- glmFit(dge, dge$design)
print(dge$design)
   (Intercept) strainSnf
9            1         0
30           1         0
23           1         0
1            1         0
68           1         1
60           1         1
74           1         1
64           1         1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$strain
[1] "contr.treatment"
lrt <- glmLRT(fit)
res.edgeR <- topTags(lrt,n=nrow(dge$counts),adjust.method="BH")$table
print(head(res.edgeR))
            logFC   logCPM        LR        PValue           FDR
yor290c -8.191709 6.616968 1472.9374 2.964394e-322 2.112427e-318
ygr234w -4.212186 8.492503  765.4342 1.768090e-168 6.299706e-165
yhr215w -4.319150 7.242903  739.9788 6.058869e-163 1.439183e-159
yar071w -4.132219 7.587590  667.7889 3.023081e-147 5.385619e-144
yer011w -4.311996 7.617404  649.3290 3.126342e-143 4.455662e-140
yml123c -4.749141 9.302036  619.4639 9.780380e-137 1.161583e-133

Histogram of raw P-values

hist(res.edgeR$PValue, col="lightblue", main="Histogram of raw P-values (edgeR)", breaks=20, xlab="P-value")

Compare DESeq2 and edgeR results: normalization factors

plot(x=sizeFactors(dds), y=dge$samples$norm.factors, xlab="DESeq2 size factors", ylab="edgeR normalization factors", pch=16)

The normalization/size factors computed by DESeq2 and edgeR are not comparable as they are used in a different manner in the statistical/mathematical models.

Re-order the results according to the gene names

print(head(res.DESeq2))
log2 fold change (MAP): strain Snf vs WT 
Wald test p-value: strain Snf vs WT 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE       stat     pvalue       padj
          <numeric>      <numeric> <numeric>  <numeric>  <numeric>  <numeric>
15s_rrna   7.185776    -0.75416098 0.4018401 -1.8767688 0.06054979 0.13454143
21s_rrna  51.015976    -0.82018629 0.3301773 -2.4840785 0.01298872 0.03845082
hra1       2.645990    -0.85128415 0.4361491 -1.9518190 0.05095969 0.11715733
icr1     144.123121     0.16208004 0.1394587  1.1622085 0.24515079 0.38638024
lsr1     159.760490    -0.42144241 0.2540594 -1.6588341 0.09714922 0.19277177
nme1      29.166050     0.09574963 0.2798070  0.3421988 0.73220128 0.82741283
print(head(res.edgeR))
            logFC   logCPM        LR        PValue           FDR
yor290c -8.191709 6.616968 1472.9374 2.964394e-322 2.112427e-318
ygr234w -4.212186 8.492503  765.4342 1.768090e-168 6.299706e-165
yhr215w -4.319150 7.242903  739.9788 6.058869e-163 1.439183e-159
yar071w -4.132219 7.587590  667.7889 3.023081e-147 5.385619e-144
yer011w -4.311996 7.617404  649.3290 3.126342e-143 4.455662e-140
yml123c -4.749141 9.302036  619.4639 9.780380e-137 1.161583e-133
res.edgeR <- res.edgeR[order(rownames(res.edgeR)),]
res.DESeq2 <- res.DESeq2[order(rownames(res.DESeq2)),]

Comparing log2(Fold-Change) estimations

plot(x=res.edgeR$logFC, y=res.DESeq2$log2FoldChange, 
     pch=16, cex=0.4, xlim=c(-2,2), ylim=c(-2,2),
     xlab="edgeR log2(Fold-Change)", ylab="DESeq2 log2(Fold-Change)")
abline(a=0, b=1, col="red", lwd=4) # draw the y=x curve (y=a+b*x with a=0 and b=1)
abline(h=0, v=0) # horizontal and vertical line

Comparing raw P-values

plot(x=res.edgeR$PValue, y=res.DESeq2$pvalue, 
     pch=16, cex=0.4, xlab="edgeR raw P-value", ylab="DESeq2 raw P-value")
abline(a=0, b=1, col="red", lwd=4) # draw the y=x curve (y=a+b*x with a=0 and b=1)

Number of differentially expressed genes

# remember the number of replicates
print(nb.replicates)
[1] 4
# DESeq2
sum(res.DESeq2$padj <= 0.05, na.rm=TRUE)
[1] 2329
# edgeR
sum(res.edgeR$FDR <= 0.05, na.rm=TRUE)
[1] 2180

What’s the behaviour of the number of differentially expressed genes according to the number of samples?

Venn diagram

library(gplots)
venn(list(DESeq2=rownames(res.DESeq2[which(res.DESeq2$padj <= 0.05),]), 
          edgeR=rownames(res.edgeR[which(res.edgeR$FDR <= 0.05),])))

Supplementary exercise: do the same plot for up- and down-regulated genes separately.

What’s edgeR or DESeq2-specific?

DESeq2.genes <- rownames(res.DESeq2[which(res.DESeq2$padj <= 0.05),])
edgeR.genes <- rownames(res.edgeR[which(res.edgeR$FDR <= 0.05),])
# select a DESeq2 specific gene
spe.DESeq2 <- setdiff(DESeq2.genes, edgeR.genes)
summary(res.edgeR[spe.DESeq2,"FDR"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05015 0.05863 0.07199 0.08185 0.09347 0.30840 
# select a edgeR specific gene
spe.edgeR <- setdiff(edgeR.genes, DESeq2.genes)
summary(res.DESeq2[spe.edgeR,"padj"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.05001 0.05558 0.06183 0.07474 0.08369 0.20590       3 

DESeq2 results for one gene

# plotCounts is a function from the DESeq2 R package
plotCounts(dds, gene="ycr017c", intgroup="strain", normalized=TRUE)

Negative control

Question: how reliable are the genes declared “positive”, i.e. differentially expressed?

In principle, a good DEG detection program should return a negative answer when the dataset contains no differentially expressed gene.

Test: run DESeq2 and edgeR on a tricky dataset, built by selecting two subsets of replicates in the same genotype. Since the two sample sets are obtained in the same conditions, the analysis should return no DEG gene.

This is a negative control: we build an experiment whose result is expected to be negative.

Differential analysis under \(H_0\)

Here we perform a differential analysis in which we compare N WT samples vs N other WT samples.

nb.replicates <- 10
samples.WT <- sample(1:48, size=2*nb.replicates, replace=FALSE)
print(samples.WT)
 [1] 30 15 41 22 10 20 23 28 13 19 47  6  7  1 43  8 32 33  2 24
counts.H0 <- counts[,samples.WT]
expDesign.H0 <- expDesign[samples.WT,]
# add a fictive condition factor
expDesign.H0$condition <- factor(rep(c("A","B"), each=nb.replicates))
print(expDesign.H0)
   label strain condition
30  WT30     WT         A
15  WT15     WT         A
41  WT41     WT         A
22  WT22     WT         A
10  WT10     WT         A
20  WT20     WT         A
23  WT23     WT         A
28  WT28     WT         A
13  WT13     WT         A
19  WT19     WT         A
47  WT47     WT         B
6    WT6     WT         B
7    WT7     WT         B
1    WT1     WT         B
43  WT43     WT         B
8    WT8     WT         B
32  WT32     WT         B
33  WT33     WT         B
2    WT2     WT         B
24  WT24     WT         B
dds.H0 <- DESeqDataSetFromMatrix(countData = counts.H0, colData = expDesign.H0, design = ~ condition)

Differential analysis with DESeq2 under \(H_0\)

dds.H0 <- DESeq(dds.H0)
res.H0 <- results(dds.H0)
summary(res.H0)

out of 6870 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 0, 0% 
LFC < 0 (down)   : 0, 0% 
outliers [1]     : 0, 0% 
low counts [2]   : 0, 0% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

sessionInfo

Here are the details of the R packages used to generate this document:

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gplots_3.0.1               edgeR_3.14.0               limma_3.28.21              DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0             GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1              S4Vectors_0.10.3          
[11] BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] genefilter_1.54.2    gtools_3.5.0         locfit_1.5-9.1       splines_3.3.2        lattice_0.20-34      colorspace_1.3-2     htmltools_0.3.5      yaml_2.1.14          survival_2.40-1      XML_3.98-1.5         foreign_0.8-67       DBI_0.5-1            BiocParallel_1.6.6   RColorBrewer_1.1-2  
[15] plyr_1.8.4           stringr_1.1.0        zlibbioc_1.18.0      munsell_0.4.3        gtable_0.2.0         caTools_1.17.1       evaluate_0.10        memoise_1.0.0        labeling_0.3         latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.50.0   AnnotationDbi_1.34.4 htmlTable_1.7       
[29] Rcpp_0.12.8          KernSmooth_2.23-15   acepack_1.4.1        xtable_1.8-2         openssl_0.9.5        scales_0.4.1         backports_1.0.4      gdata_2.17.0         base64_2.0           Hmisc_4.0-1          annotate_1.50.1      XVector_0.12.1       gridExtra_2.2.1      ggplot2_2.2.0       
[43] digest_0.6.10        stringi_1.1.2        grid_3.3.2           rprojroot_1.1        tools_3.3.2          bitops_1.0-6         magrittr_1.5         lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2           RSQLite_1.1-1        Formula_1.2-1        cluster_2.0.5        Matrix_1.2-7.1      
[57] data.table_1.10.0    assertthat_0.1       rmarkdown_1.3        rpart_4.1-10         nnet_7.3-12