Introduction
In this document we show the solution of the exercise “From TF binding sites to position-specific weight matrices” using the R language. We display both the code and the results. This document was generated with R markdown from_TFBS_to_PSSM_solutions.Rmd.
Site sequences
## Encode the sequences in a vector
sequences <- c( "AAAAACCG",
"AAAACCGG",
"AATTGGGG",
"ATTGTGGG")
print(as.data.frame(sequences))
sequences
1 AAAAACCG
2 AAAACCGG
3 AATTGGGG
4 ATTGTGGG
Alignment matrix
## Build an alignment matrix, with one row per site
## and one column per aligned position
alignment <- t(as.data.frame(strsplit(sequences, "")))
colnames(alignment) <- 1:ncol(alignment)
row.names(alignment) <- paste(sep="_", "site", 1:nrow(alignment))
kable(alignment, align="c")
site_1 |
A |
A |
A |
A |
A |
C |
C |
G |
site_2 |
A |
A |
A |
A |
C |
C |
G |
G |
site_3 |
A |
A |
T |
T |
G |
G |
G |
G |
site_4 |
A |
T |
T |
G |
T |
G |
G |
G |
1. Count matrix
Build a position-specific scoring matrix indicating the residue counts per position (count matrix).
## Compute a count matrix from the alignment table
counts <- data.frame()
residues <- c("A", "C", "G", "T")
for (r in residues) {
counts <- rbind(counts, apply(alignment==r, 2, sum))
}
colnames(counts) <- 1:ncol(counts)
row.names(counts) <- residues
kable(counts, align="c")
A |
4 |
3 |
2 |
2 |
1 |
0 |
0 |
0 |
C |
0 |
0 |
0 |
0 |
1 |
2 |
1 |
0 |
G |
0 |
0 |
0 |
1 |
1 |
2 |
3 |
4 |
T |
0 |
1 |
2 |
1 |
1 |
0 |
0 |
0 |
2. Raw frequencies (pseudo-weight = 0)
From this count matrix, derive a frequency matrix.
freq <- counts / apply(counts, 2, sum)
kable(freq, align="c")
A |
1 |
0.75 |
0.5 |
0.50 |
0.25 |
0.0 |
0.00 |
0 |
C |
0 |
0.00 |
0.0 |
0.00 |
0.25 |
0.5 |
0.25 |
0 |
G |
0 |
0.00 |
0.0 |
0.25 |
0.25 |
0.5 |
0.75 |
1 |
T |
0 |
0.25 |
0.5 |
0.25 |
0.25 |
0.0 |
0.00 |
0 |
3. Corrected frequencies
Compute a pseudo-weight smoothed frequency matrix, with pseudo-weights of 1, 10 and 100, respectively.
Equal prior probabilities
In this exercise we use equiprobable nucleotides, prior residue frequencies are thus set to \(p=0.25\) for each residue.
equal.priors <- vector()
equal.priors[residues] <- 1/length(residues)
kable(data.frame(equal.priors), align="c")
A |
0.25 |
C |
0.25 |
G |
0.25 |
T |
0.25 |
Pseudo-weight = 1
k <- 1 ## Arbitrary value for the pseudo-weight
## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
kable(corrected.freq, align="c", digits=3)
A |
0.85 |
0.65 |
0.45 |
0.45 |
0.25 |
0.05 |
0.05 |
0.05 |
C |
0.05 |
0.05 |
0.05 |
0.05 |
0.25 |
0.45 |
0.25 |
0.05 |
G |
0.05 |
0.05 |
0.05 |
0.25 |
0.25 |
0.45 |
0.65 |
0.85 |
T |
0.05 |
0.25 |
0.45 |
0.25 |
0.25 |
0.05 |
0.05 |
0.05 |
Pseudo-weight = 10
k <- 10 ## Arbitrary value for the pseudo-weight
## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
kable(corrected.freq, align="c", digits=3)
A |
0.464 |
0.393 |
0.321 |
0.321 |
0.25 |
0.179 |
0.179 |
0.179 |
C |
0.179 |
0.179 |
0.179 |
0.179 |
0.25 |
0.321 |
0.250 |
0.179 |
G |
0.179 |
0.179 |
0.179 |
0.250 |
0.25 |
0.321 |
0.393 |
0.464 |
T |
0.179 |
0.250 |
0.321 |
0.250 |
0.25 |
0.179 |
0.179 |
0.179 |
Pseudo-weight = 100
k <- 100 ## Arbitrary value for the pseudo-weight
## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
kable(corrected.freq, align="c", digits=3)
A |
0.279 |
0.269 |
0.26 |
0.26 |
0.25 |
0.24 |
0.240 |
0.240 |
C |
0.240 |
0.240 |
0.24 |
0.24 |
0.25 |
0.26 |
0.250 |
0.240 |
G |
0.240 |
0.240 |
0.24 |
0.25 |
0.25 |
0.26 |
0.269 |
0.279 |
T |
0.240 |
0.250 |
0.26 |
0.25 |
0.25 |
0.24 |
0.240 |
0.240 |
4. Weight matrices, no pseudo-weight, equal priors
Compute the weight matrices corresponding to the frequency matrices with no pseudo-weight, assuming equiprobable residues.
## Compute weights
weights <- log(freq / 0.25)
kable(weights, align="c", digits=2)
A |
1.39 |
1.1 |
0.69 |
0.69 |
0 |
-Inf |
-Inf |
-Inf |
C |
-Inf |
-Inf |
-Inf |
-Inf |
0 |
0.69 |
0.0 |
-Inf |
G |
-Inf |
-Inf |
-Inf |
0.00 |
0 |
0.69 |
1.1 |
1.39 |
T |
-Inf |
0.0 |
0.69 |
0.00 |
0 |
-Inf |
-Inf |
-Inf |
5. Weight matrix with equal priors and pseudo-weight=1
k <- 1 ## Arbitrary value for the pseudo-weight
## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
weights <- log(corrected.freq / 0.25)
kable(weights, align="c", digits=2)
A |
1.22 |
0.96 |
0.59 |
0.59 |
0 |
-1.61 |
-1.61 |
-1.61 |
C |
-1.61 |
-1.61 |
-1.61 |
-1.61 |
0 |
0.59 |
0.00 |
-1.61 |
G |
-1.61 |
-1.61 |
-1.61 |
0.00 |
0 |
0.59 |
0.96 |
1.22 |
T |
-1.61 |
0.00 |
0.59 |
0.00 |
0 |
-1.61 |
-1.61 |
-1.61 |
6. Unequal priors
Compute the weight matrices with a pseudo-weight=1 and the following prior probabilities: A=0.3, T=0.3, C=0.2, G=0.2.
## Define nucleotide-specific residue priors
unequal.priors <- c("A"=0.3, "C"=0.2, "G"=0.2, "T"=0.3)
kable(data.frame(unequal.priors))
## Arbitrary value for the pseudo-weight
k <- 1
## Compute corrected frequencies
corrected.freq <- (counts + k*unequal.priors) / apply(counts + k*unequal.priors, 2, sum)
kable(corrected.freq, align="c", digits=2,
caption=paste("Frequency matrix with unequal priors and pseudo-weight=",k))
Frequency matrix with unequal priors and pseudo-weight= 1
A |
0.86 |
0.66 |
0.46 |
0.46 |
0.26 |
0.06 |
0.06 |
0.06 |
C |
0.04 |
0.04 |
0.04 |
0.04 |
0.24 |
0.44 |
0.24 |
0.04 |
G |
0.04 |
0.04 |
0.04 |
0.24 |
0.24 |
0.44 |
0.64 |
0.84 |
T |
0.06 |
0.26 |
0.46 |
0.26 |
0.26 |
0.06 |
0.06 |
0.06 |
weights <- log(corrected.freq / unequal.priors)
kable(weights, align="c", digits=2, caption = paste("Weight matrix with unequal priors and pseudo-weight =",k))
Weight matrix with unequal priors and pseudo-weight = 1
A |
1.05 |
0.79 |
0.43 |
0.43 |
-0.14 |
-1.61 |
-1.61 |
-1.61 |
C |
-1.61 |
-1.61 |
-1.61 |
-1.61 |
0.18 |
0.79 |
0.18 |
-1.61 |
G |
-1.61 |
-1.61 |
-1.61 |
0.18 |
0.18 |
0.79 |
1.16 |
1.44 |
T |
-1.61 |
-0.14 |
0.43 |
-0.14 |
-0.14 |
-1.61 |
-1.61 |
-1.61 |
7. Interpretation of the results.