Introduction

In this document we show the solution of the exercise “From TF binding sites to position-specific weight matrices” using the R language. We display both the code and the results. This document was generated with R markdown from_TFBS_to_PSSM_solutions.Rmd.

Site sequences

## Encode the sequences in a vector
sequences <- c( "AAAAACCG",
                "AAAACCGG",
                "AATTGGGG",
                "ATTGTGGG")
print(as.data.frame(sequences))
  sequences
1  AAAAACCG
2  AAAACCGG
3  AATTGGGG
4  ATTGTGGG

Alignment matrix

## Build an alignment matrix, with one row per site 
## and one column per aligned position
alignment <- t(as.data.frame(strsplit(sequences, "")))
colnames(alignment) <- 1:ncol(alignment)
row.names(alignment) <- paste(sep="_", "site", 1:nrow(alignment))

kable(alignment, align="c")
1 2 3 4 5 6 7 8
site_1 A A A A A C C G
site_2 A A A A C C G G
site_3 A A T T G G G G
site_4 A T T G T G G G

1. Count matrix

Build a position-specific scoring matrix indicating the residue counts per position (count matrix).

## Compute a count matrix from the alignment table
counts <- data.frame()
residues <- c("A", "C", "G", "T")
for (r in residues) {
  counts <- rbind(counts, apply(alignment==r, 2, sum))
}
colnames(counts) <- 1:ncol(counts)
row.names(counts) <- residues

kable(counts, align="c")
1 2 3 4 5 6 7 8
A 4 3 2 2 1 0 0 0
C 0 0 0 0 1 2 1 0
G 0 0 0 1 1 2 3 4
T 0 1 2 1 1 0 0 0

2. Raw frequencies (pseudo-weight = 0)

From this count matrix, derive a frequency matrix.

freq <- counts / apply(counts, 2, sum)
kable(freq, align="c")
1 2 3 4 5 6 7 8
A 1 0.75 0.5 0.50 0.25 0.0 0.00 0
C 0 0.00 0.0 0.00 0.25 0.5 0.25 0
G 0 0.00 0.0 0.25 0.25 0.5 0.75 1
T 0 0.25 0.5 0.25 0.25 0.0 0.00 0

3. Corrected frequencies

Compute a pseudo-weight smoothed frequency matrix, with pseudo-weights of 1, 10 and 100, respectively.

Equal prior probabilities

In this exercise we use equiprobable nucleotides, prior residue frequencies are thus set to \(p=0.25\) for each residue.

equal.priors <- vector()
equal.priors[residues] <- 1/length(residues)

kable(data.frame(equal.priors), align="c")
equal.priors
A 0.25
C 0.25
G 0.25
T 0.25

Pseudo-weight = 1

k <- 1 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)
1 2 3 4 5 6 7 8
A 0.85 0.65 0.45 0.45 0.25 0.05 0.05 0.05
C 0.05 0.05 0.05 0.05 0.25 0.45 0.25 0.05
G 0.05 0.05 0.05 0.25 0.25 0.45 0.65 0.85
T 0.05 0.25 0.45 0.25 0.25 0.05 0.05 0.05

Pseudo-weight = 10

k <- 10 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)
1 2 3 4 5 6 7 8
A 0.464 0.393 0.321 0.321 0.25 0.179 0.179 0.179
C 0.179 0.179 0.179 0.179 0.25 0.321 0.250 0.179
G 0.179 0.179 0.179 0.250 0.25 0.321 0.393 0.464
T 0.179 0.250 0.321 0.250 0.25 0.179 0.179 0.179

Pseudo-weight = 100

k <- 100 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)
1 2 3 4 5 6 7 8
A 0.279 0.269 0.26 0.26 0.25 0.24 0.240 0.240
C 0.240 0.240 0.24 0.24 0.25 0.26 0.250 0.240
G 0.240 0.240 0.24 0.25 0.25 0.26 0.269 0.279
T 0.240 0.250 0.26 0.25 0.25 0.24 0.240 0.240

4. Weight matrices, no pseudo-weight, equal priors

Compute the weight matrices corresponding to the frequency matrices with no pseudo-weight, assuming equiprobable residues.

## Compute weights
weights <- log(freq / 0.25)

kable(weights, align="c", digits=2)
1 2 3 4 5 6 7 8
A 1.39 1.1 0.69 0.69 0 -Inf -Inf -Inf
C -Inf -Inf -Inf -Inf 0 0.69 0.0 -Inf
G -Inf -Inf -Inf 0.00 0 0.69 1.1 1.39
T -Inf 0.0 0.69 0.00 0 -Inf -Inf -Inf

5. Weight matrix with equal priors and pseudo-weight=1

k <- 1 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
weights <- log(corrected.freq / 0.25)

kable(weights, align="c", digits=2)
1 2 3 4 5 6 7 8
A 1.22 0.96 0.59 0.59 0 -1.61 -1.61 -1.61
C -1.61 -1.61 -1.61 -1.61 0 0.59 0.00 -1.61
G -1.61 -1.61 -1.61 0.00 0 0.59 0.96 1.22
T -1.61 0.00 0.59 0.00 0 -1.61 -1.61 -1.61

6. Unequal priors

Compute the weight matrices with a pseudo-weight=1 and the following prior probabilities: A=0.3, T=0.3, C=0.2, G=0.2.

## Define nucleotide-specific residue priors
unequal.priors <- c("A"=0.3, "C"=0.2, "G"=0.2, "T"=0.3)
kable(data.frame(unequal.priors))
unequal.priors
A 0.3
C 0.2
G 0.2
T 0.3
## Arbitrary value for the pseudo-weight
k <- 1 

## Compute corrected frequencies
corrected.freq <- (counts + k*unequal.priors) / apply(counts + k*unequal.priors, 2, sum)
kable(corrected.freq, align="c", digits=2, 
      caption=paste("Frequency matrix with unequal priors and pseudo-weight=",k))
Frequency matrix with unequal priors and pseudo-weight= 1
1 2 3 4 5 6 7 8
A 0.86 0.66 0.46 0.46 0.26 0.06 0.06 0.06
C 0.04 0.04 0.04 0.04 0.24 0.44 0.24 0.04
G 0.04 0.04 0.04 0.24 0.24 0.44 0.64 0.84
T 0.06 0.26 0.46 0.26 0.26 0.06 0.06 0.06
weights <- log(corrected.freq / unequal.priors)
kable(weights, align="c", digits=2, caption = paste("Weight matrix with unequal priors and pseudo-weight =",k))
Weight matrix with unequal priors and pseudo-weight = 1
1 2 3 4 5 6 7 8
A 1.05 0.79 0.43 0.43 -0.14 -1.61 -1.61 -1.61
C -1.61 -1.61 -1.61 -1.61 0.18 0.79 0.18 -1.61
G -1.61 -1.61 -1.61 0.18 0.18 0.79 1.16 1.44
T -1.61 -0.14 0.43 -0.14 -0.14 -1.61 -1.61 -1.61

7. Interpretation of the results.