Solution to the exercise - from TF binding sites to position-specific scoring matrices

Introduction
Site sequences
Alignment matrix
1. Count matrix
2. Raw frequencies (pseudo-weight = 0)
3. Corrected frequencies
4. Weight matrices, no pseudo-weight, equal priors
5. Weight matrix with equal priors and pseudo-weight=1
6. Unequal priors
7. Interpretation of the results.

Introduction

In this document we show the solution of the exercise “From TF binding sites to position-specific weight matrices” using the R language. We display both the code and the results. This document was generated with R markdown from_TFBS_to_PSSM_solutions.Rmd.

Site sequences

## Encode the sequences in a vector
sequences <- c( "AAAAACCG",
                "AAAACCGG",
                "AATTGGGG",
                "ATTGTGGG")
print(as.data.frame(sequences))

  sequences
1  AAAAACCG
2  AAAACCGG
3  AATTGGGG
4  ATTGTGGG

Alignment matrix

## Build an alignment matrix, with one row per site 
## and one column per aligned position
alignment <- t(as.data.frame(strsplit(sequences, "")))
colnames(alignment) <- 1:ncol(alignment)
row.names(alignment) <- paste(sep="_", "site", 1:nrow(alignment))

kable(alignment, align="c")

	1	2	3	4	5	6	7	8
site_1	A	A	A	A	A	C	C	G
site_2	A	A	A	A	C	C	G	G
site_3	A	A	T	T	G	G	G	G
site_4	A	T	T	G	T	G	G	G

1. Count matrix

Build a position-specific scoring matrix indicating the residue counts per position (count matrix).

## Compute a count matrix from the alignment table
counts <- data.frame()
residues <- c("A", "C", "G", "T")
for (r in residues) {
  counts <- rbind(counts, apply(alignment==r, 2, sum))
}
colnames(counts) <- 1:ncol(counts)
row.names(counts) <- residues

kable(counts, align="c")

	1	2	3	4	5	6	7	8
A	4	3	2	2	1	0	0	0
C	0	0	0	0	1	2	1	0
G	0	0	0	1	1	2	3	4
T	0	1	2	1	1	0	0	0

2. Raw frequencies (pseudo-weight = 0)

From this count matrix, derive a frequency matrix.

freq <- counts / apply(counts, 2, sum)
kable(freq, align="c")

	1	2	3	4	5	6	7	8
A	1	0.75	0.5	0.50	0.25	0.0	0.00	0
C	0	0.00	0.0	0.00	0.25	0.5	0.25	0
G	0	0.00	0.0	0.25	0.25	0.5	0.75	1
T	0	0.25	0.5	0.25	0.25	0.0	0.00	0

3. Corrected frequencies

Compute a pseudo-weight smoothed frequency matrix, with pseudo-weights of 1, 10 and 100, respectively.

Equal prior probabilities

In this exercise we use equiprobable nucleotides, prior residue frequencies are thus set to \(p=0.25\) for each residue.

equal.priors <- vector()
equal.priors[residues] <- 1/length(residues)

kable(data.frame(equal.priors), align="c")

	equal.priors
A	0.25
C	0.25
G	0.25
T	0.25

Pseudo-weight = 1

k <- 1 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)

	1	2	3	4	5	6	7	8
A	0.85	0.65	0.45	0.45	0.25	0.05	0.05	0.05
C	0.05	0.05	0.05	0.05	0.25	0.45	0.25	0.05
G	0.05	0.05	0.05	0.25	0.25	0.45	0.65	0.85
T	0.05	0.25	0.45	0.25	0.25	0.05	0.05	0.05

Pseudo-weight = 10

k <- 10 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)

	1	2	3	4	5	6	7	8
A	0.464	0.393	0.321	0.321	0.25	0.179	0.179	0.179
C	0.179	0.179	0.179	0.179	0.25	0.321	0.250	0.179
G	0.179	0.179	0.179	0.250	0.25	0.321	0.393	0.464
T	0.179	0.250	0.321	0.250	0.25	0.179	0.179	0.179

Pseudo-weight = 100

k <- 100 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)

kable(corrected.freq, align="c", digits=3)

	1	2	3	4	5	6	7	8
A	0.279	0.269	0.26	0.26	0.25	0.24	0.240	0.240
C	0.240	0.240	0.24	0.24	0.25	0.26	0.250	0.240
G	0.240	0.240	0.24	0.25	0.25	0.26	0.269	0.279
T	0.240	0.250	0.26	0.25	0.25	0.24	0.240	0.240

4. Weight matrices, no pseudo-weight, equal priors

Compute the weight matrices corresponding to the frequency matrices with no pseudo-weight, assuming equiprobable residues.

## Compute weights
weights <- log(freq / 0.25)

kable(weights, align="c", digits=2)

	1	2	3	4	6	7	8
A	1.39	1.1	0.69	0.69	-Inf	-Inf	-Inf
C	-Inf	-Inf	-Inf	-Inf	0.69	0.0	-Inf
G	-Inf	-Inf	-Inf	0.00	0.69	1.1	1.39
T	-Inf	0.0	0.69	0.00	-Inf	-Inf	-Inf

5. Weight matrix with equal priors and pseudo-weight=1

k <- 1 ## Arbitrary value for the pseudo-weight

## Compute corrected frequencies
corrected.freq <- (counts + k/length(residues)) / apply(counts + k*0.25, 2, sum)
weights <- log(corrected.freq / 0.25)

kable(weights, align="c", digits=2)

	1	2	3	4	6	7	8
A	1.22	0.96	0.59	0.59	-1.61	-1.61	-1.61
C	-1.61	-1.61	-1.61	-1.61	0.59	0.00	-1.61
G	-1.61	-1.61	-1.61	0.00	0.59	0.96	1.22
T	-1.61	0.00	0.59	0.00	-1.61	-1.61	-1.61

6. Unequal priors

Compute the weight matrices with a pseudo-weight=1 and the following prior probabilities: A=0.3, T=0.3, C=0.2, G=0.2.

## Define nucleotide-specific residue priors
unequal.priors <- c("A"=0.3, "C"=0.2, "G"=0.2, "T"=0.3)
kable(data.frame(unequal.priors))

	unequal.priors
A	0.3
C	0.2
G	0.2
T	0.3

## Arbitrary value for the pseudo-weight
k <- 1 

## Compute corrected frequencies
corrected.freq <- (counts + k*unequal.priors) / apply(counts + k*unequal.priors, 2, sum)
kable(corrected.freq, align="c", digits=2, 
      caption=paste("Frequency matrix with unequal priors and pseudo-weight=",k))

Frequency matrix with unequal priors and pseudo-weight= 1
	1	2	3	4	5	6	7	8
A	0.86	0.66	0.46	0.46	0.26	0.06	0.06	0.06
C	0.04	0.04	0.04	0.04	0.24	0.44	0.24	0.04
G	0.04	0.04	0.04	0.24	0.24	0.44	0.64	0.84
T	0.06	0.26	0.46	0.26	0.26	0.06	0.06	0.06

weights <- log(corrected.freq / unequal.priors)
kable(weights, align="c", digits=2, caption = paste("Weight matrix with unequal priors and pseudo-weight =",k))

Weight matrix with unequal priors and pseudo-weight = 1
	1	2	3	4	5	6	7	8
A	1.05	0.79	0.43	0.43	-0.14	-1.61	-1.61	-1.61
C	-1.61	-1.61	-1.61	-1.61	0.18	0.79	0.18	-1.61
G	-1.61	-1.61	-1.61	0.18	0.18	0.79	1.16	1.44
T	-1.61	-0.14	0.43	-0.14	-0.14	-1.61	-1.61	-1.61