From TF binding sites to consensuses and matrices

Prerequisites
Tools used in this tutorial
Exercise - From count matrices to weight matrices
- Solutions for the exercise
Tutorial - matrix conversion
- Exercise: impact of the pseudo-weight

Prerequisites

This tutorial assumes that you have already learned the theory covered in the following chapters.

Chapter	Slides
Transcriptional regulation - basic concepts	01.2_regulatory_sequences_intro.pdf
Position-specific scoring matrices	01.4.PSSM_theory.pdf

Tools used in this tutorial

Tool	Usage
RSAT convert-matrix	Convert matrices between different formats; derive various statistics from a count matrix; generate sequence logos

Exercise - From count matrices to weight matrices

Before using computer tools, we will start with an easy exercise on an artificial dataset, in order to grasp the principles of the procedure used to build position-specific scoring matrices from aligned binding sites.

Let us assume that we have at our disposal a collection of 4 binding sites for a given transcription factor, collected by individual footprint experiments.

 AAAAACCG
 AAAACCGG
 AATTGGGG
 ATTGTGGG

Note: the example is intentionally minimalist for this exercise. In practice, you should not even try to build matrices from such a small collection of bona fide binding sites.

Open an empty spreadsheet in Excel or LibreOffice.

Build a position-specific scoring matrix indicating the residue counts per position (count matrix).
From this count matrix, derive a frequency matrix.
Compute pseudo-weight-smoothed frequency matrices, with pseudo-weights of 1, 10 and 100, respectively.
Compute the weight matrix corresponding to the frequency matrix with no pseudo-weight, assuming equiprobable residues.
Do the same with a pseudo-weight of 1.
Compute the weight matrix with a pseudo-weight of 1 and the following prior probabilities: A=0.3, T=0.3, C=0.2, G=0.2.
Add a comment on the spreadsheet with your interpretation of the results.
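
As a reminder for the steps above, the usual formulas can be written as follows. This is a sketch under stated assumptions: the symbols are introduced here for illustration (n_{i,j} is the count of residue i at position j, N the number of sites, p_i the prior probability of residue i, k the pseudo-weight), the pseudo-weight is assumed to be shared out among residues in proportion to their priors, and the weight is assumed to use the natural logarithm. Check these conventions against the theory slides before relying on them.

```latex
\begin{align}
f_{i,j}  &= \frac{n_{i,j}}{N} \\
f'_{i,j} &= \frac{n_{i,j} + p_i\,k}{N + k} \\
W_{i,j}  &= \ln\!\left(\frac{f'_{i,j}}{p_i}\right)
\end{align}
```

For example, residue A at position 1 is counted 4 times in 4 sites. With k = 1 and equiprobable residues (p_A = 0.25), this gives f' = (4 + 0.25) / 5 = 0.85 and W = ln(0.85 / 0.25) ≈ 1.22.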

Solutions for the exercise

You can now check the solution in various formats:

Excel sheet with the formulae: [solutions_exo1_fake_matrix.xlsx]
Printable document from the Excel sheet: [solutions_exo1_fake_matrix.pdf]
Solution in R (R code + results): [from_TFBS_to_PSSM_solutions.html]
R markdown document to generate the solution in R: [from_TFBS_to_PSSM_solutions.Rmd]
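
For readers who prefer to cross-check their spreadsheet with a script, the computation can also be sketched in Python. This is not one of the provided solutions: the function names are hypothetical, the pseudo-weight is assumed to be distributed in proportion to the priors, and the weights are assumed to use the natural logarithm. Without pseudo-weight, a residue that is never observed at a position has a frequency of 0, which the sketch reports as a weight of minus infinity.

```python
import math

RESIDUES = "ACGT"
SITES = ["AAAAACCG", "AAAACCGG", "AATTGGGG", "ATTGTGGG"]


def count_matrix(sites):
    """Count each residue at each position of the aligned sites."""
    width = len(sites[0])
    return {r: [sum(site[j] == r for site in sites) for j in range(width)]
            for r in RESIDUES}


def frequency_matrix(counts, priors, pseudo=0.0):
    """Relative frequencies; the pseudo-weight is shared out according to the priors."""
    n_sites = sum(counts[r][0] for r in RESIDUES)  # each column sums to the number of sites
    return {r: [(n + priors[r] * pseudo) / (n_sites + pseudo) for n in counts[r]]
            for r in RESIDUES}


def weight_matrix(freqs, priors):
    """Natural-log ratio of frequency to prior (-inf when the frequency is 0)."""
    return {r: [math.log(f / priors[r]) if f > 0 else float("-inf") for f in freqs[r]]
            for r in RESIDUES}


# Prior probabilities used in the exercise.
equal = {r: 0.25 for r in RESIDUES}
unequal = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

counts = count_matrix(SITES)
weights_equal = weight_matrix(frequency_matrix(counts, equal, pseudo=1), equal)
weights_unequal = weight_matrix(frequency_matrix(counts, unequal, pseudo=1), unequal)

# Print counts and rounded weights (equal priors, pseudo-weight of 1), one row per residue.
for r in RESIDUES:
    print(r, counts[r], [round(w, 2) for w in weights_equal[r]])
```

Changing the `pseudo` argument to 10 or 100 reproduces the other smoothed frequency matrices requested in the exercise.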

Tutorial - matrix conversion

Open a connection to the Regulatory Sequence Analysis Tools teaching server (http://teaching.rsat.eu/).
In the left-side menu, open the tool set Matrix tools, and click on convert-matrix.
Paste the count matrix from the previous exercise¹ in the Matrix box, and check that the input format is set to tab.
Set the pseudo-weights to 1, and check the option distributed in an equiprobable way.
Download to your computer the background frequency file with equal prior probabilities: prior_equal.tab.
Under Background model estimation, check the option Upload your own background file, select format oligo-analysis, click on the “Choose file” button, and locate your local copy of the file prior_equal.tab.
For the Output format, select tab.
Check the following Output fields: counts, frequencies, weights, info, header, margins, consensus, parameters, logo with error bars and small sample corrections.
Set the decimals to 2.
Click GO.
Come back to the form, modify the pseudo-weight, and check the effect on the sequence logo.
Repeat steps 5 to 10 with unequal priors from the file prior_unequal.tab.
Check that you obtain the same results as with your manual computations in the previous exercises.
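
To help compare the info output field with a manual computation, the information content of each column can be sketched as follows. The definition used here (sum over residues of f' × ln(f' / p), with cells of frequency 0 contributing nothing) and the natural logarithm are assumptions; the tool may use a different convention or log base, so check its manual before comparing numbers. The count matrix is the one derived from the four example sites, and the function names are hypothetical.

```python
import math

# Count matrix of the four example sites (rows: residues, columns: positions 1 to 8).
counts = {
    "A": [4, 3, 2, 2, 1, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 2, 1, 0],
    "G": [0, 0, 0, 1, 1, 2, 3, 4],
    "T": [0, 1, 2, 1, 1, 0, 0, 0],
}
priors = {r: 0.25 for r in counts}  # equiprobable residues


def corrected_frequencies(counts, priors, pseudo):
    """Frequencies smoothed with a pseudo-weight shared out according to the priors."""
    n_sites = sum(row[0] for row in counts.values())
    return {r: [(n + priors[r] * pseudo) / (n_sites + pseudo) for n in row]
            for r, row in counts.items()}


def information_per_column(freqs, priors):
    """Sum over residues of f * ln(f / p) for each column; zero frequencies are skipped."""
    width = len(next(iter(freqs.values())))
    return [sum(freqs[r][j] * math.log(freqs[r][j] / priors[r])
                for r in freqs if freqs[r][j] > 0)
            for j in range(width)]


freqs = corrected_frequencies(counts, priors, pseudo=1)
info = information_per_column(freqs, priors)
print([round(x, 2) for x in info])
```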

Exercise: impact of the pseudo-weight

Play around with the matrix conversion tool by progressively increasing the pseudo-weight (this exercise should be done with the unequal prior file as background).

What is the effect of this parameter on the frequency matrix and the weight matrix, respectively?
What happens when the pseudo-weight tends towards infinity (try with a sufficiently large value, e.g. 10000)?
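
The same exploration can be done offline with a few lines of Python, to compare with what the tool returns. This is only a sketch: it treats a single column (position 1 of the example matrix), assumes that the unequal prior file contains the priors given in the first exercise (A=0.3, T=0.3, C=0.2, G=0.2), assumes that the pseudo-weight is distributed according to these priors, and uses the natural logarithm for the weights. The function name is hypothetical.

```python
import math

# Position 1 of the example count matrix: the four sites all start with A.
column = {"A": 4, "C": 0, "G": 0, "T": 0}
# Unequal priors from the first exercise (assumed to match prior_unequal.tab).
priors = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def corrected(column, priors, pseudo):
    """Return (frequencies, weights) of one column for a strictly positive pseudo-weight."""
    n_sites = sum(column.values())
    freqs = {r: (n + priors[r] * pseudo) / (n_sites + pseudo) for r, n in column.items()}
    weights = {r: math.log(f / priors[r]) for r, f in freqs.items()}
    return freqs, weights


# Print the corrected frequencies and weights for increasing pseudo-weights.
for pseudo in (1, 10, 100, 10000):
    freqs, weights = corrected(column, priors, pseudo)
    print(pseudo,
          {r: round(f, 3) for r, f in freqs.items()},
          {r: round(w, 3) for r, w in weights.items()})
```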

¹ The count matrix can also be found here: fake_matrix.tab