This tutorial assumes that you have already learned the theory in the following chapters.

Chapter | Slides |
---|---|
Transcriptional regulation - basic concepts | 01.2_regulatory_sequences_intro.pdf |
Position-specific scoring matrices | 01.4.PSSM_theory.pdf |

Tool | Usage |
---|---|
RSAT convert-matrix | Convert matrices between different formats; derive various statistics from a count matrix; generate sequence logos |

Before using computer tools, we will start with an easy exercise on an artificial dataset, in order to understand the principles of the procedure used to build position-specific scoring matrices from aligned binding sites.

Let us assume that we have a collection of 4 binding sites for a given transcription factor, collected by individual footprint experiments.

```
AAAAACCG
AAAACCGG
AATTGGGG
ATTGTGGG
```

**Note**: the example is intentionally minimalist for this exercise. In practice, you should not even try to build matrices from such a small collection of *bona fide* binding sites.

Open an empty spreadsheet in Excel or LibreOffice.

- Build a position-specific scoring matrix indicating the residue counts per position (**count matrix**).
- From this count matrix, derive a **frequency matrix**.
- Compute a **pseudo-weight smoothed frequency matrix**, with pseudo-weights of 1, 10 and 100, respectively.
- Compute the weight matrices corresponding to the frequency matrices with no pseudo-weight, assuming equiprobable residues.
- Do the same with a pseudo-weight of 1.
- Compute the weight matrices with a pseudo-weight of 1 and the following prior probabilities: A=0.3, T=0.3, C=0.2, G=0.2.
- Add a comment on the spreadsheet with your interpretation of the results.
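The spreadsheet steps above can also be sketched in a few lines of Python, which may help to cross-check your formulae. This is a minimal sketch, not the reference solution: the function names are hypothetical, the corrected frequency is assumed to be `(n + k * p) / (N + k)` (count `n`, number of sites `N`, pseudo-weight `k`, prior `p`), and the weight is assumed to be the natural logarithm of the frequency/prior ratio. Adapt the logarithm base if your course material uses another one.

```python
import math

# The four aligned binding sites from the exercise.
sites = ["AAAAACCG", "AAAACCGG", "AATTGGGG", "ATTGTGGG"]
residues = "ACGT"


def count_matrix(sites, residues="ACGT"):
    """Return {residue: [count at each position]} for equal-length sites."""
    width = len(sites[0])
    return {r: [sum(s[j] == r for s in sites) for j in range(width)] for r in residues}


def frequency_matrix(counts, priors, pseudo=0.0):
    """Frequencies corrected by a pseudo-weight k: (n + k * p) / (N + k)."""
    width = len(next(iter(counts.values())))
    totals = [sum(counts[r][j] for r in counts) for j in range(width)]
    return {
        r: [(counts[r][j] + pseudo * priors[r]) / (totals[j] + pseudo) for j in range(width)]
        for r in counts
    }


def weight_matrix(freqs, priors):
    """Weights ln(f / p) (assumed natural log); a zero frequency gives -inf."""
    return {
        r: [math.log(f / priors[r]) if f > 0 else float("-inf") for f in row]
        for r, row in freqs.items()
    }


equal = {r: 0.25 for r in residues}
unequal = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}

counts = count_matrix(sites)
freqs_raw = frequency_matrix(counts, equal)            # no pseudo-weight
freqs_k1 = frequency_matrix(counts, equal, pseudo=1)   # pseudo-weight = 1
weights_raw = weight_matrix(freqs_raw, equal)
weights_k1 = weight_matrix(freqs_k1, equal)
weights_k1_unequal = weight_matrix(frequency_matrix(counts, unequal, pseudo=1), unequal)

for r in residues:
    print(r, counts[r], [round(w, 2) for w in weights_k1[r]])
```

Note that without a pseudo-weight, any residue that is absent from a column gets a weight of minus infinity, which is one motivation for the smoothing step.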

You can now check the solution in various formats:

- Excel sheet with the formulae: [solutions_exo1_fake_matrix.xlsx]
- Printable document from the Excel sheet: [solutions_exo1_fake_matrix.pdf]
- Solution in R (R code + results): [from_TFBS_to_PSSM_solutions.html]
- R markdown document to generate the solution in R: [from_TFBS_to_PSSM_solutions.Rmd]

1. Open a connection to the **Regulatory Sequence Analysis Tools** teaching server (http://teaching.rsat.eu/).
2. In the left-side menu, open the tool set **Matrix tools**, and click on **convert-matrix**.
3. Paste the count matrix from the previous exercise[^1] in the **Matrix** box, and check that the input format is set to **tab**.
4. Set the **pseudo-weights** to *1*, and check the option **distributed in an equiprobable way**.
5. Download to your computer the background frequency file with equal prior probabilities: prior_equal.tab.
6. Under **Background model estimation**, check the option *Upload your own background file*, select *format oligo-analysis*, click on the "Choose file" button, and locate your local copy of the file *prior_equal.tab*.
7. For the **Output format**, select *tab*.
8. Check the following **Output fields**: counts, frequencies, weights, info, header, margins, consensus, parameters, *logo* with *error bars* and *small sample corrections*.
9. Set the **decimals** to 2.
10. Click **GO**.
11. Come back to the form, modify the pseudo-weight, and check the effect on the sequence logo.
12. Repeat steps 5 to 10 with unequal priors from the file prior_unequal.tab.
13. Check that you obtain the same results as with your manual computations in the previous exercises.
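If you prefer to generate the count matrix for the **Matrix** box rather than typing it by hand, the sketch below prints it as tab-delimited text. The layout used here (one row per residue, the residue letter followed by the counts per position) is an assumption about what the *tab* input format expects, and the function name `to_tab` is hypothetical; compare the output with the provided file fake_matrix.tab before relying on it.

```python
# The four aligned binding sites from the first exercise.
sites = ["AAAAACCG", "AAAACCGG", "AATTGGGG", "ATTGTGGG"]


def count_matrix(sites, residues="ACGT"):
    """Return {residue: [count at each position]} for equal-length sites."""
    width = len(sites[0])
    return {r: [sum(s[j] == r for s in sites) for j in range(width)] for r in residues}


def to_tab(counts):
    """Format the counts as tab-delimited rows (assumed layout, see the text)."""
    rows = ["\t".join([residue] + [str(n) for n in row]) for residue, row in counts.items()]
    return "\n".join(rows)


tab_text = to_tab(count_matrix(sites))
print(tab_text)
```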

Play around with the matrix conversion tool by progressively increasing the pseudo-weight (this exercise should be done with the unequal prior file as background).

- What is the effect of this parameter on the frequency matrix and the weight matrix, respectively?
- What happens when the pseudo-weight tends towards infinity (try with a sufficiently large value, e.g. 10000)?
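To explore these two questions outside the web form, the following sketch recomputes the first column of the matrix for increasing pseudo-weights, using the unequal priors of the exercise. It relies on the same assumptions as a manual computation would (corrected frequency `(n + k * p) / (N + k)`, weight as the natural logarithm of the frequency/prior ratio), and `column_stats` is a hypothetical helper name. Run it and compare the printed frequencies with the priors, and the weights with zero.

```python
import math

sites = ["AAAAACCG", "AAAACCGG", "AATTGGGG", "ATTGTGGG"]
priors = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # unequal priors from the exercise


def column_stats(sites, position, priors, pseudo):
    """Return ({residue: frequency}, {residue: weight}) for one column.

    Requires pseudo > 0 and non-zero priors, so that no frequency is zero.
    """
    column = [s[position] for s in sites]
    n_sites = len(column)
    freqs = {r: (column.count(r) + pseudo * p) / (n_sites + pseudo) for r, p in priors.items()}
    weights = {r: math.log(freqs[r] / priors[r]) for r in priors}
    return freqs, weights


# First column of the alignment (A in all four sites), for growing pseudo-weights.
for pseudo in (1, 10, 100, 10000):
    freqs, weights = column_stats(sites, 0, priors, pseudo)
    print(pseudo,
          {r: round(f, 3) for r, f in freqs.items()},
          {r: round(w, 3) for r, w in weights.items()})
```

Changing the `position` argument lets you repeat the comparison for the other columns.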

[^1]: The count matrix can also be found here: fake_matrix.tab