Scanning sequences to predict transcription factor binding sites

Objectives
Resources
Tools used in this tutorial
Tutorial: collecting motifs
- Protocol : collecting motifs from footprintDB
- Protocol: collecting information from RegulonDB
Scanning sequences with IUPAC consensus
Scanning sequences with a position-specific scoring matrix
Evaluating the quality of a position-specific scoring matrix
Report

Objectives

In this tutorial, we will get familiar with the different representations of transcription factor binding motifs (consensus, position-specific scoring matrix). For this, we will build motifs from a collection of TF binding sites, and evaluate the quality of these motifs, i.e. their capability to recover the relevant binding sites for the considered transcription factor.

We will proceed in the following way.

Collect binding sites and annotated motif (position-specific scoring matrix) for a transcription factor of interest.
Produce an “ad hoc” consensus of the matrix, based on rules (ad hoc thresholds on residue frequencies).
Produce a consensus with the tool convert-matrix.
Estimate the generality of these consensus by counting the number of matches among the sites used to build them.
Estimate the sensitivity of these consensus by counting the number of operons containing at least one occurrence in the leading gene.

Resources

Symbol	Description	Name
IUPAC	IUPAC code for ambiguous nucleotides	[on RSAT] [on wikipedia]
RSAT	Regulatory Sequence Analysis Tools	[http://rsat.eu/]
RegulonDB	Database of transcriptional regulation in the bacteria Escherichia coli	[http://regulondb.ccg.unam.mx/]
footprintDB	A database of transcription factors with annotated cis elements and binding interfaces	[http://floresta.eead.csic.es/footprintdb/]

Tools used in this tutorial

Tool	Usage
RSAT convert-matrix	Convert matrix between different formats +derive various statistics from a count matrix +generate sequence logos

Tutorial: collecting motifs

Protocol : collecting motifs from footprintDB

FootprintDB provides a convenient unified interface to consult a wide variety of transcription factor databases. We will use this interface to choose a test case for the subsequent exercises.

Open a connection to footprintDB (http://floresta.eead.csic.es/footprintdb/).
In the menu on the left side, click Databases.
Inspect the table of databases in order to get an idea of the diversity of data sources, organisms, and evaluate the number of transcription factors, motifs and binding sites for the different source databases.
In the row corresponding to RegulonDB, click on the link Transcription factors. At this stage, each student should select a transcription factor of interest, and write its name on the white board (to avoid duplicates). Note that the factor AraC will be used as study case by the teacher, please thus avoid to select the same factor.
Read the description of your TF of interest. Evaluate the number of annotated binding sites for this factor.

Note: for the next exercises we recommend to select a factor with a reasonable number of annotated sites (5 to 20). Factor with less than 5 binding sites are not suitable to build a reliable weight matrix. Factors with numerous binding sites are fine, but since there is a part of manual work in the subsequent steps, you might spend a lot of time on it.

Come back to the footprintDB Databases page. In the row corresponding to RegulonDB, right-click on the link DNA Binding Motifs.
Open the page of your transcription factor by clicking on its name in the column Accessions. At the bottom of the page, click Download and store the matrix in a text file on your computer.
Open a connection to RSAT teaching server (http://teaching.rsat.eu/). In the menu, open the matrix tools section and open the convert-matrix form. Paste the matrix downloaded from footprintDB, and select the appropriate input format (note that the footprintDB format is a slight variation on the transfac format).
Compute the counts, frequencies, weights, consensus, parameters and logo for your matrix of interest, using an organism-specific background model estimated on all the non-coding upstream sequences from Escherichia coli K12 mg1655.

Inspect carefully the results, with a particular attention to the parameters. Compare the strict and degenerate consensus, evaluate the maximal and minimal weight, and the information content of the matrix.

Protocol: collecting information from RegulonDB

Open a connection to RegulonDB (http://regulondb.ccg.unam.mx/).
In the Search box, enter the name of the transcription factor of your choice (the same as in the above exercise with footprintDB), and open the record of the corresponding regulon.

Each student (or student group) should choose a specific motif. Individual results will be collected in a shared documents for the sake of comparison and interpretation. We recommend to choose a regulon containing a reasonable number of sites (e.g. from 5 to 20).
- If you have no idea about TF names, leave th equery box empty and click Search, to obtain a full list of operon names. You can then pick up some regulons at random, and check if they have a reasonable number of sites.
- If you still feel short of inspiration, you can choose a factor among the following ones.
  - FadR
  - TyrR
  - LexA
  - OmpR
  - …
Collect all the aligned TFBS (transcription factor binding sites).

Note: by default, the regulon records of RegulonDB only display 10 sites. In order to collect all sites, increase the number of shown entries (under the title “Aligned TFBs”).
Open a connection to RSAT teaching server (http://www.rsat.eu/), and use the tool convert sequences to convert the site sequences in fasta format.

RegulonDB displays site sequences in a simple format, where each sequence comes on a separate line, without any header line or identifier. We need to convert these sequences to fasta, the most widely used sequence format. For this, set the input format to multi and the output format to fasta. Store the result in a text file on your computer.
Come back to the RegulonDB record of your regulon. Copy the matrix below the title Position-weight matrix (PWM) and store it in a separate text file with extension .tab (this is the classical extension for tab-delimited text file).
On the RSAT Web site.
- Open the tool convert matrix, and paste the RegulonDB matrix in the Matrix box;
- Check that the input format is set to tab;
- Select an organism-specific background model estimated from all the non-coding upstream sequences (upstream-noorf) of the reference strain Escherichia coli K-12 substr MG1655 uid57779;
- In the output fields, check “frequencies”, “margins” and “parameters”, in addition to the pre-selected options.
- Set the Score decimals to 2.
- Click GO
- Inspect the logo, the strict and degenerate consensus. Comptare this with the results you obtained from the fake matrix in the tutorial “from sites to motifs”.

Scanning sequences with IUPAC consensus

The IUPAC consensus is very popular in the biological litteratur. In this tutorial, we will seee that this representation, albeit convenient for human communication, generally gives very poor results to predict binding sites for a given transcription factor. For this, we will generate a consensus for some well-annotated transcription factor, and evaluate the capability of this consensus to recover the actual binding sites, as well as the total number of sites matching that consensus in the set of all promoters (for bacteria, we will onsider the upstream non-coding sequences up to the first neighbour gene, with a maximal length of 400bp).

Computing a “naive” consensus

Using the IUPAC code for ambiguous nucleotides, draw an “ad hoc” consensus by applying the following (quite arbitrary) rules to each column of the position weight matrix

If a single letter covers >60% of the occurrences, use a single letter
else if the sum of occurrences of 2 letters cover >70% of the occurrences, use a two-letter code;
else if the sum of occurrences of 3 letters cover >90% of the occurrences, use a three-letter code;
else write “N”

Computing the probability of a degenerate consensus

Compute manually the probability to find this consensus at a random position of a sequence, assuming that nucleotides are equiprobable.
Use the RSAT tool sequence probability to compute the background probability of the first site using three alternative background models:
1. Bernoulli model (== Markov model of order 0) whose parameters were estimated on the set of non-coding upstream sequences of Escherichia coli.
2. Markov model of order 1 whose parameters were estimated on the set of non-coding upstream sequences of Escherichia coli.

Generating a consensus from weights

Use the RSAT tool convert matrix to generate the logo of your motif, using a Markov model estimated from the non-coding upstream sequences of Escherichia coli K-12 MG1655 (option upstream-noorf).

Consensus matches among the annotated binding sites

Using the tool dna-pattern, scan the annotated sites (the fasta file you generated above) with the ad hoc consensus, and count the number of recovered sites.
Scan the annotated site sequences with the degenerate consensus produced by convert-matrix, and count the number of recovered sites.

Note: The option match count table enables to directly obtain the count of occurrences for one or more patterns in each input sequence.

Retrieving all the E.coli promoter sequences

We will now retrieve the non-coding sequences located upstream of all the genes from Escherichia coli strain K12 substrain MG1655.
1. In the left menu of the RSAT Web site, open the Sequence tools box and click on retrieve sequences.
2. Select the organism *Escherichia_coli_K_12_substr__MG1655_uid57779* in the pop-up menu.
  
  Trick: when you click on the Organism pop-up menu, the number of organisms displayed may be relatively large depending on the RSAT server (for example, there are several thousands of bacteria). You can type a subset of the species name in order to select the matching organisms. For example, typing “MG1655” will restrict the list of displayed species and allow you to easily select the desired one.
3. Besides Genes, click the radio button all. You thus don’t ned to fill up the Genes text area, since you want all the genes of E.coli.
4. Leave all other parameters unchanged and press GO.

After a few seconds the server will show you the retrieve-seq result page, with a link to the sequences (fasta file), and a Next Step table proposing you to send the resulting sequences as input to other RSAT tools for further analysis.

Scanning all the promoters with a consensus

Use the tool retrieve-seq followed by the tool dna-pattern to scan all E.coli promoters (same strain as above) with the two consensus defined above (ad hoc and produced by convert-matrix), and estimate the number of recovered operons.

Scanning sequences with a position-specific scoring matrix

We will now scan the same sequences (annotated binding sites and collection of all promoters) with the position-specific scoring matrix obtained from RegulonDB. Do to so, we will need to define a crucial parameter: the threshold to consider or not a given sequence as a binding site.

The primary score returned by motif search programs is the so-called weight score. However, defining a threshold on this scrore is somewhat arbitrary, especially because it is easier to achieve a high score with a larger matrix than with a narrow one. The weight score shold thus always be relativized to the matrix width, and to other parameters.

One way to normalize the sequence scanning is to set the thresholds on p-values rather than on weight scores. The RSAT tool matrix-scan allows to directly specify a threshold on the p-value.

Before the actual scanning, we recommend to use the tool matrix-distrib, which computes a complete distribution of p-values for all the possible weight scores that can be computed with a given matrix.

Estimating the theoretical distribution of scores

Use the RSAT tool matrix distribution to compute the theoretical distribution of probability for your motif.
Given the size if the upstream sequence set, estimate the number of false positives that would be expected with different threshold values on the weight score: 0, 3, 5, 7, 8, 10.

Estimating the sensitivity of a position-specific scoring matrix

Compute the number of sites and the sensitivity (fraction of sites recovered) achieved when scanning the annotated binding sites in different conditions:
1. With the different consensuses
2. With the position-specific scoring matrix from RegulonDB (you can download it from footprintDB), using a p-value threshold of 0.001.
3. With the position-specific scoring matrix from RegulonDB (you can download it from footprintDB), using a p-value threshold of 0.0001.

Predicting binding sites in all promoters

How many binding sites and target genes do you obtain when you scan the whole set of promoters with p-value thresholds of respectively 0.001 and 0.0001.

Estimating the rate of false positives with a randomized motif

Compute the same statistics with a randomized version of your matrix, generated with permute-matrix.
How many binding sites and target genes do you obtain when you scan the whole set of promoters in the same conditions, but with a permuted version of the matrix?

Estimating the rate of false positives with random sequences

Use the tool random-seq to generate artificial sequences of the same length of the promoter set, and scan them with the TF binding motif, and the permuted matrix.

Evaluating the quality of a position-specific scoring matrix

In the exercise above, you performed different tasks to evaluate the capability of a TF binding motif (consensus, matrix) to predict binding sites. This evaluation included two complementary parameters: the sensitivity (i.e. fraction of actual binding sites detected) and false posiive rate.

The program matrix-quality automatically runs a series of similar analyses to estimate the tradeoff between sensitivity and false posigive rate, by combining theoretical and empirical distributions of scores computed in various conditions:

scanning a positive sequence set (e.g. annotated sites);
scanning a negative set (e.g. random sequences, or biological sequences containing no site);
scanning the same datasets with permuted matrices.

In the left-side menu of RSAT, open the tool matrix-quality.
Conceive a way to test the quality of the TF binding matrix obtained from RegulonDB, based on the previous exercises.

Report

Collect the numbers of binding sites predicted in all the conditions above, and compute the derived statistics when possible: FPR, Sensitivity (see concept definitions for a definition of these statistics).

Summarize the most relevant results in a short paragraph, and draw your conclusions about the reliability of the different versions of the motif (consensus, matrix) to predict binding sites.