In this tutorial, we will get familiar with the different representations of transcription factor binding motifs (consensus, position-specific scoring matrix). For this, we will build motifs from a collection of TF binding sites, and evaluate the quality of these motifs, i.e. their capability to recover the relevant binding sites for the considered transcription factor.
We will proceed in the following way.
Symbol | Description | Name |
---|---|---|
IUPAC | IUPAC code for ambiguous nucleotides | [on RSAT] [on wikipedia] |
RSAT | Regulatory Sequence Analysis Tools | [http://rsat.eu/] |
RegulonDB | Database of transcriptional regulation in the bacteria Escherichia coli | [http://regulondb.ccg.unam.mx/] |
footprintDB | A database of transcription factors with annotated cis elements and binding interfaces | [http://floresta.eead.csic.es/footprintdb/] |
Tool | Usage |
---|---|
RSAT convert-matrix | Convert matrix between different formats +derive various statistics from a count matrix +generate sequence logos |
FootprintDB provides a convenient unified interface to consult a wide variety of transcription factor databases. We will use this interface to choose a test case for the subsequent exercises.
Open a connection to footprintDB (http://floresta.eead.csic.es/footprintdb/).
In the menu on the left side, click Databases.
Inspect the table of databases in order to get an idea of the diversity of data sources, organisms, and evaluate the number of transcription factors, motifs and binding sites for the different source databases.
In the row corresponding to RegulonDB, click on the link Transcription factors. At this stage, each student should select a transcription factor of interest, and write its name on the white board (to avoid duplicates). Note that the factor AraC will be used as study case by the teacher, please thus avoid to select the same factor.
Read the description of your TF of interest. Evaluate the number of annotated binding sites for this factor.
Note: for the next exercises we recommend to select a factor with a reasonable number of annotated sites (5 to 20). Factor with less than 5 binding sites are not suitable to build a reliable weight matrix. Factors with numerous binding sites are fine, but since there is a part of manual work in the subsequent steps, you might spend a lot of time on it.
Come back to the footprintDB Databases page. In the row corresponding to RegulonDB, right-click on the link DNA Binding Motifs.
Open the page of your transcription factor by clicking on its name in the column Accessions. At the bottom of the page, click Download and store the matrix in a text file on your computer.
Open a connection to RSAT teaching server (http://teaching.rsat.eu/). In the menu, open the matrix tools section and open the convert-matrix form. Paste the matrix downloaded from footprintDB, and select the appropriate input format (note that the footprintDB format is a slight variation on the transfac format).
Compute the counts, frequencies, weights, consensus, parameters and logo for your matrix of interest, using an organism-specific background model estimated on all the non-coding upstream sequences from Escherichia coli K12 mg1655.
Inspect carefully the results, with a particular attention to the parameters. Compare the strict and degenerate consensus, evaluate the maximal and minimal weight, and the information content of the matrix.
In the Search box, enter the name of the transcription factor of your choice (the same as in the above exercise with footprintDB), and open the record of the corresponding regulon.
Each student (or student group) should choose a specific motif. Individual results will be collected in a shared documents for the sake of comparison and interpretation. We recommend to choose a regulon containing a reasonable number of sites (e.g. from 5 to 20).
Collect all the aligned TFBS (transcription factor binding sites).
Note: by default, the regulon records of RegulonDB only display 10 sites. In order to collect all sites, increase the number of shown entries (under the title “Aligned TFBs”).
Open a connection to RSAT teaching server (http://www.rsat.eu/), and use the tool convert sequences to convert the site sequences in fasta format.
RegulonDB displays site sequences in a simple format, where each sequence comes on a separate line, without any header line or identifier. We need to convert these sequences to fasta, the most widely used sequence format. For this, set the input format to multi and the output format to fasta. Store the result in a text file on your computer.
Come back to the RegulonDB record of your regulon. Copy the matrix below the title Position-weight matrix (PWM) and store it in a separate text file with extension .tab (this is the classical extension for tab-delimited text file).
On the RSAT Web site.
The IUPAC consensus is very popular in the biological litteratur. In this tutorial, we will seee that this representation, albeit convenient for human communication, generally gives very poor results to predict binding sites for a given transcription factor. For this, we will generate a consensus for some well-annotated transcription factor, and evaluate the capability of this consensus to recover the actual binding sites, as well as the total number of sites matching that consensus in the set of all promoters (for bacteria, we will onsider the upstream non-coding sequences up to the first neighbour gene, with a maximal length of 400bp).
Using the IUPAC code for ambiguous nucleotides, draw an “ad hoc” consensus by applying the following (quite arbitrary) rules to each column of the position weight matrix
Compute manually the probability to find this consensus at a random position of a sequence, assuming that nucleotides are equiprobable.
Use the RSAT tool sequence probability to compute the background probability of the first site using three alternative background models:
Using the tool dna-pattern, scan the annotated sites (the fasta file you generated above) with the ad hoc consensus, and count the number of recovered sites.
Scan the annotated site sequences with the degenerate consensus produced by convert-matrix, and count the number of recovered sites.
Note: The option match count table enables to directly obtain the count of occurrences for one or more patterns in each input sequence.
We will now retrieve the non-coding sequences located upstream of all the genes from Escherichia coli strain K12 substrain MG1655.
Select the organism *Escherichia_coli_K_12_substr__MG1655_uid57779* in the pop-up menu.
Trick: when you click on the Organism pop-up menu, the number of organisms displayed may be relatively large depending on the RSAT server (for example, there are several thousands of bacteria). You can type a subset of the species name in order to select the matching organisms. For example, typing “MG1655” will restrict the list of displayed species and allow you to easily select the desired one.
Besides Genes, click the radio button all. You thus don’t ned to fill up the Genes text area, since you want all the genes of E.coli.
Leave all other parameters unchanged and press GO.
After a few seconds the server will show you the retrieve-seq result page, with a link to the sequences (fasta file), and a Next Step table proposing you to send the resulting sequences as input to other RSAT tools for further analysis.
We will now scan the same sequences (annotated binding sites and collection of all promoters) with the position-specific scoring matrix obtained from RegulonDB. Do to so, we will need to define a crucial parameter: the threshold to consider or not a given sequence as a binding site.
The primary score returned by motif search programs is the so-called weight score. However, defining a threshold on this scrore is somewhat arbitrary, especially because it is easier to achieve a high score with a larger matrix than with a narrow one. The weight score shold thus always be relativized to the matrix width, and to other parameters.
One way to normalize the sequence scanning is to set the thresholds on p-values rather than on weight scores. The RSAT tool matrix-scan allows to directly specify a threshold on the p-value.
Before the actual scanning, we recommend to use the tool matrix-distrib, which computes a complete distribution of p-values for all the possible weight scores that can be computed with a given matrix.
Use the RSAT tool matrix distribution to compute the theoretical distribution of probability for your motif.
Given the size if the upstream sequence set, estimate the number of false positives that would be expected with different threshold values on the weight score: 0, 3, 5, 7, 8, 10.
Compute the number of sites and the sensitivity (fraction of sites recovered) achieved when scanning the annotated binding sites in different conditions:
Compute the same statistics with a randomized version of your matrix, generated with permute-matrix.
How many binding sites and target genes do you obtain when you scan the whole set of promoters in the same conditions, but with a permuted version of the matrix?
In the exercise above, you performed different tasks to evaluate the capability of a TF binding motif (consensus, matrix) to predict binding sites. This evaluation included two complementary parameters: the sensitivity (i.e. fraction of actual binding sites detected) and false posiive rate.
The program matrix-quality automatically runs a series of similar analyses to estimate the tradeoff between sensitivity and false posigive rate, by combining theoretical and empirical distributions of scores computed in various conditions:
Collect the numbers of binding sites predicted in all the conditions above, and compute the derived statistics when possible: FPR, Sensitivity (see concept definitions for a definition of these statistics).
Summarize the most relevant results in a short paragraph, and draw your conclusions about the reliability of the different versions of the motif (consensus, matrix) to predict binding sites.