Solutions to the exercises

This file provides the solutions to the practical "Scanning non-coding sequences with a TFBM", using the command-line tools of the RSAT software suite.

Reference genome

Collective table for the 2020 practical

Students will store their results in a shared spreadsheet. Comparing the results obtained with different transcription factors will give a broader picture than any single analysis.

On your computer, create a folder to store the results of this practical, for example: $HOME/LCG_BEII_practicals/ (you can change the path and name to fit your own folder organisation).

Choosing a TF on RegulonDB

For this exercise, I chose the transcription factor AraC.

I define this in an environment variable. I also define and create a specific directory for the results related to this transcription factor.
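A minimal sketch of this setup in the shell, assuming the folder suggested earlier for the practical. The variable names TF, WORK_DIR and TF_DIR are illustrative choices, not names required by RSAT.

```bash
# Transcription factor chosen for this exercise
export TF=AraC

# Main folder of the practical (adapt the path to your own organisation)
export WORK_DIR="${HOME}/LCG_BEII_practicals"

# Directory dedicated to the results for this transcription factor
export TF_DIR="${WORK_DIR}/${TF}"
mkdir -p "${TF_DIR}"

echo "Results for ${TF} will be stored in ${TF_DIR}"
```

Storing the factor name in a variable means that the commands of the practical can be rerun for another factor by changing a single line.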

I use REST web services to gather the annotations from RegulonDB automatically. A REST web service lets you invoke a remote resource (database, software tool) by composing a URL from an entry point, which specifies the type of query, and a set of parameters separated by &.

For example, the list of genes regulated by AraC can be found at the following URL.

http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC

They can then be stored in a file with a command-line download tool, such as curl or wget.
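The sketch below rebuilds the URL shown above from its parts (base address, entry point, parameter) and wraps the download in a small function. The function name download_regulated_genes and the output file name are illustrative assumptions, and the web service may no longer respond at this address, so the actual download call is left commented out.

```bash
# Parts of the REST query, taken from the URL given in the text
REGULONDB_WS="http://regulondb.ccg.unam.mx/webresources"
ENTRY_POINT="regulon/getRegulatedGenes"   # type of query
TF="AraC"                                 # value of the tfObject parameter

# Compose the URL: entry point, then "?", then parameter=value
# (further parameters would be appended with "&")
URL="${REGULONDB_WS}/${ENTRY_POINT}?tfObject=${TF}"
echo "${URL}"

# Hypothetical helper: store the result of the query in the file given as $1
download_regulated_genes() {
  curl --silent --show-error --fail --output "$1" "${URL}"
  # Equivalent with wget:
  #   wget --quiet --output-document="$1" "${URL}"
}

# Uncomment to run the download (requires network access)
# download_regulated_genes "${TF}_regulated_genes.txt"
```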

Computing the degenerate consensus from the reference matrix

The degenerate consensus can be computed with convert-matrix with the appropriate parameters. Since it is printed as a comment (rows starting with ;), we can extract its actual value with grep and cut.
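The extraction step can be sketched as follows. The toy file below stands in for the convert-matrix output: its layout (a tab-separated comment row labelled consensus.IUPAC) is an assumption to check against your own output, and the consensus string is made up for the demonstration; it is not the AraC consensus.

```bash
# Toy stand-in for the output of convert-matrix (assumed layout, made-up values):
# comment rows start with ";" and fields are separated by tabs
DEMO_OUT=$(mktemp)
printf '; toy output imitating convert-matrix\n' > "${DEMO_OUT}"
printf '; consensus.IUPAC\tWWTCAYNNR\n' >> "${DEMO_OUT}"
printf 'a\t3\t4\t0\n' >> "${DEMO_OUT}"

# In a real session the file would be produced by a command along these lines
# (options to be verified with convert-matrix -help):
#   convert-matrix -i matrix.tf -from transfac -to tab -return consensus

# Select the comment row holding the consensus, then keep its second field
CONSENSUS=$(grep '^; consensus\.IUPAC' "${DEMO_OUT}" | cut -f 2)
echo "${CONSENSUS}"
```

Storing the value in a variable makes it easy to reuse the consensus as the query pattern of a later string-based search.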

Coverage of the annotated binding sites by the reference motif

Choosing a background model for matrix-scan

To scan sequences with a matrix, we need to specify a background model. We can either compute it from the input sequences themselves (option -bginput) or specify a predefined background model file (option -bgfile).
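To make the notion of a background model concrete, here is a sketch of the simplest case: a Markov model of order 0, that is, the single-nucleotide frequencies estimated from the input sequences on a single strand. It only illustrates the principle of estimating a background from the input; it is not the RSAT implementation, and the two toy sequences are made up.

```bash
# Toy FASTA file (made-up sequences)
FASTA=$(mktemp)
cat > "${FASTA}" <<'EOF'
>seq1
AACG
>seq2
TTAC
EOF

# Count each residue over all sequence lines, then print relative frequencies
BG=$(awk '
  !/^>/ {
    seq = toupper($0)
    n = length(seq)
    for (i = 1; i <= n; i++) { count[substr(seq, i, 1)]++; total++ }
  }
  END {
    split("A C G T", letters, " ")
    for (j = 1; j <= 4; j++)
      printf "%s\t%.3f\n", letters[j], count[letters[j]] / total
  }
' "${FASTA}")

echo "${BG}"
```

Higher-order models follow the same idea with longer words (k-mers) instead of single residues.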

Pre-computed background models are available in RSAT for each organism, with different parameters:

  • oligonucleotides or dyads,

  • k-mer length,

  • frequencies counted on a single strand or on both strands,

  • self-overlapping occurrences counted or not (relevant for periodic patterns).

Compare the coverage rate of the two motifs

Binding site prediction in all promoters

  • Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E. coli.

  • For matrix-scan, run the analysis with a p-value threshold of either 0.001 or 0.0001.

  • Compare the number of matches obtained in these respective searches.

  • With the respective p-values used for the scanning, how many matches would you expect by chance?
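For the last question, the number of matches expected by chance is the p-value threshold multiplied by the number of scanned positions. The sketch below applies this to hypothetical dimensions: the number of promoters, their length and the matrix width are assumptions to replace with the values of your own analysis.

```bash
# Hypothetical scan dimensions (replace with the values of your own analysis)
N_SEQ=1000      # number of promoter sequences (assumed)
SEQ_LEN=200     # length of each promoter in bp (assumed)
MOTIF_WIDTH=18  # number of columns of the matrix (assumed)
STRANDS=2       # 2 if both strands are scanned, 1 otherwise

# Expected matches = positions per sequence x sequences x strands x p-value
expected_matches() {
  # $1: p-value threshold
  awk -v n="${N_SEQ}" -v l="${SEQ_LEN}" -v w="${MOTIF_WIDTH}" \
      -v s="${STRANDS}" -v p="$1" \
      'BEGIN { printf "%.1f\n", n * (l - w + 1) * s * p }'
}

for PVAL in 0.001 0.0001; do
  echo "p-value ${PVAL}: $(expected_matches "${PVAL}") matches expected by chance"
done
```

With these assumed dimensions there are 366,000 scanned positions, hence 366.0 expected matches at 0.001 and 36.6 at 0.0001. Observed counts close to these values suggest that most matches are false positives.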

Negative control 1: scan artificial sequences with your motif

  • RSAT random sequences

Negative control 2: permute the columns of the matrix

  • Use the tool permute-matrix in order to generate 10 randomized copies of the motif

  • Send these randomized matrices to convert-matrix and check their logos.

  • Run the same analyses as above with the randomized matrix

  • Compare the number of sites obtained between the RegulonDB matrix and the randomized matrix derived from it.
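To see what column permutation does to a matrix, here is a stand-alone sketch with awk. It is an illustration of the principle, not the permute-matrix implementation: the function name permute_columns, the tab-delimited layout and the toy counts are assumptions made for the example.

```bash
# Toy count matrix (made-up values): one row per residue, one column per position
MATRIX=$(mktemp)
printf 'a\t10\t0\t1\t5\nc\t0\t10\t2\t5\ng\t0\t0\t3\t0\nt\t0\t0\t4\t0\n' > "${MATRIX}"

# Hypothetical helper: apply the same random column order to every row
permute_columns() {
  # $1: matrix file, $2: random seed (for reproducibility)
  awk -F '\t' -v OFS='\t' -v seed="$2" '
    NR == 1 {
      srand(seed)
      ncol = NF - 1
      for (i = 1; i <= ncol; i++) perm[i] = i + 1   # field 1 is the residue
      # Fisher-Yates shuffle of the column indices
      for (i = ncol; i > 1; i--) {
        j = int(rand() * i) + 1
        tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp
      }
    }
    {
      line = $1
      for (i = 1; i <= ncol; i++) line = line OFS $(perm[i])
      print line
    }
  ' "$1"
}

PERMUTED=$(permute_columns "${MATRIX}" 42)
echo "${PERMUTED}"
```

Each column is moved as a whole, so the information content of every position is preserved while the order of the positions, and hence the recognised sequence, is destroyed. This is what makes the permuted matrix a suitable negative control.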