Scanning non-coding sequences with a TFBM

Introduction

The goal of this practical is to evaluate the respective performances of two modes of representations for transcription factor binding motifs (TFBMs) to predict transcription factor binding sites (TFBS).

Parameter	Value
Reference genome	`Escherichia_coli_GCF_001308065.1_ASM130806v1`

Collective table for the 2020 practical

Students will store their results in a shared spreadsheet, which will be used to compare their results and get a broader landscape from the comparison of the results obtained with different transcription factors.

Folder: https://tinyurl.com/lcg-beii-2020
Motif scanning exercise:

In your computer, create a folder to store the results of this practical, for example : $HOME/LCG_BEII_practicals/ (you can change the path and name according to your own organisation of folders).

Choosing a TF on RegulonDB

Open a connection to RegulonDB http://regulondb.ccg.unam.mx/
Click on the link regulon list. This opens a table with all the regulons.
Choose a TF of interest and open its record
Fill up the details of the collective exploration table (https://tinyurl.com/lcg-beii-19).
Save a text file with the target gene names (one per row)
Save another text file with the names of the operon leader genes (one gene per row). These will serve as reference to compute the rate of recovery of the target genes with the different motif representations (consensus or matrix, resp. ).
Save a fasta file with the sequences of the known binding sites for your TF (tip: click on the bug “+” button in the header of the binding site section)
Save in a text file the matrix associated to your factor.

Computing the degenerate consensus from the reference matrix

Connect RSAT server: http://rsat.eu/
Choose the bacterial server
Use convert-matrix to compute frequencies, weights, parameters and display a logo of your matrix.
In the result, get the degenerated consensus and save it to a separate text file.

Getting all upstream (“promoter”) sequences of E.coli

Open the tool retrieve-seq
Select organism Escherichia coli K12 (top : type simply K12 in the organism query box)
Set all parameters to get the non-coding sequences located upstream of all genes with a maximal distance of 400 bp from the gene start
Copy the URL of the result file and save it in a text file (we will use it several times below)

Coverage of the annotated binding sites by the reference motif

Use dna-pattern to scan the annotated binding sites (extracted from RegulonDB) with the degenerate consensus.
Use matrix-scan to scan the same sites with the RegulonDB matrix
Compare the coverage rate of the two motifs

Binding site prediction in all promoters

Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E.coli.
For matrix-scan, run the analysis with a threshold of p-value of either 0.001 or 0.0001.
Compare the number of matches obtained in these respective searches.
With the respective p-values used for the scanning, how many matches would you expect by chance ?

Negative control 1: scan artificial sequences with your motif

RSAT random sequences, generated with a Markov model of order 1 whose parameters were estimated from upstream non-coding sequences of the genome of reference.

Negative control 2: permute the columns of the matrix

Use the tool permute-matrix in order to generate 10 randomized copies of the motif
Send these randomiazed matrices to convert-matrix and check their logo.
Run the same analyses as above with the randomized matrix
Compare the number of sites obtained between the RegulonDB matrix and the randomized matrix derived from it.