Solutions - Scanning non-coding sequences with a TFBM

Solutions to the exercises

This file provides the solutions to the practical "Scanning non-coding sequences with a TFBM", using the command-line tools of the RSAT software suite.

Reference genome

Collective table for the 2020 practical

Students store their results in a shared spreadsheet, which is then used to compare the results obtained with different transcription factors and get a broader picture.

Folder: https://tinyurl.com/lcg-beii-2020
Motif scanning exercise:

On your computer, create a folder to store the results of this practical, for example: $HOME/LCG_BEII_practicals/ (you can change the path and name according to your own organisation of folders).

Choosing a TF on RegulonDB

For this exercise, I chose the transcription factor AraC.

I define this in an environment variable. I also define and create a specific directory for the results related to this transcription factor.

## Define the reference organism
export ORG=Escherichia_coli_GCF_000005845.2_ASM584v2

## Choose a transcription factor (TF) of interest
export TF=AraC
export RESULT_DIR=results/${TF}
mkdir -p ${RESULT_DIR}
## Stay in the practical folder: the paths used below (results/AraC/...) are relative to it

I use the REST Web services to automatically gather the annotations from RegulonDB. REST Web services make it possible to query a remote resource (database, software tool) by composing a URL from an entry point (which specifies the type of query) and a set of parameters separated by &.
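
As a minimal sketch of this composition, the snippet below assembles a query URL from a base address, an entry point and a list of key=value parameters joined with &. The variable names (BASE_URL, ENTRY_POINT, PARAMS, QUERY) are chosen here for illustration only; the address, entry point and parameters are those used for the binding sites in this practical.

```shell
## Compose a REST query URL from its parts (illustrative sketch)
TF=AraC
BASE_URL='http://regulondb.ccg.unam.mx/webresources'
ENTRY_POINT='tools/getTFBS'            # entry point: type of query
PARAMS="tfObject=${TF} extended=0"     # key=value pairs, space-separated here

## Join the parameters with "&" and append them after "?"
QUERY=$(printf '%s' "${PARAMS}" | tr ' ' '&')
URL="${BASE_URL}/${ENTRY_POINT}?${QUERY}"
echo "${URL}"
```

When such a URL is passed to a download tool, it should be quoted so that the shell does not interpret & as the background operator.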

For example, the list of genes regulated by AraC can be found at the following URL:

http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC

The result can then be stored in a file with a command-line download tool, such as curl or wget.

## Get the annotated binding sites from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getTFBS?tfObject=AraC&extended=0' \
   > results/AraC/AraC_RegulonDB_sites_ext0.tsv

## Get the annotated position-specific scoring matrix from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getPSSM?tfObject=AraC' \
   > results/AraC/AraC_RegulonDB_PSSM.tab

## Get the annotated target genes from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC' \
    > results/AraC/AraC_RegulonDB_genes.tab

Computing the degenerate consensus from the reference matrix

The degenerate consensus can be computed with convert-matrix with the appropriate parameters. Since it is printed as a comment (rows starting with ;), we can extract its actual value with grep and cut.

convert-matrix -v  -i results/AraC/AraC_RegulonDB_PSSM.tab \
  -from tab -to tab -return consensus \
  | grep $'^; consensus\t' \
  | cut -f 2 \
  > results/AraC/AraC_RegulonDB_consensus.tab
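
To see how the grep and cut filter isolates one value from tab-separated comment lines, here is a small sketch on mock text. The lines are invented for illustration and are not actual convert-matrix output; the only assumption is a layout of the form "; key", a tab, then the value. The tab is produced with printf because a plain '\t' inside a grep pattern is not reliably interpreted as a tab.

```shell
## Mock comment lines (hypothetical content, NOT real convert-matrix output)
mock=$(printf '; consensus\tTAGCnnnnGCTA\n; other.field\t42\n1\t2\t3')

## A literal tab character, usable inside the grep pattern
tab=$(printf '\t')

## Keep the comment line whose key is exactly "consensus", then its 2nd field
consensus=$(printf '%s\n' "$mock" | grep "^; consensus${tab}" | cut -f 2)
echo "$consensus"
```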

Getting all upstream (“promoter”) sequences of E. coli

## Define an environment variable with the file containing all upstream sequences
export ALLUP=results/AraC/Escherichia_coli_GCF_000005845.2_ASM584v2_all_upstream-noorf.fasta

## Retrieve all upstream sequences
retrieve-seq -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
  -from -1 -to -400 -noorf -all -label name \
  -o ${ALLUP}
  
## Check the result (type "q" to quit the "less" command)
less ${ALLUP}

Coverage of the annotated binding sites by the reference motif

Use dna-pattern to scan the annotated binding sites with a consensus

## Scan annotated TFBSs with degenerate consensus
dna-pattern -v 1 \
  -i ${ALLUP} \
  -pl results/AraC/AraC_RegulonDB_consensus.tab \
  -o results/AraC/TFBS_matches_with_deg-consensus_AraC.ft

## Check the result
less results/AraC/TFBS_matches_with_deg-consensus_AraC.ft

Choosing a background model for matrix-scan

To scan sequences with a matrix, we need to specify a background model. We can either compute it from the input sequences themselves (option -bginput) or specify a predefined background model file (option -bgfile).

Pre-computed background models are available in RSAT for each organism, with different parameters:

- oligonucleotides or dyads,
- k-mer length,
- frequencies counted on a single strand or on both strands,
- self-overlaps accepted or not for periodic patterns.
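
As a rough illustration of what estimating a background model from the input sequences means, the sketch below counts single-nucleotide frequencies (the simplest case, a Bernoulli model) in a toy set of sequences. The sequences are invented, and this is a simplified stand-in for the idea rather than a reproduction of what matrix-scan computes, which can rely on higher-order Markov models.

```shell
## Toy sequences in FASTA format (invented for this sketch)
toy_fasta=$(printf '>seq1\nACGTAC\n>seq2\nGGTTAA')

## Count each residue on the sequence lines and convert counts to frequencies
bg=$(printf '%s\n' "$toy_fasta" | awk '
  !/^>/ {
    seq = toupper($0)
    for (i = 1; i <= length(seq); i++) { count[substr(seq, i, 1)]++; total++ }
  }
  END {
    n = split("A C G T", residues, " ")
    for (j = 1; j <= n; j++)
      printf "%s\t%.3f\n", residues[j], count[residues[j]] / total
  }')
echo "$bg"
```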

Use matrix-scan to scan the annotated binding sites with a PSSM

## Get the list of recovered target genes
## We sort with option -u (unique) because some genes may have several predicted binding sites
grep -v '^;' results/AraC/TFBS_matches_with_deg-consensus_AraC.ft \
  | cut -f 4 | sort -u \
  > results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt
cat results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt

Compare the coverage rate of the two motifs
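
One possible way to quantify the coverage is the fraction of annotated target genes that are recovered among the predicted ones. The sketch below computes it with comm on two sorted gene lists; the gene names are invented, and the one-name-per-line layout is an assumption (the RegulonDB gene file may need to be reduced to such a list first). Running it once per motif gives two rates that can be compared.

```shell
## Invented gene lists, one name per line (not real RegulonDB content)
workdir=$(mktemp -d)
printf 'geneA\ngeneB\ngeneC\ngeneD\ngeneE\n' | LC_ALL=C sort -u > "$workdir/annotated.txt"
printf 'geneB\ngeneC\ngeneX\n' | LC_ALL=C sort -u > "$workdir/predicted.txt"

## Annotated genes that are also predicted (comm requires sorted input)
recovered=$(LC_ALL=C comm -12 "$workdir/annotated.txt" "$workdir/predicted.txt" | wc -l | tr -d ' ')
annotated=$(wc -l < "$workdir/annotated.txt" | tr -d ' ')

## Coverage rate = recovered annotated genes / all annotated genes
coverage=$(awk -v r="$recovered" -v a="$annotated" 'BEGIN { printf "%.2f", r / a }')
echo "Recovered ${recovered} of ${annotated} annotated genes (coverage = ${coverage})"
rm -rf "$workdir"
```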

Binding site prediction in all promoters

Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E. coli.
For matrix-scan, run the analysis with a p-value threshold of either 0.001 or 0.0001.
Compare the number of matches obtained in these respective searches.
With the respective p-values used for the scanning, how many matches would you expect by chance?
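
A back-of-the-envelope estimate of the number of matches expected by chance is the p-value threshold multiplied by the number of scanned positions. The sketch below assumes that a sequence of length L scanned on both strands with a motif of width w offers 2(L - w + 1) positions. The toy sequences and the width of 4 are made up for illustration, so the figures have to be recomputed from the actual upstream sequences and matrix width.

```shell
## Toy upstream sequences (invented) and a hypothetical motif width
toy_fasta=$(printf '>up1\nACGTACGTAC\n>up2\nGGTTAACC\nGGTT')
width=4

## Count the positions where the motif can be placed, on both strands
positions=$(printf '%s\n' "$toy_fasta" | awk -v w="$width" '
  function add_seq() { if (len >= w) total += 2 * (len - w + 1); len = 0 }
  /^>/ { add_seq(); next }     # new header: close the previous sequence
  { len += length($0) }        # sequence line: accumulate its length
  END { add_seq(); print total + 0 }')

## Expected number of chance matches for each p-value threshold
for pval in 0.001 0.0001; do
  awk -v p="$pval" -v n="$positions" \
    'BEGIN { printf "p-value %s: %.4f matches expected among %d positions\n", p, p * n, n }'
done
```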

Negative control 1: scan artificial sequences with your motif

RSAT random sequences

Negative control 2: permute the columns of the matrix

Use the tool permute-matrix in order to generate 10 randomized copies of the motif.
Send these randomized matrices to convert-matrix and check their logos.
Run the same analyses as above with the randomized matrices.
Compare the number of sites obtained between the RegulonDB matrix and the randomized matrices derived from it.
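
To make the principle of this negative control concrete, the sketch below shuffles the columns of a toy count matrix with awk: each column is kept intact and only the order of the positions changes. It works on invented counts with a fixed seed and is meant as an illustration only, not as a replacement for permute-matrix, whose actual implementation may differ.

```shell
## Toy count matrix (invented): one row per residue, one column per position
toy_matrix=$(printf 'A\t8\t0\t1\t5\nC\t0\t9\t1\t2\nG\t1\t0\t7\t2\nT\t1\t1\t1\t1')

permuted=$(printf '%s\n' "$toy_matrix" | awk -F '\t' -v OFS='\t' -v seed=42 '
  {
    # Store the row label and all the cells of the matrix
    nrow = NR; ncol = NF - 1; label[NR] = $1
    for (c = 1; c <= ncol; c++) cell[NR, c] = $(c + 1)
  }
  END {
    srand(seed)
    for (c = 1; c <= ncol; c++) order[c] = c
    # Fisher-Yates shuffle of the column order
    for (c = ncol; c > 1; c--) {
      r = int(rand() * c) + 1
      tmp = order[c]; order[c] = order[r]; order[r] = tmp
    }
    # Print every row with the same permuted column order
    for (i = 1; i <= nrow; i++) {
      line = label[i]
      for (c = 1; c <= ncol; c++) line = line OFS cell[i, order[c]]
      print line
    }
  }')
echo "$permuted"
```

Because the same column order is applied to every row, the set of columns is unchanged; only their positions in the motif differ.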