This file provides the solutions to the practical "Scanning non-coding sequences with a TFBM", using the RSAT software suite from the command line.
Students will store their results in a shared spreadsheet, which will be used to compare the results obtained with different transcription factors and thus get a broader picture.
On your computer, create a folder to store the results of this practical, for example $HOME/LCG_BEII_practicals/
(you can change the path and name according to your own folder organisation).
For this exercise, I chose the transcription factor AraC.
I define this in an environment variable. I also define and create a specific directory for the results related to this transcription factor.
## Define the reference organism
export ORG=Escherichia_coli_GCF_000005845.2_ASM584v2
## Choose a transcription factor (TF) of interest
export TF=AraC
export RESULT_DIR=results/${TF}
mkdir -p ${RESULT_DIR}
## All commands below are run from the current directory, with paths starting with results/${TF}/
I use REST Web services to automatically gather the annotations from RegulonDB. REST Web services make it possible to remotely invoke a resource (database, software tool) by composing a URL from an entry point (which specifies the type of query) and a set of parameters separated by &.
For example, the list of genes regulated by AraC can be found at the following URL.
http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC
The result can then be stored in a file with a command-line download tool, such as curl or wget.
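The composition of such a URL can be sketched as follows. The entry point and parameters are those of the RegulonDB queries used in this practical; the variable names BASE_URL, ENTRY_POINT and QUERY_URL are only chosen for this illustration, and the curl call is left commented so that the sketch runs without network access.

```shell
## Entry point and parameters taken from the URLs used in this practical
BASE_URL='http://regulondb.ccg.unam.mx/webresources'
ENTRY_POINT='tools/getTFBS'
TF=AraC

## Compose the query: entry point, then "?" and the parameters joined with "&"
QUERY_URL="${BASE_URL}/${ENTRY_POINT}?tfObject=${TF}&extended=0"
echo "${QUERY_URL}"

## The URL must be quoted when passed to curl, otherwise the shell
## interprets "&" as a request to run the command in the background.
## Uncomment to send the query (output file name is just an example):
# curl "${QUERY_URL}" > "${TF}_RegulonDB_sites_ext0.tsv"
```

Defining the factor name once in TF makes it easy to rerun the same queries for another transcription factor.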
## Get the annotated binding sites from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getTFBS?tfObject=AraC&extended=0' \
> results/AraC/AraC_RegulonDB_sites_ext0.tsv
## Get the annotated position-specific scoring matrix from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/tools/getPSSM?tfObject=AraC' \
> results/AraC/AraC_RegulonDB_PSSM.tab
## Get the annotated target genes from RegulonDB
curl 'http://regulondb.ccg.unam.mx/webresources/regulon/getRegulatedGenes?tfObject=AraC' \
> results/AraC/AraC_RegulonDB_genes.tab
The degenerate consensus can be computed with convert-matrix with the appropriate parameters. Since it is printed as a comment (rows starting with ;) we can extract its actual value with grep and cut.
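The grep and cut step can be sketched on a small made-up text. The row labels, the tab-separated layout and the consensus strings below are assumptions for illustration only; they are not actual convert-matrix output, so the pattern and the column number should be adapted after inspecting the real file.

```shell
## Hypothetical excerpt: two comment rows (starting with ';') holding
## tab-separated fields, followed by one non-comment row.
## Labels, layout and sequences are invented for this sketch.
DEMO=$(printf ';\tconsensus.strict\tACGTTTACGT\n;\tconsensus.IUPAC\tACGwwwACGT\nA\t1\t2\n')

## Keep the comment rows, select the row of interest,
## then cut the third tab-separated field (tab is the default delimiter of cut)
CONSENSUS=$(printf '%s\n' "${DEMO}" | grep '^;' | grep -F 'consensus.IUPAC' | cut -f 3)
echo "${CONSENSUS}"
```

Storing the extracted value in a variable makes it possible to reuse it directly as the pattern in a later command.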
## Define an environment variable with the file containing all upstream sequences
export ALLUP=results/AraC/Escherichia_coli_GCF_000005845.2_ASM584v2_all_upstream-noorf.fasta
## Retrieve all upstream sequences
retrieve-seq -org Escherichia_coli_GCF_000005845.2_ASM584v2 \
-from -1 -to -400 -noorf -all -label name \
-o ${ALLUP}
## Check the result (type "q" to quit the "less" command)
less ${ALLUP}
To scan sequences with a matrix, we need to specify a background model. We can either compute it from the input sequences themselves (option -bginput) or specify a predefined background model file (option -bg_file).
Pre-computed background models are available in RSAT for each organism, with different parameters:
- oligonucleotides or dyads,
- k-mer length,
- frequencies counted on a single strand or on both strands,
- self-overlaps accepted or not for periodic patterns.
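To give an idea of what "computing the background from the input sequences" means in the simplest case, the sketch below counts residue frequencies (k-mers of length 1, single strand) in a toy FASTA input using standard Unix tools. The two sequences are made up, and this is only meant to show the principle, not to reproduce the computation done by RSAT; for actual scans, rely on matrix-scan or on the pre-computed background files.

```shell
## Toy FASTA input (made-up sequences, for illustration only)
TOY_FASTA=$(printf '>seq1\nACGTAA\n>seq2\nTTGACA\n')

## Drop the header rows, put one residue per line, count each residue,
## then convert the counts to relative frequencies
BG_FREQ=$(printf '%s\n' "${TOY_FASTA}" \
  | grep -v '^>' | tr -d '\n' | fold -w 1 | sort | uniq -c \
  | awk '{count[$2] = $1; total += $1}
         END {for (r in count) printf "%s\t%d\t%.4f\n", r, count[r], count[r] / total}' \
  | sort)

## Columns: residue, occurrences, frequency
echo "${BG_FREQ}"
```

Higher-order models follow the same logic with longer k-mers, which is why the k-mer length is one of the parameters of the pre-computed files.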
## Get the list of recovered target genes
## We sort with option -u (unique) because some genes may have several predicted binding sites
grep -v '^;' results/AraC/allup_matches_with_deg-consensus_AraC.ft \
| cut -f 4 | sort -u \
> results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt
cat results/AraC/TFBS_matches_with_deg-consensus_AraC_genes.txt
Use the same tools (dna-pattern and matrix-scan) to predict binding sites in all the promoters of E. coli.
For matrix-scan, run the analysis with a p-value threshold of either 0.001 or 0.0001.
Compare the number of matches obtained in these respective searches.
With the respective p-values used for the scanning, how many matches would you expect by chance?
Use the tool permute-matrix in order to generate 10 randomized copies of the motif.
Send these randomized matrices to convert-matrix and check their logos.
Run the same analyses as above with the randomized matrices.
Compare the number of sites obtained between the RegulonDB matrix and the randomized matrix derived from it.
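For the question about the matches expected by chance, a rough estimate is the number of scored positions multiplied by the p-value threshold. The sketch below applies this with hypothetical values for the number of sequences and the matrix width, and assumes that both strands are scanned; since the sequences were retrieved with -noorf, many are shorter than 400 bp, so the result is an upper bound.

```shell
## Hypothetical values, to be replaced by those of your own analysis
NSEQ=1000      # number of upstream sequences (e.g. grep -c '^>' ${ALLUP})
SEQLEN=400     # maximal sequence length (upstream region of 400 bp)
WIDTH=20       # number of columns of the matrix
STRANDS=2      # assumption: both strands are scanned

## Number of scored positions: one per possible start of the motif, per strand
POSITIONS=$(( NSEQ * (SEQLEN - WIDTH + 1) * STRANDS ))
echo "Scored positions: ${POSITIONS}"

## Expected matches by chance = scored positions x p-value threshold
for PVAL in 0.001 0.0001; do
  awk -v n="${POSITIONS}" -v p="${PVAL}" \
    'BEGIN {printf "p-value %s\texpected matches %.1f\n", p, n * p}'
done
```

Comparing these expected numbers with the observed numbers of matches, for both the RegulonDB matrix and the randomized matrices, indicates how much of the prediction can be attributed to chance.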