Cis-regulation - Concept definitions

Abbreviations
Regulon
Operon
IUPAC code for ambiguous nucleotides
Consensus (strict or degenerate)
Evaluation statistics
To be added
- TF binding site
- TF binding motif

Abbreviations

Abbreviation	Description
TF	Transcription factor
TFBS	Transcription factor binding site
TFBM	Transcription factor binding motif
PWM	Position-weight matrix
PSSM	Position-specific scoring matrix
RSAT	Regulatory Sequence Analysis Tools
TSS	Transcription start site
TTS	Transcription termination site
UTR	Untranslated region

Regulon

A regulon is the set of all the genes regulated by a given transcription factor (e.g. the LexA regulon).

Operon

An operon is a transcriptional unit encompassing several genes. The operon was initially defined by the so-called cis-trans test, and is therefore also called a polycistronic transcription unit.

Operons are very frequent in Bacteria, where they usually group sets of genes involved in the same biological process (e.g. the his operon contains the genes coding for all the enzymes of histidine biosynthesis).

In eukaryotes, genes are generally transcribed as mono-cistronic units (an mRNA only codes for a single polypeptide). A notable exception is found in Trypanosoma, where large polycistronic units are transcribed. These units group functionally unrelated genes, and regulation occurs essentially at the post-transcriptional level.

IUPAC code for ambiguous nucleotides

The IUPAC code associates one letter with each possible combination of nucleotides.

Symbol	Nucleotide(s)	Description
A	A	Adenosine
C	C	Cytidine
G	G	Guanosine
T	T	Thymidine
R	A or G	puRines
Y	C or T	pYrimidines
W	A or T	Weak hydrogen bonding
S	G or C	Strong hydrogen bonding
M	A or C	aMino group at common position
K	G or T	Keto group at common position
H	A, C or T	not G
B	G, C or T	not A
V	G, A or C	not T
D	G, A or T	not C
N	G, A, C or T	aNy
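
The table above maps directly onto a lookup structure. The following is a minimal Python sketch, not part of any existing tool: the dictionary `IUPAC` and the helper `iupac_to_regex` are hypothetical names introduced here for illustration. It translates an IUPAC string into a regular expression that can be used to scan a sequence.

```python
import re

# IUPAC symbols mapped to the nucleotides they stand for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC string into a regular expression (hypothetical helper)."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        # Unambiguous symbols stay as they are; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iupac_to_regex("CACGTK")
print(regex)  # CACGT[GT]

# Scan an arbitrary illustrative sequence (not real data) for non-overlapping matches.
matches = [m.start() for m in re.finditer(regex, "TTCACGTGAACACGTTCC")]
print(matches)  # [2, 10]
```

This sketch scans a single strand only and reports non-overlapping matches. Scanning the reverse complement as well would be needed for most regulatory motifs.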

Consensus (strict or degenerate)

The consensus is a string-based representation of a motif, indicating the conserved residues in each column of a multiple alignment. The consensus is obtained by retaining, at each position of the alignment, a single residue (strict consensus) or a combination of representative residues (degenerate consensus).

In the context of regulatory sequences, a consensus is typically used to summarize the common residues found in a collection of aligned binding sites for a transcription factor. The degenerate consensus is based on the IUPAC code for ambiguous nucleotides. For example, the consensus CACGTK means “CACGT” followed by either G or T.
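
As an illustration, both kinds of consensus can be derived from a set of aligned sites with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the sites are made-up strings of equal length, and the rule used for the degenerate consensus (keep every residue whose column frequency is at least `min_freq`) is one possible convention among others.

```python
from collections import Counter

# IUPAC symbols and the nucleotides they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}
# Reverse lookup: set of nucleotides -> IUPAC symbol.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def strict_consensus(sites):
    """Keep the most frequent residue of each column (ties: first residue seen)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


def degenerate_consensus(sites, min_freq=0.25):
    """Keep, per column, all residues with frequency >= min_freq (assumed rule).

    With min_freq <= 0.25 at least one of the four residues is always kept.
    """
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        kept = frozenset(b for b, n in counts.items() if n / len(column) >= min_freq)
        consensus.append(SYMBOL_FOR[kept])
    return "".join(consensus)


# Hypothetical aligned sites, for illustration only.
sites = ["CACGTG", "CACGTT", "CACGTG", "CACGTG", "CACGTT"]
print(strict_consensus(sites))      # CACGTG
print(degenerate_consensus(sites))  # CACGTK
```

In the last column G occurs three times and T twice, so the strict consensus retains G whereas the degenerate consensus retains both residues and writes K.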

Evaluation statistics

The statistics defined below are widely used in bioinformatics and biomedical sciences to evaluate the reliability of supervised classification and predictive approaches. In the context of this course, we consider them from the angle of prediction: let us assume that we used an approach to predict binding sites (or target genes) for a given transcription factor, and that we compare the predictions to the “truth”, given by a dataset that we consider as the gold standard.

Prediction status	Actual feature (binding site, target gene)	Actual non-feature
Positive (predicted as feature)	True positive (\(TP\))	False positive (\(FP\))
Negative (predicted as non-feature)	False negative (\(FN\))	True negative (\(TN\))
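
The four cells of this table can be counted with set operations. The sketch below assumes that predictions and annotations are available as collections of identifiers, and that the full list of candidates (the "universe") is known. The function name and the gene identifiers are hypothetical.

```python
def confusion_counts(predicted, actual, universe):
    """Count TP, FP, FN and TN from sets of identifiers (hypothetical helper)."""
    predicted, actual, universe = set(predicted), set(actual), set(universe)
    return {
        "TP": len(predicted & actual),             # actual features predicted as positives
        "FP": len(predicted - actual),             # actual non-features predicted as positives
        "FN": len(actual - predicted),             # actual features predicted as negatives
        "TN": len(universe - predicted - actual),  # actual non-features predicted as negatives
    }


# Made-up example: 10 candidate genes, 4 annotated targets, 5 predicted targets.
universe = ["g%d" % i for i in range(1, 11)]
actual = ["g1", "g2", "g3", "g4"]
predicted = ["g1", "g2", "g3", "g5", "g6"]

counts = confusion_counts(predicted, actual, universe)
print(counts)  # {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 4}
```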

These four categories of predictions (\(TP\), \(FP\), \(TN\), \(FN\)) can in turn serve to define a set of derived statistics, providing complementary indications about the reliability of the predictions.

Symbol	Definition	Description
\(FP\)	False positive	Actual non-features predicted as positives
\(FN\)	False negative	Actual features (e.g. sites, target genes) predicted as negatives
\(TP\)	True positive	Actual features predicted as positives
\(TN\)	True negative	Actual non-features predicted as negatives
\(Sn\)	\(Sn = \frac{TP}{TP+FN}\)	Sensitivity: proportion of true positives (\(TP\)) among the actual features (\(TP+FN\)). Also called coverage.
\(PPV\)	\(PPV=\frac{TP}{TP+FP}\)	Positive predictive value: proportion of true positives (\(TP\)) among the predicted features (\(TP+FP\)).
\(FPR\)	\(FPR = \frac{FP}{TN+FP}\)	False positive rate: proportion of false positives under the null model, i.e. among the actual non-features (\(TN+FP\)).
\(FDR\)	\(FDR = \frac{FP}{TP+FP} = 1 - PPV\)	False discovery rate: proportion of false positives among the features declared positive (\(TP+FP\)).
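
The derived statistics follow directly from the four counts. The sketch below applies the formulas of the table to hypothetical counts. The function names are illustrative, and returning `None` when a denominator is zero is a choice made here, not a convention stated in this document.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def evaluation_statistics(tp, fp, fn, tn):
    """Compute Sn, PPV, FPR and FDR from the four prediction counts."""
    return {
        "Sn": ratio(tp, tp + fn),   # sensitivity
        "PPV": ratio(tp, tp + fp),  # positive predictive value
        "FPR": ratio(fp, tn + fp),  # false positive rate
        "FDR": ratio(fp, tp + fp),  # false discovery rate
    }


# Hypothetical counts, for illustration only.
stats = evaluation_statistics(tp=3, fp=2, fn=1, tn=4)
print(" ".join("%s=%.2f" % (name, value) for name, value in stats.items()))
# Sn=0.75 PPV=0.60 FPR=0.33 FDR=0.40
```

With these counts the predictions recover three of the four actual features (\(Sn = 0.75\)), while two of the five predicted features are wrong (\(FDR = 0.40 = 1 - PPV\)).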

Jacques van Helden

2015-10-26
