### Abbreviations

Abbreviation Description
TF Transcription factor
TFBS Transcription factor binding site
TFBM Transcription factor binding motif
PWM Position-weight matrix
PSSM Position-specific scoring matrix
RSAT Regulatory Sequence Analysis Tools
TSS Transcription start site
TTS Transcription termination site
UTR Untranslated region

### Regulon

A regulon is the set of all the genes regulated by a given transcription factor (e.g. the LexA regulon).

### Operon

An operon is a transcriptional unit encompassing several genes. The operon was initially defined by the so-caleed cis-trans test, and is therefore also called polycistronic transcription unit.

Operons are very frequent in Bacteria, where they usually regroup sets of genes involved in a same biological process (e.g. the his operon contains the genes required to code for all the enzymes of histidine biosynthesis).

In eukaryotes, genes ere generally transcribed as mono-cistronic units (an mRNA only codes for a single polypeptide). A notable exception if found in Trypanosoma, where large polycistronic units are transcribed, which regroup functionally unrelated genes, and regulation occurs essentially at the post-transcriptional level.

### IUPAC code for ambiguous nucleotides

The IUPAC code associates one letter to each possible combination of nucleotides.

Symbol Nucleotide(s) Description
C C Cytidine
G G Guanosine
T T Thymidine
R A or G puRines
Y C or T pYrimidines
W A or T Weak hydrogen bonding
S G or C Strong hydrogen bonding
M A or C aMino group at common position
K = G or T Keto group at common position
H = A, C or T not G
B = G, C or T not A
V = G, A, C not T
D = G, A or T not C
N = G, A, C or T aNy

### Consensus (strict or degenerate)

The consensus is a string-based representation of a motif, indicating the conserved residues in each column of a multiple alignment. The consensus is obtained by retaining, at each position of the alignment, a single residue (strict consensus) or a combination of representative residues (degenerate consensus).

In the context of regulatory sequences, a consensus is typically used to synthesize the common residues found in a collection of aligned binding sites for a transcription factor. The degenerate consensus is based on the IUPAC code for ambiguous nucleotides. For example, the consensus CACGTK means “CACTG” followed by either T or G.

### Evaluation statistics

The statistics defined below are of very wide use in bioinformatics and biomedical sciences. They are used to evaluate the reliability of supervised classification and predictive approaches. In the context of this course, we will consider them under the angle of the prediction: let us assume that we used an approach to predict binding sites – or target genes – for a given transcription factor, and we compare the prediction to the “truth”, for a dataset that we consider as the golden standard.

Prediction status Actual feature
(binding site, target gene)
Actual non-feature
Positive
(predicted as feature)
True positive ($$TP$$) False positive ($$FP$$)
Negative
(predicted as non-feature)
False negative ($$FN$$) True negative ($$TN$$)

These four categories of predictions ($$TP$$, $$FP$$, $$TN$$, $$FN$$) can in turn serve to define a set of derived statistics, providing complementary indications about the reliability of the predictions.

Symbol Definition Description
$$FP$$ False positive Actual features (e.g. sites, target genes) predicted as positives
$$FN$$ False negative Actual non-features predicted as negatives
$$TP$$ True positive Actual features predicted as positives
$$TN$$ True negative Actual non-features predicted as negatives
$$Sn$$ $$Sn = \frac{TP}{TP+FN}$$ Sensitivity: proportion of true positives ($$TP$$) among the actual features ($$TP+FN$$). Also called coverage.
$$PPV$$ $$PPV=\frac{TP}{TP+FP}$$ Positive predictive value: proportion of true positives ($$TP$$) among the predicted features ($$TP+FP$$).
$$FPR$$ $$FPR = \frac{FP}{TN+FP}$$ False positve rate: proportion of false positive under the null model, i.e. among the actual non-features
$$FDR$$ $$FDR = \frac{FP}{TP+FP} = 1 - PPV$$ False discovery rate: proportion of false positive among the features declared positive