Abbreviation | Description |
---|---|
TF | Transcription factor |
TFBS | Transcription factor binding site |
TFBM | Transcription factor binding motif |
PWM | Position-weight matrix |
PSSM | Position-specific scoring matrix |
RSAT | Regulatory Sequence Analysis Tools |
TSS | Transcription start site |
TTS | Transcription termination site |
UTR | Untranslated region |
A regulon is the set of all the genes regulated by a given transcription factor (e.g. the LexA regulon).
An operon is a transcriptional unit encompassing several genes. The operon was initially defined by the so-called cis-trans test, and is therefore also called a polycistronic transcription unit.
Operons are very frequent in Bacteria, where they usually group genes involved in the same biological process (e.g. the his operon contains the genes coding for all the enzymes of histidine biosynthesis).
In eukaryotes, genes are generally transcribed as monocistronic units (each mRNA codes for a single polypeptide). A notable exception is found in Trypanosoma, where large polycistronic units grouping functionally unrelated genes are transcribed, and regulation occurs essentially at the post-transcriptional level.
The IUPAC code associates one letter with each possible combination of nucleotides.
Symbol | Nucleotide(s) | Description |
---|---|---|
A | A | Adenosine |
C | C | Cytidine |
G | G | Guanosine |
T | T | Thymidine |
R | A or G | puRines |
Y | C or T | pYrimidines |
W | A or T | Weak hydrogen bonding |
S | G or C | Strong hydrogen bonding |
M | A or C | aMino group at common position |
K | G or T | Keto group at common position |
H | A, C or T | not G |
B | G, C or T | not A |
V | G, A or C | not T |
D | G, A or T | not C |
N | G, A, C or T | aNy |
The consensus is a string-based representation of a motif, indicating the conserved residues in each column of a multiple alignment. The consensus is obtained by retaining, at each position of the alignment, a single residue (strict consensus) or a combination of representative residues (degenerate consensus).
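As an illustration, the sketch below derives a strict and a degenerate consensus from a small set of aligned sites. The sites, the function names, the alphabetical tie-breaking rule and the `min_freq` threshold are assumptions chosen for the example; they are not prescribed by the definition above, and other rules for selecting "representative" residues are possible.

```python
from collections import Counter

# IUPAC symbol for each set of nucleotides (keys are alphabetically sorted strings)
IUPAC_BY_SET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "AT": "W", "CG": "S",
    "AC": "M", "GT": "K", "ACT": "H", "CGT": "B",
    "ACG": "V", "AGT": "D", "ACGT": "N",
}


def most_frequent(counts):
    """Residues sharing the highest count in a column, sorted alphabetically."""
    best = max(counts.values())
    return sorted(r for r, n in counts.items() if n == best)


def strict_consensus(sites):
    """Keep a single residue per column: the most frequent one.

    Ties are broken alphabetically (an arbitrary choice made for this sketch).
    """
    return "".join(most_frequent(Counter(column))[0] for column in zip(*sites))


def degenerate_consensus(sites, min_freq=0.25):
    """Keep, per column, every residue whose relative frequency reaches min_freq,
    and encode the resulting set with the corresponding IUPAC symbol."""
    consensus = []
    for column in zip(*sites):
        counts = Counter(column)
        kept = sorted(r for r, n in counts.items() if n / len(column) >= min_freq)
        if not kept:  # threshold too strict for this column: fall back to the top residue(s)
            kept = most_frequent(counts)
        consensus.append(IUPAC_BY_SET["".join(kept)])
    return "".join(consensus)


# Hypothetical aligned sites of equal length (A, C, G, T only)
sites = ["CACGTG", "CACGTT", "CACGTG", "CACGTT"]
print(strict_consensus(sites))
print(degenerate_consensus(sites))
```

With these four sites, the last column contains G and T in equal proportions, so the strict consensus has to pick one of them arbitrarily, whereas the degenerate consensus keeps both under the symbol K.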
In the context of regulatory sequences, a consensus is typically used to summarize the common residues found in a collection of aligned binding sites for a transcription factor. The degenerate consensus is based on the IUPAC code for ambiguous nucleotides. For example, the consensus CACGTK means “CACGT” followed by either G or T.
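A degenerate consensus can be turned into a regular expression by replacing each IUPAC symbol with the set of nucleotides it stands for. The following is a minimal sketch of this idea; the function names and the test sequence are made up for the example, and the scan covers only the given strand (the reverse complement would need a separate search).

```python
import re

# Nucleotides represented by each IUPAC symbol (same content as the table above)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regular expression string."""
    parts = []
    for symbol in consensus.upper():
        residues = IUPAC[symbol]
        # Unambiguous symbols stay as they are, ambiguous ones become a character class
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "".join(parts)


def find_matches(consensus, sequence):
    """Return (0-based start, matched site) for every match, overlapping ones included."""
    # The lookahead lets the regex engine report matches that overlap each other
    pattern = re.compile(f"(?=({consensus_to_regex(consensus)}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


print(consensus_to_regex("CACGTK"))                      # CACGT[GT]
print(find_matches("CACGTK", "TTCACGTGAACACGTTCC"))      # [(2, 'CACGTG'), (10, 'CACGTT')]
```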
The statistics defined below are widely used in bioinformatics and biomedical sciences to evaluate the reliability of supervised classification and predictive approaches. In the context of this course, we will consider them from the angle of prediction: let us assume that we used an approach to predict binding sites (or target genes) for a given transcription factor, and that we compare the predictions to the “truth”, for a dataset that we consider as the gold standard.
Prediction status | Actual feature (binding site, target gene) | Actual non-feature |
---|---|---|
Positive (predicted as feature) | True positive (\(TP\)) | False positive (\(FP\)) |
Negative (predicted as non-feature) | False negative (\(FN\)) | True negative (\(TN\)) |
These four categories of predictions (\(TP\), \(FP\), \(TN\), \(FN\)) can in turn serve to define a set of derived statistics, providing complementary indications about the reliability of the predictions.
Symbol | Definition | Description |
---|---|---|
\(FP\) | False positive | Actual non-features predicted as positives |
\(FN\) | False negative | Actual features (e.g. sites, target genes) predicted as negatives |
\(TP\) | True positive | Actual features predicted as positives |
\(TN\) | True negative | Actual non-features predicted as negatives |
\(Sn\) | \(Sn = \frac{TP}{TP+FN}\) | Sensitivity: proportion of true positives (\(TP\)) among the actual features (\(TP+FN\)). Also called coverage. |
\(PPV\) | \(PPV=\frac{TP}{TP+FP}\) | Positive predictive value: proportion of true positives (\(TP\)) among the predicted features (\(TP+FP\)). |
\(FPR\) | \(FPR = \frac{FP}{TN+FP}\) | False positive rate: proportion of false positives (\(FP\)) under the null model, i.e. among the actual non-features (\(TN+FP\)). |
\(FDR\) | \(FDR = \frac{FP}{TP+FP} = 1 - PPV\) | False discovery rate: proportion of false positives (\(FP\)) among the features declared positive (\(TP+FP\)). |
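
The derived statistics can be computed directly from the four counts. The sketch below applies the formulas of the table; the function name and the counts are hypothetical, and returning `nan` when a denominator is zero is a convention chosen for this example.

```python
def prediction_stats(tp, fp, tn, fn):
    """Compute Sn, PPV, FPR and FDR from the four prediction counts."""

    def ratio(numerator, denominator):
        # Assumption of this sketch: an undefined ratio is reported as NaN
        return numerator / denominator if denominator else float("nan")

    return {
        "Sn": ratio(tp, tp + fn),    # true positives among actual features
        "PPV": ratio(tp, tp + fp),   # true positives among predicted features
        "FPR": ratio(fp, tn + fp),   # false positives among actual non-features
        "FDR": ratio(fp, tp + fp),   # false positives among predicted features (1 - PPV)
    }


# Hypothetical counts: 60 actual sites, 50 predicted sites, 40 of them correct
stats = prediction_stats(tp=40, fp=10, tn=930, fn=20)
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

With these counts, \(Sn = 40/60 \approx 0.667\), \(PPV = 40/50 = 0.8\), \(FPR = 10/940 \approx 0.011\) and \(FDR = 10/50 = 0.2\). The low \(FPR\) alongside a much higher \(FDR\) shows why the two statistics should not be confused: they use the same numerator but different denominators.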