Problem

DNA is composed of 4 nucleotides denoted by the letters \(A\), \(C\), \(G\), \(T\). Proteins are made of 20 amino acids.

For each one of these two types of macromolecules, how many distinct oligomers can be formed by polymerizing 30 residues (“30-mers”) ?

Suggested approach: start by addressing a simpler form of the same problem, by starting with polymers of much smaller sizes: 1, then 2 residues, …
Generalize the formula for oligomers of an arbitrary size \(k\) (so-called k-mers in the domain), made of \(n\) distinct residues.
What is the name of the function resulting from this analysis?
In this process, which mode did you use to pick up the residues: with or without replacement?

Solution: enumeration of oligomers

The underlying process is a drawing with replacement: at each position of the sequence, we can choose any of the \(n\) residues (\(n=4\) for nucleotidic sequences, \(n=20\) for peptidic sequences).
Progressive approach of the solution
- Trivial case: single-residue sequence \(\rightarrow\) there are exactly \(n\) possibilities.
- Two-residue sequences: for each of the \(n\) possible residues at the first position, we can select \(n\) resodies for the second one \(\rightarrow\) there are \(n \cdot n = n^2\) possible dimers.
- Trimers: for each of these dimers, there are \(n\) possible residues that can be chosen for the\(3^{d}\) position \(\rightarrow\) there are \(n^2 \cdot n = n^3\) distinct trinucleotides.
Generalization to \(k\)-mers: there are \(n^k\) distinct sequences of size \(k\).
In our case, the sequence size was \(k=30\). We have thus
- \(N = n^k = 4^{30} = 1.15\times 10^{18}\) distinct oligonucleotides sequences
- \(N = n^k = 20^{20} = 1.07\times 10^{39}\) distinct oligopeptide sequences
If we consider the succession of numbers obtained for increasing oligomer sizes \(k=1, 2, \cdot\) we observe a geometric progression.

The geometric progression

The geometric progression is a succession of numbers where each term can be computed by multiplying the previous one by a constant factor.

\[x_i = x_{i-1} \cdot n\]

For a large size of \(k\) the formula can be developed.

\[\begin{aligned} x_k &= x_{k-1} \cdot n \\ &= (x_{k-2} \cdot n) \cdot n = x_{k-2} \cdot n^2 \\ &= x_{k-3} \cdot n^3 = \ldots = x_0 \cdot n^k \end{aligned}\]

In our case, the initial value is \(x_0=1\); \(k\) denotes the oligomer size, and \(n\) is the number of distinct residues used to form the oligomer (\(n=4\) for nucleic acids, \(n=20\) for amino acids).

Number of oligomers

Number of possible oligonucleotides (top) and oligopeptides (bottom) with either a linear (left) and logarithmic (right) scale for the ordinate.

Exercise 02.1: oligomers with no repeated residue

How many oligomers can be formed (DNA or peptides) that would contain exactly once each residue.

Suggested approach: progressively aggregate the residues whilst wondering, at each step, bow many residues have not yet been incorporated in the sequence.

Sub-questions:

Generalize the formula for sequences of items of any type, drawn from a set of arbitrary size \(n\).
What is the name of the corresponding function?
In this process, what is the mode of residue selection: with or without replacement?

Solution: oligomers with no repeated residue

First residue: \(n\) possibilities.
As soon as the first residue has been chosen, there are only \(n-1\) possibilities to draw a different residue for the second position. We thus have \(n \cdot (n-1)\) possible sequences for the first two residues.
For the third position, there are only \(n-2\) residues left; We thus have \(n \cdot (n-1) \cdot (n-2)\) possibilities for the 3 first positions of the sequence.
By extension of this reasoning, the total number of possible solutions (assuming \(n\) is not too small) will be

\[n! = n \cdot (n-1) \cdot \ldots \cdot 2 \cdot 1\]

In our case:
- \(n! = 4! = 24\) oligonuclEotides includinf each nucleotide exactly once (size 4).
- \(n! = 20! = 2.43\times 10^{18}\) oligopeptides (size 20).

The factorial function

Enumerates the number of possible permutations of a finite set of items
Drawing without replacement
Defined by a recursive formula

\[N = n! = \left\{ \begin{array}{ll} 1 & \text{if } n=0 \\ n \cdot (n-1)! &\text{otherwise} \end{array} \right.\]

Note: by definition, \(0! = 1\), which enables to compute \(1!\) and the subsequent numbers with the recursive formula.

For sufficiently large values of \(n\), a clearer formulation is

\[N = n \cdot (n-1) \cdot (n-2) \ldots 2 \cdot 1\]

Exercise 02.2: gene lists (with order)

A transcriptome experiment has been led to define the level of expression of all the yeast genes. Knowing that the genome contains \(6000\) genes, how many possible ways are there to select the \(15\) most expressed genes with their relative order?

Suggested approach: as previously, simplify the problem by starting from the minimal selection, and progressively increase the number of selected genes (1 gene, 2 genes, …).

Complementary questions:

Give the example of a familiar bet game related to this enumerating process.
Generalize the formula for any selection of a list of \(x\) items in a set containing \(n\) elements.

Solution 02.2: (ordered) lists of genes

This is a selection without replacement (indeed, each gene appears at one and only one position in the list of all genes), and ordered (a list with the same genes taken in a different orders would be considered as a different result).

For the first gene, there are \(n=6000\) possibilities.
As soon as the first gene has been defined, there are only \(5999\) possibilities left for the second gene. There are thus \(n \cdot (n-1) = 6000 \cdot 5999\) possibilities for the succession of the two first genes;
By extension, there are \(6000 \cdot 5999 \cdot 5998 \cdot \ldots \cdot 5986 = 4.62\times 10^{56}\) possibilities for the first 15 genes.
If we generalize this reasoning to the lists of the \(x\) first genes in a set of \(n\), we obtain \(N = n \cdot (n-1) \cdot (n-2) \cdot ... \cdot (n-x+1)\).

Note that this can be represented by a more compact formula.

\(N = n \cdot (n-1) \cdot (n-2) \cdot ... \cdot (n-x+1) = \frac{n!}{(n-x)!}\)

Arrangements

In combinatorics, the term arrangement denotes an orderless drawing without replacement, i.e. random drawing where the order of the item is taken in consideration, and where each already selected item cannot be selected as next element.

Number of arrangements of \(x\) items drawn in a set of size \(n\).

\[\begin{array}{ccl} A^x_n & = & \frac{n!}{(n - x)!} \\ & = & \frac{n(n-1) \ldots (n-x +1) (n - x) (n-x-1) \ldots 2 \cdot 1}{(n - x) (n-x-1) \ldots 2 \cdot 1} \\ & = & n \cdot (n-1) \cdot \ldots \cdot (n-x+1) \end{array} \]

Arrangements – Typical application

tricast (also named trifecta).
A bet where players must predict the three winner horses (\(x=3\)) of a race, and the exact order of their arrival. For \(n=15\) horses, there are \(n \cdot (n-1) \cdot (n-2) = 15 \cdot 14 \cdot 13 = 2730\) possibilities.

Exercise 02.3: unordered sets of genes

A transcriptomics experiment has been led to measure the level of expression of all yeast genes. Knowing that the genome contains \(6000\) genes, how many possibilities are there to select the \(15\) genes with the highest expression level without taking into account the relative order of those 15 genes?

Suggested approach: as previously, simplify the problem by starting from minimal selections (1 gene, 2 genes, …) and then generalize the formula.

Complementary questions:

Find an example of familiar bet game related to this problem.
Generalize the formula for the selection of a set of \(x\) genes from a genome containing \(n\) genes.
Do you know the name of the formula that resulted from this reasoning?

Solution 02.3: unordered sets of genes

For a single-gene selection, there are \(n=6000\) possibilities.
For 2 genes, there are \(n \cdot (n-1) = 6000 \cdot 5999\) arrangements, but these would cover each gene pair with two different orders: \((a,b)\) and \((b,a\)). The number of unordered subsets of two genes is thus \(N = n \dot (n-1)/2\).
Similarly, for 3 genes we need to divide the number of arrangements (\(A^x_n = \frac{n!}{(n-x)!} = 6000 \cdot 5999 \cdot 5998\)) by the number of permutations of the same gene triplet (\((a, b, c), (a, c, b), (b, a, c) \ldots\)), which gives \(\frac{6000!}{(6000-3)! 3!} = \frac{6000 \cdot 5999 \cdot 5888}{6} = 3.6\times 10^{10}\).
For \(15\) genes we obtain \(\frac{n!}{(n-x)!x!} = \frac{6000!}{5985! \cdot 15!} = 3.53\times 10^{44}\) possible combinations.

Combinations

A combination is a selection without replacement a finite set, where the order of drawing is taken into consideration.

The number of possible combinations of \(x\) numbers among \(n\) is provided by the binomial coefficient.

\[\binom{n}{x} = C^x_n = \frac{n!}{x! (n-x)!}\]

Attention: the relative positions of \(x\) and \(n\) are opposite in the two alternative notations for combinations \(binom{n}{x}\) (“\(x\) among \(n\)”) and (\(C^x_n\), “choose”).

Combinations – Typical application

trio, a variation of the tricast bet, where the order of arrival of the 3 winner horses is not taken in consideration.

\[\binom{n}{x} = \binom{15}{3} = C^3_{15} = \frac{15!}{3! 12!} = 455\]

loto (or lotto): each player checks 6 numbers within a grid containing 90 numbers. The number of possibilities is

\[\binom{n}{x} = \binom{90}{6} = C^6_{90} = \frac{90!}{6! 84!} = 6.2261463\times 10^{8}\]

Summary of the concepts and formulas

Drawings with / without replacement

There are two classical ways of drawing elements among a set: with or without replacement.

Drawing without replacement: each element can be selected at most once. Examples:
- loto game (also spelled lotto).
- Arbitrary selection of a subset of the genes from a genome.
Drawing with replacement: each element can be drawn zero, one or several times. Examples:
- Dice game. At each drawing of a dice, we have the same possible outcomes (6 sides).
- Generation of a random sequence, by iteratively adding a randomly selected residue (4 nucleotides for DNA, 20 aminoacids for proteins).

Elements of combinatorics

Arrangements: drawings with order, without replacement
Combinations: drawings without considering the order of the items, without replacement)

Choice of the appropriate formula

Formulas

Replacement	Order	Formula	Description
Yes	Yes	\(n^x\)	Geometric progression: ordered drawings (sequences), with replacement, of \(x\) items from a set of size \(n\)
No	Yes	\(n!\)	factorial: permutations of all elements of a set of size \(n\)
No	Yes	\(A^x_n = \frac{n!}{(n-x)!}\)	Arrangements : ordered drawing, without replacement, of \(x\) items in a set of size \(n\)
No	No	\(C^x_n = \binom{n}{x} = \frac{n!}{x! (n - x) !}\)	Combinations : orderless drawing, without replacement, of \(x\) items in a set of size \(n\)

Supplementary exercises

Exercise 02.5: oligopeptides \(3 \times 20\)

How many distinct oligopeptides of size \(k=60\) can be formed by using exactly \(3\) times each amino acid?

Solution 02.5: oligopeptides \(3 \times 20\)