Combinatorics

Probabilities and statistics for bioinformatics (STAT1)

Jacques van Helden

2020-02-16

Enumerating oligonucleotides and oligopeptides

Problem

DNA is composed of 4 nucleotides denoted by the letters \(A\), \(C\), \(G\), \(T\). Proteins are made of 20 amino acids.

  1. For each of these two types of macromolecules, how many distinct oligomers can be formed by polymerizing 30 residues (“30-mers”)?

    Suggested approach: start with a simpler form of the same problem, considering polymers of much smaller sizes: 1 residue, then 2 residues, …

  2. Generalize the formula for oligomers of an arbitrary size \(k\) (so-called k-mers in the domain), made of \(n\) distinct residues.

  3. What is the name of the function resulting from this analysis?

  4. In this process, which sampling mode did you use to draw the residues: with or without replacement?

Solution: enumeration of oligomers

The geometric progression

The geometric progression is a succession of numbers where each term can be computed by multiplying the previous one by a constant factor.

\[x_i = x_{i-1} \cdot n\]

For an arbitrary size \(k\), the recurrence can be expanded step by step.

\[\begin{aligned} x_k &= x_{k-1} \cdot n \\ &= (x_{k-2} \cdot n) \cdot n = x_{k-2} \cdot n^2 \\ &= x_{k-3} \cdot n^3 = \ldots = x_0 \cdot n^k \end{aligned}\]

In our case, the initial value is \(x_0=1\); \(k\) denotes the oligomer size, and \(n\) is the number of distinct residues used to form the oligomer (\(n=4\) for nucleic acids, \(n=20\) for amino acids).
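
As a quick numerical check of \(x_k = x_0 \cdot n^k\), the Python sketch below applies the recurrence \(k\) times for both alphabets. The function name `count_oligomers` is chosen here for illustration and is not part of the course material.

```python
def count_oligomers(n: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet of n residues."""
    count = 1  # x_0 = 1
    for _ in range(k):
        count *= n  # x_i = x_{i-1} * n
    return count


# The suggested approach: start with small sizes, then move to k = 30.
for k in (1, 2, 3, 30):
    dna = count_oligomers(4, k)
    peptides = count_oligomers(20, k)
    print(f"k={k:2d}  oligonucleotides={dna:.3e}  oligopeptides={peptides:.3e}")
```

Python integers have arbitrary precision, so the counts are exact even for \(k=30\); only the printed format is rounded.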

Number of oligomers

Number of possible oligonucleotides (top) and oligopeptides (bottom), with either a linear (left) or a logarithmic (right) scale for the ordinate.

Exercise 02.1: oligomers with no repeated residue

How many oligomers (DNA or peptides) can be formed that contain each residue exactly once?

Suggested approach: progressively aggregate the residues, asking at each step how many residues have not yet been incorporated into the sequence.

Sub-questions:

Solution: oligomers with no repeated residue

\[n! = n \cdot (n-1) \cdot \ldots \cdot 2 \cdot 1\]

The factorial function

\[N = n! = \left\{ \begin{array}{ll} 1 & \text{if } n=0 \\ n \cdot (n-1)! &\text{otherwise} \end{array} \right.\]

Note: by definition, \(0! = 1\), which makes it possible to compute \(1!\) and the subsequent values with the recursive formula.

For sufficiently large values of \(n\), a clearer formulation is

\[N = n \cdot (n-1) \cdot (n-2) \ldots 2 \cdot 1\]
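
The recursive definition translates almost literally into code. The sketch below is a minimal illustration (the name `factorial` is chosen here; Python's standard library already provides `math.factorial`, which is used only as a cross-check).

```python
import math


def factorial(n: int) -> int:
    """Recursive factorial: 0! = 1, n! = n * (n-1)! otherwise."""
    if n < 0:
        raise ValueError("factorial is only defined for n >= 0")
    if n == 0:
        return 1
    return n * factorial(n - 1)


# Oligomers containing each residue exactly once
print("DNA (n=4):      ", factorial(4))
print("Peptides (n=20):", factorial(20))

# Cross-check against the standard library
assert factorial(20) == math.factorial(20)
```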

Factorial

Exercise 02.2: gene lists (with order)

A transcriptome experiment was carried out to measure the expression level of all yeast genes. Knowing that the genome contains \(6000\) genes, how many possible ways are there to select the \(15\) most expressed genes with their relative order?

Suggested approach: as previously, simplify the problem by starting from the minimal selection, and progressively increase the number of selected genes (1 gene, 2 genes, …).

Complementary questions:

Solution 02.2: (ordered) lists of genes

This is a selection without replacement (indeed, each gene appears at one and only one position in the list of all genes), and ordered (a list with the same genes taken in a different order would be considered a different result).

Note that this can be represented by a more compact formula.

\[N = n \cdot (n-1) \cdot (n-2) \cdot \ldots \cdot (n-x+1) = \frac{n!}{(n-x)!}\]

Arrangements

In combinatorics, the term arrangement denotes an ordered drawing without replacement, i.e. a drawing where the order of the items is taken into consideration, and where an item that has already been selected cannot be selected again.

Number of arrangements of \(x\) items drawn in a set of size \(n\).

\[\begin{array}{ccl} A^x_n & = & \frac{n!}{(n - x)!} \\ & = & \frac{n(n-1) \ldots (n-x +1) (n - x) (n-x-1) \ldots 2 \cdot 1}{(n - x) (n-x-1) \ldots 2 \cdot 1} \\ & = & n \cdot (n-1) \cdot \ldots \cdot (n-x+1) \end{array} \]
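
The simplified product can be evaluated directly, without computing the two large factorials. The sketch below is a minimal illustration applied to the ordered gene list (\(n=6000\), \(x=15\)); the function name `count_arrangements` is chosen here, and the cross-check with `math.perm` assumes Python 3.8 or later.

```python
import math


def count_arrangements(n: int, x: int) -> int:
    """A^x_n = n * (n-1) * ... * (n-x+1): ordered drawing without replacement."""
    result = 1
    for i in range(x):
        result *= n - i  # n - i items remain available at step i
    return result


n_genes, list_size = 6000, 15
n_lists = count_arrangements(n_genes, list_size)

# Both forms of the formula give the same number
assert n_lists == math.factorial(n_genes) // math.factorial(n_genes - list_size)
assert n_lists == math.perm(n_genes, list_size)

print(f"Ordered lists of {list_size} genes among {n_genes}: {n_lists:.3e}")
```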

Arrangements – Typical application

Exercise 02.3: unordered sets of genes

A transcriptomics experiment was carried out to measure the expression level of all yeast genes. Knowing that the genome contains \(6000\) genes, how many possibilities are there to select the \(15\) genes with the highest expression level without taking into account the relative order of those 15 genes?

Suggested approach: as previously, simplify the problem by starting from minimal selections (1 gene, 2 genes, …) and then generalize the formula.

Complementary questions:

Solution 02.3: unordered sets of genes

Combinations

A combination is a selection without replacement from a finite set, where the order of drawing is not taken into consideration.

The number of possible combinations of \(x\) numbers among \(n\) is provided by the binomial coefficient.

\[\binom{n}{x} = C^x_n = \frac{n!}{x! (n-x)!}\]

Attention: the relative positions of \(x\) and \(n\) are opposite in the two alternative notations for combinations: \(\binom{n}{x}\) (“\(x\) among \(n\)”) and \(C^x_n\) (“choose”).

Combinations – Typical application

\[\binom{n}{x} = \binom{15}{3} = C^3_{15} = \frac{15!}{3! 12!} = 455\]

\[\binom{n}{x} = \binom{90}{6} = C^6_{90} = \frac{90!}{6! 84!} = 6.2261463\times 10^{8}\]
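
The two values above can be checked numerically. The sketch below is a minimal illustration: `count_combinations` is a name chosen here, and `math.comb` (assumed available, i.e. Python 3.8 or later) serves as a cross-check. The last lines apply the same formula to the unordered selection of 15 genes among 6000.

```python
import math


def count_combinations(n: int, x: int) -> int:
    """C^x_n = n! / (x! (n-x)!): unordered drawing without replacement."""
    return math.factorial(n) // (math.factorial(x) * math.factorial(n - x))


print(count_combinations(15, 3))  # 455
print(count_combinations(90, 6))  # 622614630

# Each unordered set of x items corresponds to x! ordered lists,
# so the number of combinations is the number of arrangements divided by x!
assert count_combinations(90, 6) == math.perm(90, 6) // math.factorial(6)
assert count_combinations(90, 6) == math.comb(90, 6)

n_sets = count_combinations(6000, 15)
print(f"Unordered sets of 15 genes among 6000: {n_sets:.3e}")
```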

Summary of the concepts and formulas

Drawings with / without replacement

There are two classical ways of drawing elements among a set: with or without replacement.

  1. Drawing without replacement: each element can be selected at most once. Examples:

    • loto game (also spelled lotto).
    • Arbitrary selection of a subset of the genes from a genome.
  2. Drawing with replacement: each element can be drawn zero, one or several times. Examples:

    • Dice game. At each roll of a die, we have the same possible outcomes (6 sides).
    • Generation of a random sequence, by iteratively adding a randomly selected residue (4 nucleotides for DNA, 20 amino acids for proteins).
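
The two drawing modes can be illustrated with Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. This is a minimal sketch; the gene identifiers and the seed are made up for the example.

```python
import random

rng = random.Random(42)  # arbitrary seed, only to make the example reproducible

# Without replacement: arbitrary selection of a subset of genes.
# The gene names are hypothetical placeholders.
genes = [f"gene{i}" for i in range(1, 6001)]
subset = rng.sample(genes, k=15)
assert len(set(subset)) == 15  # no gene is selected twice

# With replacement: random DNA sequence, each residue can be drawn repeatedly.
sequence = "".join(rng.choices("ACGT", k=30))
print(subset[:3])
print(sequence)
```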

Elements of combinatorics

Choice of the appropriate formula

Formulas

| Replacement | Order | Formula | Description |
|---|---|---|---|
| Yes | Yes | \(n^x\) | Geometric progression: ordered drawings (sequences), with replacement, of \(x\) items from a set of size \(n\) |
| No | Yes | \(n!\) | Factorial: permutations of all elements of a set of size \(n\) |
| No | Yes | \(A^x_n = \frac{n!}{(n-x)!}\) | Arrangements: ordered drawing, without replacement, of \(x\) items in a set of size \(n\) |
| No | No | \(C^x_n = \binom{n}{x} = \frac{n!}{x! (n - x)!}\) | Combinations: orderless drawing, without replacement, of \(x\) items in a set of size \(n\) |
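
The choice of formula can be summarized as a small decision function. The sketch below is one possible encoding, not part of the course material: the name `count_drawings` and its parameters are chosen here, and `math.perm` / `math.comb` assume Python 3.8 or later. Unordered drawings with replacement do not appear in the table and are deliberately left out.

```python
import math


def count_drawings(n: int, x: int, replacement: bool, ordered: bool) -> int:
    """Number of possible drawings of x items from a set of size n."""
    if replacement and ordered:
        return n ** x  # geometric progression
    if not replacement and ordered:
        return math.perm(n, x)  # arrangements; equals n! when x == n
    if not replacement and not ordered:
        return math.comb(n, x)  # combinations
    raise NotImplementedError("unordered drawing with replacement is not covered here")


print(count_drawings(4, 30, replacement=True, ordered=True))      # DNA 30-mers
print(count_drawings(20, 20, replacement=False, ordered=True))    # 20!
print(count_drawings(6000, 15, replacement=False, ordered=True))  # ordered gene lists
print(count_drawings(6000, 15, replacement=False, ordered=False)) # unordered gene sets
```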

Supplementary exercises

Exercise 02.5: oligopeptides \(3 \times 20\)

How many distinct oligopeptides of size \(k=60\) can be formed by using exactly \(3\) times each amino acid?

Solution 02.5: oligopeptides \(3 \times 20\)

How many distinct oligopeptides of size \(k=60\) can be formed by using exactly \(3\) times each amino acid?

Let us start by generating a particular sequence that fits these conditions, by concatenating 3 copies of each amino acid in alphabetical order.

AAACCCDDDEEEFFFGGGHHHIIIKKKLLLMMMNNNPPPQQQRRRSSSTTTVVVWWWYYY

Any permutation of these 60 letters is a valid solution. Here are three examples.

KFWLIYQGGQINDDHFCRRTTHMDELEFKWQAVAHPYWNSMTAYNSGMSCPLVEIKRPCV
QIYFSHKPVDWLTQFVAKIKMEYDMCGLHRDSVGRCCIFGRLHETWQWAPNAMYNENSTP
SMQLGGPQRWCWTFSANVVAKIMAFFYMGERKICKIHQVWPDPDHLTNHELTENYDRSYC

However, we have to take into account that swapping two identical amino acids gives the same sequence. The difficulty of the exercise is thus to count the distinct permutations.
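
One standard way to count the distinct permutations, given here as a sketch and not necessarily the derivation followed in the course, is to divide the \(60!\) permutations of the letters by \(3!\) for each of the 20 amino acids, since the \(3!\) orderings of three identical residues are indistinguishable. The function name `count_distinct_permutations` is chosen here for illustration, and the brute-force check is only feasible for very short sequences.

```python
import math
from collections import Counter
from itertools import permutations


def count_distinct_permutations(sequence: str) -> int:
    """len(sequence)! divided by the factorial of each residue's copy number."""
    total = math.factorial(len(sequence))
    for copies in Counter(sequence).values():
        total //= math.factorial(copies)  # identical residues are interchangeable
    return total


# Brute-force check on a small case: 2 copies each of 2 residues
small = "AACC"
assert count_distinct_permutations(small) == len(set(permutations(small)))

# The exercise: 3 copies of each of the 20 amino acids
template = "".join(aa * 3 for aa in "ACDEFGHIKLMNPQRSTVWY")
n_peptides = count_distinct_permutations(template)
assert n_peptides == math.factorial(60) // math.factorial(3) ** 20
print(f"Distinct oligopeptides: {n_peptides:.3e}")
```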