Jacques van Helden
2020-02-20
The expression *discrete distribution* denotes the probability distribution of variables that only take discrete values (as opposed to continuous distributions).
Notes:
In probability, the observed variable (\(x\)) usually represents the number of successes in a series of trials, or the counts of some observation. In such cases, its values are natural numbers (\(x \in \mathbb{N}\)).
The probability \(\operatorname{P}(x)\) takes real values between 0 and 1, but its distribution is said to be *discrete* since it is only defined for a set of discrete values of \(X\). It is generally represented by a step function.
Application: waiting time until the first appearance of an event in a Bernoulli schema.
Examples:
In a series of dice rolls, count the number of rolls (\(x\)) before the first occurrence of a 6 (this occurrence itself is not taken into account).
Length of a DNA sequence before the first occurrence of a cytosine.
The Probability Mass Function (PMF) indicates the probability to observe a particular result.
For the geometric distribution, it indicates the probability to observe exactly \(x\) failures before the first success, in a series of independent trials with a probability of success \(p\).
\[\operatorname{P}(X = x) = (1-p)^x \cdot p\]
Justification: observing exactly \(x\) failures before the first success requires \(x\) consecutive failures, each with probability \(1-p\), followed by one success with probability \(p\); since the trials are independent, these probabilities multiply.
Note: the PMF of discrete distributions relates to the concept of density used for continuous distributions.
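This PMF can be checked numerically in R with dgeom(), which uses the same parameterisation (number of failures before the first success). The values below (a dice-rolling example with \(p = 1/6\)) are illustrative:

```r
# Geometric PMF: probability of exactly x failures before the first success
p <- 1/6              # success probability (e.g. rolling a 6)
x <- 0:10             # numbers of failures considered
pmf <- (1 - p)^x * p  # PMF from the formula above

# dgeom() implements the same parameterisation
stopifnot(isTRUE(all.equal(pmf, dgeom(x, prob = p))))
```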
The tails of a distribution are the areas under the density curve up to a given value (left tail) or starting from a given value (right tail).
The left tail of a distribution indicates the probability to observe a result (\(X\)) smaller than or equal to a given value (\(x\)): \(\operatorname{P}(X \le x)\).
The right tail indicates the probability to observe a result greater than or equal to a given value: \(\operatorname{P}(X \ge x)\).
The binomial distribution indicates the probability to observe a given number of successes (\(x\)) in a series of \(n\) independent trials with constant success probability \(p\) (Bernoulli schema).
Binomial PMF
\[\operatorname{P}(X=x) = \binom{n}{x} \cdot p^x \cdot (1-p)^{n-x} = C_n^x p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}\]
Binomial right tail
\[\operatorname{P}(X \ge x) = \sum_{i=x}^{n}{\operatorname{P}(X=i)} = \sum_{i=x}^{n}{C_n^i p^i (1-p)^{n-i}}\]
Note: this right tail is the complement of the cumulative distribution function (CDF), which gives \(\operatorname{P}(X \le x)\).
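In R, the binomial PMF and right tail can be obtained with dbinom() and pbinom(); the parameter values below are arbitrary illustrations:

```r
n <- 10    # number of trials
p <- 1/6   # success probability
x <- 3     # number of successes

# PMF: P(X = x), from dbinom() and from the formula above
stopifnot(isTRUE(all.equal(dbinom(x, size = n, prob = p),
                           choose(n, x) * p^x * (1 - p)^(n - x))))

# Right tail P(X >= x): sum of the PMF, or pbinom() with lower.tail = FALSE
tail.sum  <- sum(dbinom(x:n, size = n, prob = p))
tail.pbin <- pbinom(x - 1, size = n, prob = p, lower.tail = FALSE)
stopifnot(isTRUE(all.equal(tail.sum, tail.pbin)))
```

Note that pbinom(q, ..., lower.tail = FALSE) returns \(\operatorname{P}(X > q)\), hence the shift to \(x - 1\) to obtain \(\operatorname{P}(X \ge x)\).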
Properties
Expectation: \(\mu = n \cdot p\).
Variance: \(\sigma^2 = n \cdot p \cdot (1 - p)\).
Standard deviation: \(\sigma = \sqrt{n \cdot p \cdot (1 - p)}\)
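These moments can be checked by random sampling with rbinom(); the values of n, p and the seed below are arbitrary:

```r
set.seed(123)
n <- 20
p <- 0.3
x <- rbinom(100000, size = n, prob = p)
mean(x)   # close to the expectation n * p = 6
var(x)    # close to the variance n * p * (1 - p) = 4.2
```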
The binomial distribution can take various shapes depending on the values of its parameters (success probability \(p\), and number of trials \(n\)).
When the expectation (\(p \cdot n\)) is very small, the binomial distribution is monotonically decreasing and is described as \(i\)-shaped.
When the probability is relatively high but still lower than \(0.5\), the distribution takes the shape of an asymmetric bell.
When the success probability \(p\) is exactly \(0.5\), the binomial distribution takes the shape of a symmetrical bell.
When the success probability is close to 1, the distribution is monotonically increasing and is described as \(j\)-shaped.
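These shapes can be verified numerically on the PMF (the choice of \(n = 20\) is arbitrary):

```r
n <- 20
pmf <- function(p) dbinom(0:n, size = n, prob = p)

all(diff(pmf(0.01)) < 0)                    # i-shaped: monotonically decreasing
all(diff(pmf(0.99)) > 0)                    # j-shaped: monotonically increasing
isTRUE(all.equal(pmf(0.5), rev(pmf(0.5))))  # p = 0.5: symmetrical bell
```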
Note: the binomial assumes a Bernoulli schema. For examples 2 and 3 this amounts to considering that nucleotides are concatenated independently, which is quite unrealistic.
The Poisson law describes the probability of the number of realisations of an event during a fixed time interval, assuming that the average number of events is constant, and that the events are independent (previous realisations do not affect the probabilities of future realisations).
\[P(X = x) = \frac{\lambda^x}{x!}e^{-\lambda}\]
Expectation (number of realisations expected by chance): \(\langle X \rangle = \lambda\) (by construction)
Variance: \(\sigma^2 = \lambda\) (the variance equals the mean!)
Standard deviation: \(\sigma = \sqrt{\lambda}\)
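A quick numerical check of the PMF and of these properties with dpois() and rpois() (\(\lambda = 2.5\) chosen arbitrarily):

```r
lambda <- 2.5
x <- 0:15

# PMF from the formula above vs dpois()
stopifnot(isTRUE(all.equal(lambda^x * exp(-lambda) / factorial(x),
                           dpois(x, lambda))))

# Random sampling: mean and variance should both be close to lambda
set.seed(123)
r <- rpois(100000, lambda)
mean(r)   # ~ lambda
var(r)    # ~ lambda
```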
A bacterial population is exposed to a mutagen (chemical agent, irradiation). Each cell is affected by a particular number of mutations.
Taking into account the dose of the mutagen (exposure time, intensity, concentration), one can empirically measure the mean number of mutations per individual (expectation, \(\lambda\)).
The Poisson law can be used to describe the probability for a given cell to have a given number of mutations (\(x=0, 1, 2, ...\)).
In 1943, Salvador Luria and Max Delbrück demonstrated that when cultured bacteria are treated with an antibiotic, the mutations that confer resistance are not induced by the antibiotic itself, but preexist. Their demonstration relies on the fact that the number of antibiotic-resistant cells deviates from a Poisson law: its variance greatly exceeds its mean, whereas mutations induced independently by the antibiotic would produce a Poisson distribution (Luria & Delbrück, 1943, Genetics 28:491–511).
When the number of trials \(n\) is large and the success probability \(p\) is small, with the expectation \(\lambda = n \cdot p\) held constant, the binomial law converges towards a Poisson.
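A sketch of this convergence: with \(\lambda = n \cdot p\) held fixed, the maximal difference between the binomial and Poisson PMFs shrinks as \(n\) grows (the value \(\lambda = 3\) is arbitrary):

```r
lambda <- 3
x <- 0:15

# Maximal absolute difference between binomial and Poisson PMFs,
# for increasing n with n * p = lambda held constant
max.diff <- sapply(c(10, 100, 1000), function(n) {
  max(abs(dbinom(x, size = n, prob = lambda / n) - dpois(x, lambda)))
})
max.diff   # decreases towards 0 as n increases
```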
The negative binomial distribution (also called Pascal distribution) indicates the probability of the number of successes (\(k\)) before the \(r^{th}\) failure, in a Bernoulli schema with success probability \(p\).
\[\mathcal{NB}(k|r, p) = \binom{k+r-1}{k}p^k(1-p)^r\]
This formula is a simple adaptation of the binomial, with the difference that we know that the last trial must be a failure. The binomial coefficient thus reduces to choosing the \(k\) successes among the \(n - 1 = k + r - 1\) trials preceding the \(r^{th}\) failure.
It can also be adapted to indicate related probabilities.
\[\mathcal{NB}(r|k, p) = \binom{k+r-1}{r}p^k(1-p)^r\]
\[\mathcal{NB}(n|r, p) = \binom{n-1}{r-1}p^{n-r}(1-p)^r\]
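The first formula can be checked against R's dnbinom(). Note that dnbinom() counts failures before the size-th success, so the document's convention (successes before the \(r^{th}\) failure) is obtained by swapping the roles of success and failure, i.e. passing prob = 1 - p. The parameter values are illustrative:

```r
r <- 6      # number of failures
p <- 0.25   # success probability
k <- 0:10   # numbers of successes considered

# PMF from the formula above
pmf <- choose(k + r - 1, k) * p^k * (1 - p)^r

# dnbinom() counts failures before the size-th success; swapping the
# success/failure roles (prob = 1 - p) yields the same distribution
stopifnot(isTRUE(all.equal(pmf, dnbinom(k, size = r, prob = 1 - p))))
```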
The variance of the negative binomial is higher than its mean. It is therefore sometimes used to model distributions that are over-dispersed compared with a Poisson.
Each student chooses a value for the number of failures (\(r\)).
Use the R function rnbinom() (see help(NegBinomial)) to compute the distribution of the number of successes (\(k\)) before the \(r^{th}\) failure.

r <- 6 # Number of failures
p <- 0.75 # Failure probability (passed as the prob argument of rnbinom())
rep <- 100000
k <- rnbinom(n = rep, size = r, prob = p)
max.k <- max(k)
exp.mean <- r*(1 - p)/p
rand.mean <- mean(k)
exp.var <- r*(1 - p)/p^2
rand.var <- var(k)
hist(k, breaks = -0.5:(max.k + 0.5), col = "grey", xlab = "Number of successes (k)",
las = 1, ylab = "", main = "Random sampling from negative binomial")
abline(v = rand.mean, col = "darkgreen", lwd = 2)
abline(v = exp.mean, col = "green", lty = "dashed")
arrows(rand.mean, rep/20, rand.mean + sqrt(rand.var), rep/20,
angle = 20, length = 0.1, col = "purple", lwd = 2)
text(x = rand.mean, y = rep/15, col = "purple",
labels = paste("sd =", signif(digits = 2, sqrt(rand.var))), pos = 4)
legend("topright", legend = c(
paste("r =", r),
paste("exp.mean =", signif(digits = 4, exp.mean)),
paste("mean =", signif(digits = 4, rand.mean)),
paste("exp.var =", signif(digits = 4, exp.var)),
paste("var =", signif(digits = 4, rand.var))
))
library(knitr) # for kable()
kable(data.frame(r = r,
exp.mean = exp.mean,
mean = rand.mean,
exp.var = exp.var,
var = rand.var), digits = 4)
r | exp.mean | mean | exp.var | var |
---|---|---|---|---|
6 | 2 | 1.9963 | 2.6667 | 2.6705 |