An Intuitive Explanation for the Birthday Paradox

The so-called “birthday paradox” concerns the fact that even in surprisingly small groups the probability for at least two people to share their birthdate is surprisingly high.

The combinatorics of the Birthday Paradox are not hard; in fact, it is a standard example in introductory textbooks. Yet, the typical exact treatment, provides no intuitive sense for the way the probability depends on the size of the sample and the population.

Here is an approximate solution that shows that the probability in question is, in fact, a Gaussian.

Exact Solution

The Birthday Paradox is equivalent of finding the probability to draw a sample of $k$ different items from a population of $N$ distinct elements, when sampling “with replacement” (meaning that it is permissible to select the same item multiple times).

When sampling $k$ different items, we have $N$ possible choices for the first, item, $N-1$ choices for the second, $N-2$ for the third, and so on. The probability to select $k$ different items is therefore:

\begin{align*} P(k, N) &= \left( \frac{N}{N} \right) \left( \frac{N-1}{N} \right) \dots \left(\frac{N-k}{N} \right) \\ &= \left( 1 - \frac{1}{N} \right) \left( 1 - \frac{2}{N} \right) \dots \left( 1 - \frac{k}{N} \right) \end{align*}

This expression straightforward to evaluate (on a computer), but does not given an intuitive insight into the reason why the Birthday Paradox should be true. Just looking at the expression, it is not clear how $P(k, N)$ depends on its parameters.

First Approximate Solution

Assuming that $k \ll N$, we can drop all terms higher-order terms when multiplying out the product, retaining only terms up to first order. The result is:

\begin{align*} P(k, N) &= \left( 1 - \frac{1}{N} \right) \dots \left( 1 - \frac{k}{N} \right) \\ &\approx 1 - \frac{1}{N} - \frac{2}{N} \dots - \frac{k}{N} \\ &= 1 - \frac{1}{N} \sum_j^k j \\ &= 1 - \frac{k(k+1)}{2N} \end{align*}

The key step is the use of the well-known identity for the sum of a sequence of integers:

\[ \sum_j^k j = \frac{k(k+1)}{2} \]

This gives us a first intuitive sense for the reason behind the Birthday Paradox: the sample size enters the expression for the probability quadratically, whereas the population size only provides a linear factor. As the sample size grows, its influence quickly becomes important.

Second Approximate Solution

We can develop an even better insight from a more sophisticated approximation (due to Feller: An Introduction to Probability Theory and Its Applications, Volume 1, Chapter 2, Section 3). Instead of expanding the product for $P(k, N)$ directly, we instead expand its logarithm:

\begin{align*} \log P(k, N) &= \log \left( 1 - \frac{1}{N} \right) + \dots + \log \left( 1 - \frac{k}{N} \right) \\ &\approx - \frac{1}{N} - \frac{2}{N} - \dots - \frac{k}{N} \\ &= - \frac{k(k+1)}{2N} \end{align*}

where the logarithm $\log(1+x)$ has been expanded for small $x$ as $\log(1+x) \approx 1 + x$.

Now taking exponentials to get rid of the logarithm on the left-hand side, we arrive at our final expression for the probability to select $k$ different items from a population of $N$ elements, when sampling with replacement:

\[ P(k,N) = \mathrm{e}^{-\frac{k(k-1)}{2N}} \]

This is essentially a Gaussian! In fact, we can fold the denominator into the square, drop sub-dominant contributions and arrive at the following, simple expression:

\[ P(k,N) = \mathrm{e}^{-\frac{1}{2} \left( k / \sqrt{N} \right)^2} \]

In words: the probability to select $k$ distinct elements from a population of size $N$ behaves like a Gaussian in $k$, with width proportional to $\sqrt{N}$. And it leads to a rule of thumb: when the sample size exceeds the square-root of the population size, the probability of finding no collisions diminishes very fast.

Now, that’s fairly easy to visualize…!