The Law of Large Numbers
If you repeat an experiment many times, the average of the results tends to stabilize. This simple observation underpins much of statistics and numerical methods.
Consider a collection of independent and identically distributed observations \(X_1,X_2,\dots\) with finite expectation \(\mu=E[X_1]\). The quantity we observe is the sample mean
\[
\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.
\]
The intuitive question is: what happens to \(\overline{X}_n\) as \(n\) grows? The answer: it tends to \(\mu\) according to various notions of convergence. The two most common are convergence in probability (WLLN) and almost sure convergence (SLLN).
Assume the variance \(\sigma^2=\operatorname{Var}(X_1)\) exists. The variance of the sum \(\sum_{i=1}^n X_i\) is \(n\sigma^2\), growing linearly with \(n\). Dividing the sum by \(n\) divides its variance by \(n^2\), giving the variance of the mean:
\[
\operatorname{Var}(\overline{X}_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n},
\]
which goes to zero as \(n\to\infty\). Chebyshev's inequality turns this into a probabilistic statement: for every \(\varepsilon>0\),
\[
P\bigl(|\overline{X}_n-\mu|\ge\varepsilon\bigr) \le \frac{\sigma^2}{n\varepsilon^2}.
\]
This is the weak law of large numbers (WLLN): for large \(n\) the probability of a significant deviation becomes small. It is also a quantitative guideline: to make the probability of a deviation larger than \(\varepsilon\) at most \(\delta\), it suffices to choose \(n\ge\sigma^2/(\delta\varepsilon^2)\).
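The shrinking deviation probability, and the Chebyshev bound on it, can be checked numerically. The sketch below uses Exponential(1) variables (so \(\mu=1\), \(\sigma^2=1\)); the distribution, \(\varepsilon\), and the trial counts are illustrative choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate P(|mean - mu| > eps) for growing n, using Exponential(1)
# samples (mu = 1, sigma^2 = 1), and compare with the Chebyshev bound.
mu, sigma2, eps = 1.0, 1.0, 0.1
trials = 2000

for n in [100, 1000, 10000]:
    samples = rng.exponential(1.0, size=(trials, n))
    means = samples.mean(axis=1)
    p_dev = np.mean(np.abs(means - mu) > eps)   # empirical deviation prob.
    chebyshev = min(1.0, sigma2 / (n * eps**2))  # Chebyshev upper bound
    print(f"n={n:6d}  empirical={p_dev:.4f}  Chebyshev<= {chebyshev:.4f}")
```

The empirical probability falls well below the bound: Chebyshev is valid but far from tight, which is the point of the sharper inequalities discussed later.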
The WLLN does not assert that for each individual sequence of outcomes the mean converges; it only says the probability that the mean is far from \(\mu\) goes to zero. The SLLN strengthens the conclusion: with probability 1 the sequence of sample means converges to \(\mu\). Proving the SLLN requires controlling distribution tails: rare but very large values can spoil convergence unless controlled.
The standard method is truncation: define truncated variables \(X_i^{(M)} = X_i\mathbf{1}_{\{|X_i|\le M\}}\), typically with a level growing with the index (e.g. \(M=i\) for the \(i\)-th variable). The truncated versions have controlled tails, so strong results apply to them; the Borel–Cantelli lemma then shows that, when \(E|X_1|<\infty\), the events \(\{|X_i|>i\}\) occur only finitely many times almost surely, so the truncated and original sequences eventually agree. Combining these steps yields the SLLN for the original variables.
The simple Borel–Cantelli lemma: if \(\sum_n P(A_n)<\infty\) then the probability that infinitely many of the events \(A_n\) occur is 0. This connects sums of probabilities with almost-sure behavior.
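The lemma is easy to check by simulation. The sketch below takes independent events \(A_n\) with \(P(A_n)=1/n^2\), so \(\sum_n P(A_n)=\pi^2/6<\infty\); the horizon \(N\) and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent events A_n with P(A_n) = 1/n^2; the probabilities sum to
# pi^2/6 < inf, so Borel-Cantelli predicts only finitely many occurrences.
N = 100_000
probs = 1.0 / np.arange(1, N + 1) ** 2
occurred = rng.random(N) < probs

count = int(occurred.sum())
last = int(np.max(np.nonzero(occurred)[0])) + 1  # 1-based index of last event
print(f"events occurred: {count}, last occurrence at n = {last}")
```

In a typical run only a handful of events fire, all at small indices, exactly as the lemma predicts.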
Knowing convergence is useful, but knowing the scale of fluctuations is often more important in practice. The Central Limit Theorem (CLT) states that if \(\operatorname{Var}(X_1)=\sigma^2<\infty\), then
\[
\sqrt{n}\,\frac{\overline{X}_n-\mu}{\sigma} \xrightarrow{d} N(0,1) \quad\text{as } n\to\infty.
\]
This implies typical deviations of \(\overline{X}_n\) around \(\mu\) are of order \(1/\sqrt{n}\). For large \(n\), \(\overline{X}_n\) is approximately normal with mean \(\mu\) and variance \(\sigma^2/n\). Confidence intervals are therefore justified by the CLT.
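The normal approximation behind confidence intervals can be tested directly: standardize many sample means and count how often they land within \(\pm 1.96\). The sketch again uses a skewed Exponential(1) distribution to show the CLT does not need symmetry; the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Standardized sample means of a skewed distribution (Exponential(1))
# should be approximately N(0, 1), so about 95% fall within +-1.96.
mu, sigma, n, trials = 1.0, 1.0, 500, 20000
means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
z = np.sqrt(n) * (means - mu) / sigma
coverage = np.mean(np.abs(z) < 1.96)
print(f"fraction of |z| < 1.96: {coverage:.3f}  (normal theory: 0.950)")
```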
Quantitative results (Berry–Esseen) provide an \(O(1/\sqrt{n})\) bound on the approximation error, with constants depending on third moments of the \(X_i\).
Non-asymptotic bounds giving exponentially small probabilities of large deviations are often useful. For bounded or sub-Gaussian variables there are inequalities such as Hoeffding's and Bernstein's.
Compared to Chebyshev these bounds are much stronger: the probability of large deviations decays exponentially in \(n\), not just as \(1/n\).
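The gap is easy to see numerically. For variables bounded in \([0,1]\), Chebyshev with the worst-case variance \(1/4\) gives \(P(|\overline{X}_n-\mu|\ge\varepsilon)\le 1/(4n\varepsilon^2)\), while Hoeffding's inequality gives \(2e^{-2n\varepsilon^2}\). The choice of \(\varepsilon\) below is illustrative.

```python
import math

# Tail bounds on P(|mean - mu| >= eps) for i.i.d. variables in [0, 1]:
# Chebyshev decays like 1/n, Hoeffding exponentially in n.
eps = 0.05
for n in [100, 1000, 10000]:
    chebyshev = min(1.0, 0.25 / (n * eps**2))
    hoeffding = min(1.0, 2 * math.exp(-2 * n * eps**2))
    print(f"n={n:6d}  Chebyshev<= {chebyshev:.3e}  Hoeffding<= {hoeffding:.3e}")
```

At \(n=10{,}000\) the Chebyshev bound is \(10^{-2}\) while the Hoeffding bound is around \(10^{-21}\): twenty orders of magnitude sharper.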
For a simple didactic case let \(X_i\) be Bernoulli with parameter \(p\): \(X_i\in\{0,1\}\), \(P(X_i=1)=p\). Then \(\mu=p\) and \(\sigma^2=p(1-p)\). The variance of the mean is:
\[
\operatorname{Var}(\overline{X}_n) = \frac{p(1-p)}{n}.
\]
The typical standard deviation is \(\sqrt{p(1-p)/n}\). In the demo, increasing \(n\) makes the individual trajectories \(f_i(t)\) approach the horizontal line at \(p\), and the histogram of the final frequencies \(f_i(n)\) becomes more concentrated around \(p\). Increasing \(m\) (the number of trajectories) with \(n\) fixed makes the histogram less noisy and a better approximation of the sampling distribution of \(\overline{X}_n\).
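A minimal offline version of the demo can be sketched as follows; the values of \(p\), \(n\), and \(m\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Running frequencies f_i(t) of m Bernoulli(p) trajectories. By the LLN
# each trajectory settles near p; the final frequencies spread as
# sqrt(p(1-p)/n), matching the CLT scale.
p, n, m = 0.3, 5000, 200
flips = rng.random((m, n)) < p
freqs = np.cumsum(flips, axis=1) / np.arange(1, n + 1)  # f_i(t), t = 1..n

final = freqs[:, -1]
print(f"mean of final frequencies: {final.mean():.4f}  (p = {p})")
print(f"std  of final frequencies: {final.std():.4f}  "
      f"(theory: {np.sqrt(p * (1 - p) / n):.4f})")
```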
An approximate 95% confidence interval for the final frequency \(\hat p=\overline{X}_n\) is
\[
\hat p \pm 1.96\sqrt{\frac{\hat p(1-\hat p)}{n}}.
\]
For \(p\) near 0 or 1 and for small \(n\) this approximation can be poor; in those cases use exact methods (Clopper–Pearson) or transformations.
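The normal-approximation (Wald) interval above is a one-line computation; the function name and the example counts below are purely illustrative.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% interval for a Bernoulli frequency.

    For p near 0 or 1, or small n, prefer an exact Clopper-Pearson
    interval instead of this approximation.
    """
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

lo, hi = wald_interval(310, 1000)  # e.g. 310 successes out of 1000 trials
print(f"p_hat = 0.310, approximate 95% CI = ({lo:.3f}, {hi:.3f})")
```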
The LLN explains why repeating measurements and computing averages is effective: in labs or Monte Carlo simulations, averaging \(n\) independent repetitions shrinks the typical error at the \(1/\sqrt{n}\) rate predicted by the CLT.
Now try the interactive simulation below to verify these points.