1. Interpretations of probability
The term probability is used in different contexts and with meanings that, at first glance, may appear confusing or in conflict. The most common readings are:
1.1 Classical interpretation (Laplace)
This applies when the outcome space is finite and the elementary outcomes are considered equally likely for reasons of symmetry. The probability of an event \(A\) is the ratio of the number of favorable outcomes to the total number of possible outcomes:
\[
P(A) = \frac{|A|}{|\Omega|}.
\]
Limitation: it is not directly applicable to infinite spaces or continuous variables without an additional criterion that justifies equal weighting.
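As a concrete illustration of the classical rule (a minimal sketch; the die example is chosen only for illustration), counting favorable and possible outcomes can be done directly:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
# (equally likely outcomes, by symmetry).
omega = {1, 2, 3, 4, 5, 6}

# Event: the roll is even.
event = {w for w in omega if w % 2 == 0}

# Classical (Laplace) probability: favorable / possible.
p = Fraction(len(event), len(omega))
print(p)  # 1/2
```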
1.2 Frequentist interpretation
Probability is seen as the limit of relative frequencies in the long run: if an experiment is repeated many times, the fraction of times the event occurs tends to its probability.
Limitation: it requires the notion of repeatability and does not assign a probability to single non-repeatable events (e.g., the probability of a unique historical event).
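The frequentist reading can be illustrated empirically (a sketch; the fair-coin model and trial count are illustrative choices): the relative frequency of an event stabilizes near its probability as the number of repetitions grows.

```python
import random

random.seed(0)

# Estimate P(heads) for a fair coin by long-run relative frequency.
n_trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(n_trials))
freq = heads / n_trials
print(freq)  # typically close to 0.5
```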
1.3 Bayesian interpretation
Probability is a coherent measure of degree of belief or information of a rational agent. An updating rule (Bayes' theorem) is needed to change beliefs in light of new information.
Advantage: it allows assigning probabilities to hypotheses or parameters. Criticism: it requires an explicit choice of prior (but there are principles for 'non-informative' priors).
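The Bayesian updating rule can be sketched numerically (all numbers here are hypothetical, chosen only for illustration): a prior degree of belief is revised by the likelihood of the observed evidence.

```python
# Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E).
# Hypothetical values for a generic hypothesis H and evidence E.
prior = 0.01            # P(H): prior degree of belief in H
p_e_given_h = 0.95      # P(E | H): likelihood of the evidence under H
p_e_given_not_h = 0.05  # P(E | not H)

# Total probability of the evidence, then the posterior.
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e
print(round(posterior, 3))
```

Note how a strong likelihood ratio still yields a modest posterior when the prior is small; this is the role the explicit prior plays.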
1.4 Geometric / measure interpretation
For continuous phenomena one introduces a measure on a space (e.g., Lebesgue measure on an interval) and normalizes it to obtain probabilities. A point chosen "at random" is modeled via the measure.
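A quick empirical check of the geometric construction (a sketch; the subinterval is an arbitrary illustrative choice): for a point chosen uniformly on \([0,1]\), the probability of a subinterval is its length.

```python
import random

random.seed(3)

# Geometric probability: a uniform point on [0, 1] lands in a
# subinterval with probability equal to its (normalized) length.
a, b = 0.2, 0.5                     # illustrative subinterval
n = 100_000
hits = sum(a <= random.random() <= b for _ in range(n))
print(hits / n)  # close to the length b - a = 0.3
```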
1.5 Propensity (physical tendencies)
In physics and philosophy of science the probability is sometimes interpreted as a tendency or disposition of the system to produce certain outcomes: it is an objective property of the system, not merely a matter of our ignorance.
1.6 Apparent conflicts and how axiomatization resolves them
The differences often arise from how one constructs the outcome space \((\Omega,\mathcal{F},P)\). The axiomatic approach (Kolmogorov) does not impose a unique construction of \(\Omega\) or \(P\): it imposes formal properties that any construction must satisfy (non-negativity, normalization, σ-additivity). In this way:
- The classical construction is a special case (finite space with equal weights).
- Lebesgue measure provides the geometric construction for continuous spaces.
- The frequentist approach gives an empirical justification for certain constructions of \(P\).
- The Bayesian uses the same axiomatic structure but interprets the assigned value as a degree of belief.
Thus there is no mathematical contradiction: the different readings operate at a semantic/interpretative level but converge on the same mathematical language.
2. Probability as a measure — technical details
2.1 Probability space
The formal model is the triple \((\Omega,\mathcal{F},P)\), where:
- \(\Omega\): the set of outcomes (sample space).
- \(\mathcal{F}\): a σ-algebra of measurable subsets (closed under complements and countable unions).
- \(P:\mathcal{F}\to[0,1]\): a measure with \(P(\Omega)=1\) and σ-additivity.
Example of a σ-algebra: on \(\Omega=\mathbb{R}\) the standard σ-algebra is the Borel σ-algebra \(\mathcal{B}(\mathbb{R})\), generated by open intervals; one often also works with the completion with respect to a measure (e.g., Lebesgue measure).
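On a finite space the definition can be verified mechanically (a minimal sketch: \(\Omega\) is a toy four-point space, \(\mathcal{F}\) its power set, \(P\) the uniform measure, all chosen for illustration):

```python
from fractions import Fraction
from itertools import combinations

# A tiny probability space: Omega finite, F = power set, P uniform.
# This packages the classical construction as a triple (Omega, F, P).
omega = frozenset({1, 2, 3, 4})

F = [frozenset(c) for r in range(len(omega) + 1)
     for c in combinations(sorted(omega), r)]   # power set of Omega

def P(A):
    return Fraction(len(A), len(omega))         # uniform measure

# F is closed under complement and (finite) union, as a σ-algebra requires.
assert all(omega - A in F for A in F)
assert all(A | B in F for A in F for B in F)

# Kolmogorov's axioms: non-negativity, normalization, additivity.
assert all(P(A) >= 0 for A in F)
assert P(omega) == 1
assert all(P(A | B) == P(A) + P(B) for A in F for B in F if not A & B)
print("axioms verified")
```

On a finite space additivity over pairs of disjoint sets suffices; σ-additivity proper only becomes a genuine restriction on infinite spaces.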
2.2 Random variables and measurability
A random variable is a measurable function \(X:(\Omega,\mathcal{F})\to(\mathbb{R},\mathcal{B}(\mathbb{R}))\).
Measurability means that for every Borel set \(B\) the preimage \(X^{-1}(B)\in\mathcal{F}\). This ensures we can speak of the probability that \(X\) falls in a given set and define the pushforward distribution \(P_X\).
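The pushforward construction can be made concrete on a finite space (a sketch; the parity variable on a die roll is an illustrative choice — on a finite \(\Omega\) every function is automatically measurable):

```python
from fractions import Fraction
from collections import Counter

# Pushforward distribution P_X(B) = P(X^{-1}(B)) for a toy variable:
# X maps a fair die roll to its parity.
omega = range(1, 7)
weight = Fraction(1, 6)             # uniform P on each outcome

def X(w):
    return w % 2                    # the random variable (parity)

# Accumulate the P-weight of each preimage X^{-1}({value}).
P_X = Counter()
for w in omega:
    P_X[X(w)] += weight

print(dict(P_X))  # parities 0 and 1 each carry probability 1/2
```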
2.3 Expectation and the Lebesgue integral
The expectation of a random variable is the Lebesgue integral with respect to the probability measure:
\[
\mathbb{E}[X] = \int_\Omega X(\omega)\,dP(\omega).
\]
This formalism is more general and robust than the Riemann integral: it allows handling heavy-tailed variables, discontinuous functions, convergence issues, etc.
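The integral over \(\Omega\) can be approximated by averaging \(X\) over samples drawn from \(P\) (a Monte Carlo sketch; the choice \(X(\omega)=\omega^2\) with \(\omega\) uniform on \([0,1]\), for which \(\mathbb{E}[X]=1/3\), is illustrative):

```python
import random

random.seed(1)

# Monte Carlo approximation of an expectation:
# E[X] ~ (1/n) * sum of X(omega_i) with omega_i sampled from P.
# Here X(omega) = omega**2, omega uniform on [0, 1], so E[X] = 1/3.
n = 200_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
print(estimate)  # close to 1/3
```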
2.4 Main notions of convergence
- Almost sure convergence (a.s.): \(X_n(\omega)\to X(\omega)\) for every \(\omega\) outside a set of probability zero.
- Convergence in probability: \(P(|X_n-X|>\varepsilon)\to 0\) for every \(\varepsilon>0\).
- Convergence in \(L^1\): \(\mathbb{E}[|X_n-X|]\to 0\).
Fundamental results (law of large numbers, central limit theorems) are expressed in these terms.
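Convergence in probability can be illustrated empirically for the sample mean of fair coin flips, in the spirit of the weak law of large numbers (a sketch; the threshold \(\varepsilon = 0.05\) and the sample sizes are illustrative choices):

```python
import random

random.seed(2)

# Empirical illustration of convergence in probability:
# P(|mean of n flips - 1/2| > eps) shrinks as n grows.
def tail_prob(n, eps=0.05, reps=1000):
    """Estimate P(|sample mean of n fair flips - 1/2| > eps)."""
    bad = 0
    for _ in range(reps):
        mean = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            bad += 1
    return bad / reps

probs = [tail_prob(n) for n in (10, 100, 1000)]
print(probs)  # decreasing toward 0
```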
3. Required axiomatic derivations
3.1 Subadditivity (Boole)
Statement. For any sequence of events \(A_1,A_2,\dots\in\mathcal{F}\) we have
\[
P\Big(\bigcup_{n=1}^{\infty} A_n\Big) \le \sum_{n=1}^{\infty} P(A_n).
\]
Proof. Define the disjoint sets
\[
B_1 = A_1, \qquad B_n = A_n \setminus \bigcup_{k=1}^{n-1} A_k \quad (n \ge 2).
\]
Then the \(B_n\) are disjoint and \(\bigcup_n B_n = \bigcup_n A_n\). By σ-additivity,
\[
P\Big(\bigcup_n A_n\Big) = P\Big(\bigcup_n B_n\Big) = \sum_n P(B_n).
\]
Since \(B_n\subseteq A_n\), it follows that \(P(B_n)\le P(A_n)\); summing term by term yields the inequality.
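The inequality can be checked numerically on a small uniform space (a sketch; the three overlapping events are chosen only for illustration — equality would hold if they were disjoint):

```python
from fractions import Fraction

# Numerical check of Boole's inequality on a small uniform space:
# P(union of A_i) <= sum of P(A_i).
omega = set(range(10))

def P(A):
    return Fraction(len(A), len(omega))

events = [{0, 1, 2}, {2, 3}, {3, 4, 5}]   # overlapping events
union = set().union(*events)

assert P(union) <= sum(P(A) for A in events)
print(P(union), sum(P(A) for A in events))  # 3/5 4/5
```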
3.2 Inclusion–exclusion principle
For a finite number of events \(A_1,\dots,A_n\) we have
\[
P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i\cap A_j) + \sum_{i<j<k} P(A_i\cap A_j\cap A_k) - \cdots + (-1)^{n+1}\,P(A_1\cap\cdots\cap A_n).
\]
Proof (idea). For each \(\omega\) consider the indicator functions \(1_{A_i}(\omega)\). Expanding the combinatorial identity \(1_{\bigcup_i A_i} = 1 - \prod_i (1 - 1_{A_i})\) gives the correct pointwise count; integrating with respect to \(P\) yields the formula for probabilities. An alternative proof proceeds by induction on \(n\) using the identity
\[
P(A\cup B) = P(A) + P(B) - P(A\cap B).
\]
Practical note: the formula is exact, but its \(2^n - 1\) terms make it impractical for large \(n\); one therefore resorts to the Bonferroni inequalities or to probabilistic estimates.
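On a small uniform space the alternating sum can be computed term by term and checked against the direct probability of the union (a sketch; the three events are illustrative):

```python
from fractions import Fraction
from itertools import combinations

# Inclusion-exclusion on a small uniform space, checked against the
# direct probability of the union.
omega = set(range(12))

def P(A):
    return Fraction(len(A), len(omega))

events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7}]

def inclusion_exclusion(events):
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)          # +, -, +, ... by term order
        for combo in combinations(events, r):
            total += sign * P(set.intersection(*combo))
    return total

assert inclusion_exclusion(events) == P(set().union(*events))
print(inclusion_exclusion(events))  # 2/3
```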
4. References
- A. N. Kolmogorov — Foundations of the Theory of Probability (1933).
- P. Billingsley — Probability and Measure.
- R. Durrett — Probability: Theory and Examples.
- J. F. C. Kingman — Poisson Processes.