Probability theory is the mathematical framework for quantifying uncertainty. In its modern formulation, established by Andrey Kolmogorov in 1933, probability is rooted in measure theory, providing a rigorous foundation for statistical inference, stochastic processes, and information theory.

The Probability Space

A formal probability model is defined by a triplet $(\Omega, \mathcal{F}, P)$ , known as a probability space. Each component of this triplet serves a distinct mathematical purpose in capturing the structure of random phenomena.

The sample space $\Omega$ is a non-empty set containing all possible outcomes of an experiment. An element $\omega \in \Omega$ represents a single, highly specific outcome.

The event space $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$. A collection of subsets $\mathcal{F} \subseteq 2^\Omega$ is a $\sigma$-algebra if it satisfies three conditions:

$\Omega \in \mathcal{F}$.
If $A \in \mathcal{F}$, then its complement $A^c \in \mathcal{F}$.
If $A_1, A_2, \dots \in \mathcal{F}$, then their countable union $\bigcup_{i=1}^\infty A_i \in \mathcal{F}$.

Elements of $\mathcal{F}$ are called events. Restricting to a $\sigma$-algebra (rather than the entire power set $2^\Omega$) is necessary for uncountable sample spaces such as the real line $\mathbb{R}$: under the axiom of choice, no countably additive, translation-invariant measure that assigns intervals their length can be defined on every subset of $\mathbb{R}$. Vitali sets are the standard counterexample, and the Banach-Tarski paradox is a related phenomenon in $\mathbb{R}^3$.
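
On a finite sample space the three conditions can be checked mechanically. The following is a minimal Python sketch rather than a general-purpose tool: the function name `is_sigma_algebra` and the example families are illustrative assumptions, and the check relies on finiteness, where closure under pairwise unions already gives closure under countable unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the three sigma-algebra conditions for subsets of a finite omega."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # Condition 1: the whole sample space is an event.
    if omega not in family:
        return False
    # Condition 2: closed under complement.
    if any(omega - a not in family for a in family):
        return False
    # Condition 3: closed under union. For a finite family, pairwise
    # closure is enough, since larger unions are built pair by pair.
    return all(a | b in family for a, b in combinations(family, 2))

omega = {1, 2, 3, 4}
trivial = [set(), omega]                      # smallest sigma-algebra
generated = [set(), {1, 2}, {3, 4}, omega]    # generated by the event {1, 2}
not_closed = [set(), {1}, {2}, omega]         # lacks complements such as {2, 3, 4}

print(is_sigma_algebra(omega, trivial))     # True
print(is_sigma_algebra(omega, generated))   # True
print(is_sigma_algebra(omega, not_closed))  # False
```

The empty set never has to be listed as a separate condition: it is the complement of $\Omega$, so the first two conditions force it into the family.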

The probability measure $P$ is a function $P: \mathcal{F} \to [0, 1]$ satisfying Kolmogorov’s axioms:

Non-negativity: $P(A) \ge 0$ for all $A \in \mathcal{F}$.
Unit measure: $P(\Omega) = 1$.
Countable additivity: For any countable sequence of pairwise disjoint events $A_1, A_2, \dots$ (where $A_i \cap A_j = \emptyset$ for $i \neq j$), $P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)$.

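Several basic properties follow directly from these axioms. For example, the probability of the empty set must be $0$. Apply countable additivity to the pairwise disjoint sequence $A_1 = \Omega$, $A_2 = A_3 = \dots = \emptyset$, whose union is $\Omega$: $P(\Omega) = P(\Omega) + \sum_{i=2}^\infty P(\emptyset)$. Since $P(\Omega) = 1$ is finite, the infinite sum must equal $0$, which forces $P(\emptyset) = 0$. Finite additivity then follows by padding any finite disjoint collection with empty sets.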
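
To make the axioms concrete, here is a small Python sketch of a finite probability space. The fair six-sided die, the uniform measure, and the helper name `prob` are assumptions chosen for illustration; exact fractions are used so the identities hold without rounding.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # sample space of a fair die (assumed)

def prob(event):
    """Uniform measure: P(A) = |A| / |Omega| for A a subset of omega."""
    return Fraction(len(frozenset(event) & omega), len(omega))

a = {1, 2}
b = {5, 6}          # disjoint from a
complement_a = omega - a

print(prob(omega))                          # unit measure: 1
print(prob(set()))                          # empty set: 0
print(prob(a | b) == prob(a) + prob(b))     # additivity on disjoint events: True
print(prob(complement_a) == 1 - prob(a))    # complement rule: True
```

The complement rule in the last line is itself a consequence of additivity, because $A$ and $A^c$ are disjoint and their union is $\Omega$.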

Independence and Conditional Probability

Two events $A$ and $B$ are independent if the occurrence of one does not alter the probability of the other. Mathematically, this is defined as: $P(A \cap B) = P(A)P(B)$

When events are not independent, partial information changes our uncertainty. The conditional probability of an event $A$ given that event $B$ has occurred (with $P(B) > 0$ ) is defined as: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Rearranging this definition yields the multiplication rule $P(A \cap B) = P(A \mid B)P(B)$. Exchanging the roles of $A$ and $B$ gives $P(A \cap B) = P(B \mid A)P(A)$ whenever $P(A) > 0$. Equating the two expressions and dividing by $P(B)$ gives Bayes’ Theorem, a foundational result tying forward and inverse probabilities: $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$

The denominator $P(B)$ is often expanded using the Law of Total Probability. For a partition $A_1, A_2, \dots, A_n$ of the sample space $\Omega$ with each $P(A_i) > 0$, we have: $P(B) = \sum_{i=1}^n P(B \mid A_i)P(A_i)$
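
The definitions in this section can be verified on a small finite example. The sketch below assumes a fair six-sided die with the uniform measure; the helper names `prob`, `cond`, and `independent` and the chosen events are hypothetical, picked only to exercise each formula.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # fair die, uniform measure (assumed)

def prob(event):
    return Fraction(len(frozenset(event) & omega), len(omega))

def cond(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    a, b = frozenset(a), frozenset(b)
    if prob(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

def independent(a, b):
    a, b = frozenset(a), frozenset(b)
    return prob(a & b) == prob(a) * prob(b)

even, at_most_4, at_most_3 = {2, 4, 6}, {1, 2, 3, 4}, {1, 2, 3}
print(independent(even, at_most_4))   # True:  1/3 == 1/2 * 2/3
print(independent(even, at_most_3))   # False: 1/6 != 1/2 * 1/2
print(cond(even, at_most_3))          # 1/3

# Law of Total Probability over the partition {1,2}, {3,4}, {5,6}.
partition = [{1, 2}, {3, 4}, {5, 6}]
b = {2, 3, 5, 6}
total = sum(cond(b, a_i) * prob(a_i) for a_i in partition)
print(total == prob(b))               # True

# Bayes' Theorem for the cell {5, 6}, compared with the direct definition.
bayes = cond(b, partition[2]) * prob(partition[2]) / total
print(bayes == cond(partition[2], b)) # True
```

Note that `even` and `at_most_4` are independent even though both are defined in terms of the same roll: independence is a property of the probabilities, not of how the events are described.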

Medical Testing Accuracy

A disease affects 1% of a population. A diagnostic test correctly identifies the disease 99% of the time when a patient is infected (true positive). However, it also incorrectly indicates disease 5% of the time for healthy patients (false positive). A randomly selected individual tests positive.

Using Bayes' Theorem, what is the exact probability that the individual actually has the disease?
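
One way to work this out is to plug the three given rates into Bayes' Theorem, with the Law of Total Probability in the denominator. The Python sketch below does exactly that with exact fractions; the variable names are illustrative, and the only inputs are the numbers stated in the problem.

```python
from fractions import Fraction

prevalence = Fraction(1, 100)            # P(disease)
sensitivity = Fraction(99, 100)          # P(positive | disease)
false_positive_rate = Fraction(5, 100)   # P(positive | healthy)

# Law of Total Probability: P(positive) over the partition {disease, healthy}.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' Theorem: P(disease | positive).
posterior = sensitivity * prevalence / p_positive

print(posterior)         # 1/6
print(float(posterior))  # roughly 0.167
```

Despite the test's 99% sensitivity, a positive result implies only a 1-in-6 chance of disease, because healthy people outnumber infected people 99 to 1 and so contribute most of the positive results.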

Random Variables and Distributions

A random variable is not a variable, nor is it inherently random. It is a deterministic function $X: \Omega \to \mathbb{R}$ that maps outcomes to real numbers. Crucially, $X$ must be a measurable function. This means that for any Borel set $B \subseteq \mathbb{R}$ , its preimage must be an event in our $\sigma$ -algebra: $X^{-1}(B) = \{ \omega \in \Omega : X(\omega) \in B \} \in \mathcal{F}$

The probability distribution of $X$ is completely determined by its Cumulative Distribution Function (CDF), $F_X(x)$, defined as: $F_X(x) = P(X \le x) = P(\{ \omega \in \Omega : X(\omega) \le x \})$. Every valid CDF is right-continuous and monotonically non-decreasing, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
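
The listed CDF properties are easy to see on a step function. The sketch below assumes $X$ is the face shown by a fair six-sided die; the function name `die_cdf` is hypothetical.

```python
import math
from fractions import Fraction

def die_cdf(x):
    """F_X(x) = P(X <= x) for a fair six-sided die (assumed example)."""
    k = max(0, min(6, math.floor(x)))  # number of faces that are <= x
    return Fraction(k, 6)

# Limits at the two ends of the real line.
print(die_cdf(-5), die_cdf(100))             # 0 1

# Right-continuity at the jump x = 1: the value at the jump point already
# includes the jump, so approaching from the right changes nothing.
print(die_cdf(1 - 1e-9), die_cdf(1), die_cdf(1 + 1e-9))  # 0 1/6 1/6

# Monotonicity on a grid of points from -1 to 8.
grid = [i / 10 for i in range(-10, 81)]
values = [die_cdf(x) for x in grid]
print(all(u <= v for u, v in zip(values, values[1:])))   # True
```

The size of each jump, $F_X(x) - \lim_{t \to x^-} F_X(t)$, equals $P(X = x)$, which is how the CDF recovers the point masses of a discrete variable.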

Discrete vs. Continuous Distributions

A random variable is discrete if it takes values in a countable set. It is described by a Probability Mass Function (PMF) $p_X(x) = P(X = x)$. A random variable is continuous if there exists a non-negative Lebesgue-integrable function $f_X(x)$, called the Probability Density Function (PDF), such that: $F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$. For continuous variables, the probability of any single point is zero: $P(X=x) = 0$. Positive probability is carried only by sets of positive Lebesgue measure, such as intervals: $P(a < X \le b) = F_X(b) - F_X(a) = \int_a^b f_X(t) \, dt$.
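
The relationship between a PDF and its CDF can be checked numerically. The sketch below assumes an exponential distribution with a hypothetical rate of 2, chosen because its CDF has the closed form $1 - e^{-\lambda x}$; the midpoint rule stands in for the integral and is only an approximation.

```python
import math

RATE = 2.0  # hypothetical rate parameter of the exponential distribution

def pdf(t):
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0

def cdf_closed(x):
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

def cdf_numeric(x, steps=10_000):
    """Midpoint-rule approximation of the integral of pdf over [0, x]."""
    if x <= 0:
        return 0.0
    h = x / steps
    return h * sum(pdf((i + 0.5) * h) for i in range(steps))

def window(x, eps):
    """P(x - eps < X <= x + eps), computed from the closed-form CDF."""
    return cdf_closed(x + eps) - cdf_closed(x - eps)

print(abs(cdf_numeric(1.0) - cdf_closed(1.0)) < 1e-6)  # True

# Shrinking the window around x = 1 drives the probability toward zero,
# which is the sense in which P(X = 1) = 0.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, window(1.0, eps))
```

The density value `pdf(1.0)` is not itself a probability; it is the rate at which probability accumulates, so the window probability is approximately `pdf(1.0) * 2 * eps` for small `eps`.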

Expected Value: The Lebesgue Perspective

The expected value $\mathbb{E}[X]$ of a random variable is the probability-weighted average of all its possible values. In an elementary context, it is formulated as a sum for discrete variables $\sum x_i p(x_i)$ and a Riemann integral for continuous variables $\int x f(x) dx$ .

A more unified, rigorous approach utilizes the Lebesgue integral over the probability space: $\mathbb{E}[X] = \int_{\Omega} X(\omega) \, dP(\omega)$ This single definition naturally covers discrete, continuous, and mixed random variables, treating probability distributions simply as specific measures.
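
On a finite probability space the integral over $\Omega$ reduces to a sum over outcomes, which makes the two viewpoints easy to compare. The sketch below assumes two fair dice with $X$ defined as their sum; the names `e_over_omega` and `e_over_values` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes (assumed)
p_outcome = Fraction(1, len(omega))

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

# Integrate X over the probability space: sum of X(w) * P({w}) over outcomes.
e_over_omega = sum(X(w) * p_outcome for w in omega)

# Push P forward to the distribution of X, then sum x * p_X(x) over values.
pmf = Counter()
for w in omega:
    pmf[X(w)] += p_outcome
e_over_values = sum(x * p for x, p in pmf.items())

print(e_over_omega, e_over_values)  # 7 7
```

Both computations give the same number because the PMF is just the measure $P$ transported from $\Omega$ to the values of $X$.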

The expected value possesses the critical property of linearity. For any random variables $X$ and $Y$ , and constants $a, b \in \mathbb{R}$ : $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ Linearity holds identically whether $X$ and $Y$ are independent or heavily correlated.
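
Linearity under dependence can be confirmed exactly on a small case. The sketch below assumes a fair die $X$ and sets $Y = X^2$, so $Y$ is fully determined by $X$; the constants $a = 2$, $b = 3$ and the helper `expect` are arbitrary illustrative choices.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die (assumed)

def expect(f):
    """E[f(X)] for a fair die X."""
    return sum(f(x) * p for x in faces)

a, b = 2, 3
e_x = expect(lambda x: x)        # 7/2
e_y = expect(lambda x: x ** 2)   # 91/6, where Y = X^2 depends entirely on X

lhs = expect(lambda x: a * x + b * x ** 2)  # E[aX + bY] computed directly
rhs = a * e_x + b * e_y                     # a E[X] + b E[Y]

print(lhs, rhs)  # 105/2 105/2
```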

Variance and Moments

To quantify the dispersion or spread of a probability distribution around its center, we examine the second central moment, the variance: $\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. The variance is finite only when $\mathbb{E}[X^2]$ (the second moment) is finite. Unlike expectation, variance is not a linear operator. For constants $a, b$: $\text{Var}(aX + b) = a^2 \text{Var}(X)$. For the sum of two random variables, the variance is given by: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$. If $X$ and $Y$ are independent, their covariance $\text{Cov}(X, Y)$ is zero, so the variance of the sum is the sum of the variances.
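
These identities can be checked exactly on a finite probability space. The sketch below assumes two independent fair dice, with random variables written as functions on the outcomes; the helpers `expect`, `var`, and `cov` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two independent fair dice (assumed)
p = Fraction(1, len(omega))

def expect(f):
    return sum(f(w) * p for w in omega)

def var(f):
    mu = expect(f)
    return expect(lambda w: (f(w) - mu) ** 2)

def cov(f, g):
    mf, mg = expect(f), expect(g)
    return expect(lambda w: (f(w) - mf) * (g(w) - mg))

X = lambda w: w[0]  # first die
Y = lambda w: w[1]  # second die

# Both forms of the variance agree.
print(var(X), expect(lambda w: X(w) ** 2) - expect(X) ** 2)  # 35/12 35/12

# Scaling by a multiplies the variance by a^2; the shift b has no effect.
print(var(lambda w: 3 * X(w) + 5) == 9 * var(X))             # True

# Independent dice: zero covariance, so variances add.
print(cov(X, Y), var(lambda w: X(w) + Y(w)) == var(X) + var(Y))  # 0 True

# Fully dependent case X + X: the covariance term cannot be dropped.
print(var(lambda w: 2 * X(w)) == var(X) + var(X) + 2 * cov(X, X))  # True
```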

Linear Transformations of Portfolios

A quantitative analyst models the daily return of two technology stocks, A and B. Both stocks have an expected daily return of 2% and a standard deviation of 4%. The two returns are uncorrelated. The analyst constructs a portfolio that heavily weights Stock A: a long position of 3 dollars in Stock A and, as a hedge, a short position of 1 dollar in Stock B.

What is the variance of the daily return of this portfolio, $P = 3A - B$?
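
A possible worked solution applies the scaling and sum rules for variance directly. The Python sketch below uses only the figures stated in the problem, expressed as decimals; the variable names are illustrative.

```python
import math

mean_a = mean_b = 0.02   # expected daily returns
sd_a = sd_b = 0.04       # daily standard deviations
corr = 0.0               # the returns are uncorrelated
w_a, w_b = 3.0, -1.0     # portfolio weights in P = 3A - B

cov_ab = corr * sd_a * sd_b
var_p = w_a ** 2 * sd_a ** 2 + w_b ** 2 * sd_b ** 2 + 2 * w_a * w_b * cov_ab
sd_p = math.sqrt(var_p)
mean_p = w_a * mean_a + w_b * mean_b   # linearity of expectation

print(var_p)   # approximately 0.016, i.e. 160 in units of (%)^2
print(sd_p)    # approximately 0.1265
print(mean_p)  # approximately 0.04
```

The short position reduces the expected value but not the variance: the weight $-1$ enters the variance as $(-1)^2 = 1$, so with zero covariance the hedge adds risk rather than removing it.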

Limits and Asymptotic Theorems

The concepts above become considerably more powerful once we consider sequences of random variables $X_1, X_2, \dots$. Often, we are concerned with sums of independent and identically distributed (i.i.d.) random variables.

Two foundational theorems act as the bedrock for modern statistics.

Law of Large Numbers (LLN): Let $X_1, X_2, \dots$ be an i.i.d. sequence of random variables with finite expectation $\mu$. The sample average $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ converges to the expected value $\mu$ as $n \to \infty$. The Strong Law ensures almost sure convergence ($P(\lim_{n \to \infty} \bar{X}_n = \mu) = 1$), whereas the Weak Law guarantees convergence in probability.
Central Limit Theorem (CLT): If the sequence also possesses a finite variance $\sigma^2 > 0$ , the standardized sample average converges in distribution to the standard normal distribution $\mathcal{N}(0,1)$ : $\lim_{n \to \infty} P \left( \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \le z \right) = \Phi(z)$ where $\Phi(z)$ is the CDF of the standard normal distribution.

The power of the CLT lies in how few distributional assumptions it needs: beyond i.i.d. sampling and finite variance, it does not matter whether the original variable $X$ is discrete, highly skewed, or uniform. The standardized sum still approaches the bell curve, which underpins many large-sample approximations and parametric tests.
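
Both theorems can be illustrated with a short simulation. The sketch below is one possible setup, not a proof: it assumes draws from an Exponential(1) distribution (skewed, with mean and standard deviation both equal to 1), a fixed seed, and arbitrary sample sizes, and it compares the simulated frequency with $\Phi(z)$ at the single point $z = 1$.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(0)   # fixed seed so the run is reproducible
MU, SIGMA = 1.0, 1.0     # mean and standard deviation of Exponential(1)

def sample_mean(n):
    """Average of n i.i.d. Exponential(1) draws."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

# Law of Large Numbers: a large sample average lands near mu.
lln_estimate = sample_mean(200_000)

def standardized_mean(n):
    """(sample mean - mu) / (sigma / sqrt(n)), the quantity in the CLT."""
    return (sample_mean(n) - MU) / (SIGMA / math.sqrt(n))

# Central Limit Theorem: the fraction of standardized averages at or below z
# should be close to Phi(z), even though the draws are strongly skewed.
n, trials, z = 50, 4_000, 1.0
clt_fraction = sum(standardized_mean(n) <= z for _ in range(trials)) / trials
phi_z = NormalDist().cdf(z)

print(lln_estimate)
print(clt_fraction, phi_z)
```

With only 50 draws per average the match is approximate, and it is typically worse in the tails than near the center; increasing `n` tightens the agreement.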
