Probability theory, statistical inference, and data analysis.
May 2026
Probability theory is the mathematical framework for quantifying uncertainty. In its modern formulation, established by Andrey Kolmogorov in 1933, probability is rooted in measure theory, providing a rigorous foundation for statistical inference, stochastic processes, and information theory.
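A formal probability model is defined by a triplet $(\Omega, \mathcal{F}, P)$, known as a probability space. Each component of this triplet serves a distinct mathematical purpose in capturing the structure of random phenomena.
The sample space $\Omega$ is a non-empty set containing all possible outcomes of an experiment. An element $\omega \in \Omega$ represents a single, fully specified outcome.
The event space $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$. A collection $\mathcal{F}$ of subsets of $\Omega$ is a $\sigma$-algebra if it satisfies three conditions:
1. $\Omega \in \mathcal{F}$.
2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closure under complements).
3. If $A_1, A_2, \dots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closure under countable unions).
Elements of $\mathcal{F}$ are called events. The restriction to a $\sigma$-algebra (rather than the entire power set $2^{\Omega}$) is mathematically necessary when dealing with uncountably infinite sample spaces, such as the real line $\mathbb{R}$, to avoid paradoxes associated with non-measurable sets (e.g., Vitali sets on the real line, or the Banach-Tarski paradox in three dimensions).
The probability measure $P$ is a function $P: \mathcal{F} \to [0, 1]$ satisfying Kolmogorov’s axioms:
1. Non-negativity: $P(A) \ge 0$ for every $A \in \mathcal{F}$.
2. Normalization: $P(\Omega) = 1$.
3. Countable additivity: for pairwise disjoint events $A_1, A_2, \dots \in \mathcal{F}$, $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
From these axioms, foundational properties follow directly. For example, the probability of the empty set must be $0$. Since $\Omega$ and $\emptyset$ are disjoint and $\Omega \cup \emptyset = \Omega$, we have $P(\Omega) = P(\Omega) + P(\emptyset)$, hence $P(\emptyset) = 0$.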
Two events $A$ and $B$ are independent if the occurrence of one does not alter the probability of the other. Mathematically, this is defined as:
$$P(A \cap B) = P(A)\,P(B)$$
When events are not independent, partial information changes our uncertainty. The conditional probability of an event $A$ given that event $B$ has occurred (with $P(B) > 0$) is defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Rearranging this definition yields the multiplication rule $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$. This straightforward algebraic manipulation leads to Bayes’ Theorem, a foundational result tying forward and inverse probabilities:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
The denominator is often expanded using the Law of Total Probability. For a partition $\{A_i\}$ of the sample space $\Omega$, we have:
$$P(B) = \sum_i P(B \mid A_i)\,P(A_i)$$
A disease affects 1% of a population. A diagnostic test correctly identifies the disease 99% of the time when a patient is infected (true positive). However, it also incorrectly indicates disease 5% of the time for healthy patients (false positive). A randomly selected individual tests positive.
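A natural question for this example is the probability that the individual actually has the disease. The following is a sketch of that calculation using only the figures stated above; the symbols $D$ (has the disease), $D^c$ (healthy), and $+$ (positive test) are notation assumed here for illustration.

```latex
\begin{aligned}
P(D \mid +) &= \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid D^c)\,P(D^c)} \\
            &= \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} \\
            &= \frac{0.0099}{0.0594} = \frac{1}{6} \approx 0.167
\end{aligned}
```

Despite the 99% detection rate, a positive result corresponds to only about a 17% chance of disease, because healthy individuals vastly outnumber infected ones and contribute most of the positive results.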
A random variable is not a variable, nor is it inherently random. It is a deterministic function $X: \Omega \to \mathbb{R}$ that maps outcomes to real numbers. Crucially, $X$ must be a measurable function. This means that for any Borel set $B \subseteq \mathbb{R}$, its preimage must be an event in our $\sigma$-algebra:
$$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$$
The probability distribution of $X$ is completely determined by its Cumulative Distribution Function (CDF), $F_X$, defined as $F_X(x) = P(X \le x)$. Every valid CDF is right-continuous, monotonically non-decreasing, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
A random variable is discrete if it takes values in a countable set. It is described by a Probability Mass Function (PMF) $p_X(x) = P(X = x)$. A random variable is continuous if there exists a non-negative Lebesgue-integrable function $f_X$, called the Probability Density Function (PDF), such that $P(a \le X \le b) = \int_a^b f_X(x)\,dx$. For continuous variables, the probability of any single point is zero: $P(X = x) = 0$. Positive probability is carried only by sets such as intervals.
The expected value of a random variable is the probability-weighted average of all its possible values. In an elementary context, it is formulated as a sum for discrete variables, $E[X] = \sum_x x\,p_X(x)$, and a Riemann integral for continuous variables, $E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$.
A more unified, rigorous approach uses the Lebesgue integral over the probability space: $E[X] = \int_{\Omega} X(\omega)\,dP(\omega)$. This single definition covers discrete, continuous, and mixed random variables, treating probability distributions simply as specific measures.
The expected value possesses the critical property of linearity. For any random variables $X$ and $Y$ with finite expectations, and constants $a, b$: $E[aX + bY] = a\,E[X] + b\,E[Y]$. Linearity holds whether $X$ and $Y$ are independent or strongly correlated.
To quantify the dispersion or spread of a probability distribution around its center, we examine the second central moment, the variance: $\operatorname{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2$. The variance requires that $E[X^2]$ (the second moment) is finite. Unlike expectation, variance is not a linear operator. For constants $a, b$: $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$. For the sum of two random variables, the variance is given by: $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$. If $X$ and $Y$ are independent, their covariance is zero, so the variance is additive.
A quantitative analyst models the daily return of two technology stocks, A and B. Both stocks have an expected daily return of 2% and a standard deviation of 4%. The returns of the two stocks are uncorrelated. The analyst constructs a portfolio that heavily weights stock A: they hold $3 worth of Stock A and -$1 worth of Stock B (a short position) to hedge.
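The portfolio's expected daily profit and its variance follow from linearity of expectation and the variance rule for linear combinations. The sketch below simply plugs in the figures from the example; the variable names are illustrative choices, not part of the original text.

```python
import math

# Figures from the example: both stocks have mean daily return 2% and
# standard deviation 4%, and their returns are uncorrelated.
mu_a = mu_b = 0.02
sigma_a = sigma_b = 0.04
cov_ab = 0.0  # uncorrelated returns

# Dollar positions: long $3 of A, short $1 of B.
w_a, w_b = 3.0, -1.0

# Linearity of expectation: E[aX + bY] = a E[X] + b E[Y].
expected_pnl = w_a * mu_a + w_b * mu_b

# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
var_pnl = w_a**2 * sigma_a**2 + w_b**2 * sigma_b**2 + 2 * w_a * w_b * cov_ab
sd_pnl = math.sqrt(var_pnl)

print(f"expected daily P&L: {expected_pnl:.4f} dollars")
print(f"variance: {var_pnl:.4f}, standard deviation: {sd_pnl:.4f} dollars")
```

Because weights enter the variance squared, the short position in B adds risk rather than offsetting it when the two returns are uncorrelated.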
The usefulness of expectation grows considerably once we consider sequences of random variables $X_1, X_2, \dots$. Often, we are concerned with sums of independent and identically distributed (i.i.d.) random variables.
Two foundational theorems act as the bedrock for modern statistics.
Law of Large Numbers (LLN): Let $X_1, X_2, \dots$ be an i.i.d. sequence of random variables with finite expectation $\mu$. The sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ converges to the expected value $\mu$. The Strong Law ensures almost sure convergence ($\bar{X}_n \xrightarrow{a.s.} \mu$), whereas the Weak Law guarantees convergence in probability.
Central Limit Theorem (CLT): If the sequence also possesses a finite variance $\sigma^2 > 0$, the standardized sample average converges in distribution to the standard normal distribution $\mathcal{N}(0, 1)$: $\lim_{n \to \infty} P\left(\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le z\right) = \Phi(z)$, where $\Phi$ is the CDF of the standard normal distribution.
The power of the CLT stems from how few distributional assumptions it needs: whether the original variable is discrete, highly skewed, or uniform, as long as its variance is finite, the standardized sum is approximately normal for large $n$. This result underpins much large-scale modeling and many parametric tests.
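The CLT can be checked empirically. The sketch below is a small simulation under assumed settings (an Exponential(1) parent distribution, which is strongly skewed, with $n = 100$ and 3000 replications); these choices and the function name are illustrative only.

```python
import random
import statistics

def standardized_means(n, trials, rng):
    """Standardized sample means of n Exponential(1) draws (mean 1, sd 1)."""
    mu, sigma = 1.0, 1.0
    out = []
    for _ in range(trials):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / n ** 0.5))
    return out

rng = random.Random(0)  # fixed seed so the run is reproducible
z = standardized_means(n=100, trials=3000, rng=rng)

# Under the CLT these should be close to 0, 1, and 0.95 respectively.
z_mean = statistics.fmean(z)
z_sd = statistics.pstdev(z)
inside = sum(abs(v) <= 1.96 for v in z) / len(z)
print(f"mean={z_mean:.3f}, sd={z_sd:.3f}, P(|Z|<=1.96)~{inside:.3f}")
```

Changing the parent distribution or lowering `n` shows how the quality of the normal approximation depends on sample size and skewness.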
Hypothesis testing is a formal mathematical framework for making inferential decisions about population parameters based on sample data. It provides a structured methodology to evaluate whether observed data yields sufficient evidence to reject a predefined baseline assumption.
The foundation of any statistical test consists of two mutually exclusive statements about a population parameter: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$ or $H_a$).
The null hypothesis ($H_0$) typically represents a state of no effect, no difference, or the historical baseline. It is the hypothesis that is assumed true until statistical evidence indicates otherwise.
The alternative hypothesis ($H_1$) represents the claim or theory the researcher seeks to support; it is adopted only when the sample data provide sufficient evidence to reject $H_0$.
For a population mean $\mu$ evaluated against a hypothesized value $\mu_0$, tests are formulated in one of three ways:
1. Two-tailed: $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$.
2. Left-tailed: $H_0: \mu \ge \mu_0$ versus $H_1: \mu < \mu_0$.
3. Right-tailed: $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$.
The objective of the testing procedure is not to “prove” $H_0$, but rather to determine if there is enough evidence to reject it in favor of $H_1$.
Because hypothesis testing relies on sample data rather than an exhaustive population census, inferential decisions are subject to probabilistic errors.
A Type I Error occurs when the null hypothesis is rejected when it is, in fact, true in the population. This is equivalent to a false positive. The probability of committing a Type I error is denoted by $\alpha$, which is also the significance level of the test.
A Type II Error occurs when the null hypothesis is not rejected when the alternative hypothesis is true. This is a false negative. The probability of a Type II error is denoted by $\beta$.
The power of a statistical test is the probability of correctly rejecting a false null hypothesis. It is the complement of the Type II error rate: $\text{Power} = 1 - \beta$.
Power depends on several factors: the significance level $\alpha$, the sample size $n$, the true effect size (the magnitude of the difference between the true parameter and $\mu_0$), and the population variance $\sigma^2$. Increasing sample size generally increases the power of a test.
A test statistic is a standardized value calculated from sample data during a hypothesis test. It measures the degree of agreement between the sample data and the null hypothesis.
Consider testing the mean $\mu$ of a normally distributed population with a known variance $\sigma^2$. Let $X_1, \dots, X_n$ be an independent and identically distributed (i.i.d.) random sample from $\mathcal{N}(\mu, \sigma^2)$. The sample mean $\bar{X}$ follows a normal distribution:
$$\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
Under the null hypothesis $H_0: \mu = \mu_0$, the test statistic $Z$ is constructed by standardizing $\bar{X}$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
If $H_0$ is true, the test statistic follows a standard normal distribution, $Z \sim \mathcal{N}(0, 1)$. This null distribution determines how probable any observed value of the test statistic is.
The rejection region is the set of values for the test statistic that leads to the rejection of $H_0$. Its boundaries are determined by the critical values, which depend on the pre-specified significance level $\alpha$ and the directionality of the test.
For a two-tailed test at significance level $\alpha$, the critical values are $\pm z_{\alpha/2}$. The decision rule is: Reject $H_0$ if $|Z| > z_{\alpha/2}$.
For instance, when $\alpha = 0.05$, $z_{0.025} \approx 1.96$. Therefore, if the calculated $Z$ falls outside the interval $[-1.96, 1.96]$, $H_0$ is rejected.
A factory produces steel cables with a specified mean breaking strength of $10,000$ N and a known standard deviation of $400$ N. A quality control engineer suspects the machinery needs calibration and takes a random sample of $n = 50$ cables. The sample mean breaking strength is $9,880$ N. The engineer runs a two-tailed hypothesis test with $\alpha = 0.05$.
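The engineer's test can be carried out numerically as a one-sample, two-tailed Z-test. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the numbers are those given in the example.

```python
from math import sqrt
from statistics import NormalDist

# Values stated in the example.
mu0, sigma, n, xbar, alpha = 10_000.0, 400.0, 50, 9_880.0, 0.05

std_normal = NormalDist()  # standard normal, mean 0 and sd 1

# Test statistic: Z = (xbar - mu0) / (sigma / sqrt(n)).
z = (xbar - mu0) / (sigma / sqrt(n))

# Two-tailed critical value and p-value.
z_crit = std_normal.inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject = abs(z) > z_crit

print(f"z = {z:.4f}, critical value = {z_crit:.4f}, p-value = {p_value:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Here $z \approx -2.12$ lies outside $[-1.96, 1.96]$, so at the 5% level the data are inconsistent with a mean breaking strength of 10,000 N.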
Modern statistical software generally reports the p-value, an alternative to the critical value approach that provides more granular information regarding the strength of the evidence against $H_0$.
The p-value is defined as the probability, calculated under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed.
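For the standard normal test statistic $Z$ with observed value $z$:
1. Right-tailed test: $p = P(Z \ge z)$.
2. Left-tailed test: $p = P(Z \le z)$.
3. Two-tailed test: $p = 2\,P(Z \ge |z|)$.
Decision Rule:
1. If $p \le \alpha$, reject $H_0$.
2. If $p > \alpha$, fail to reject $H_0$.
A smaller p-value constitutes stronger evidence against the null hypothesis. It is crucial to note that the p-value is not the probability that the null hypothesis is true ($P(H_0 \mid \text{data})$). It is the probability, given the null hypothesis, of data at least as extreme as those observed.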
In practical applications, the population variance $\sigma^2$ is almost always unknown. Replacing the population standard deviation $\sigma$ with the sample standard deviation $S$ changes the distribution of the test statistic.
When $X_i \sim \mathcal{N}(\mu, \sigma^2)$ but $\sigma$ is unknown, the test statistic follows a Student’s t-distribution with $n - 1$ degrees of freedom ($\nu = n - 1$):
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}$$
The t-distribution is symmetric and bell-shaped like the standard normal distribution but possesses heavier tails. These heavier tails place more probability in the extremes, accounting for the additional uncertainty incurred by estimating the variance from a finite sample. As $n \to \infty$, the t-distribution converges to the standard normal distribution $\mathcal{N}(0, 1)$.
When conducting multiple hypothesis tests simultaneously on a single dataset, the probability of committing at least one Type I error compounds. If a researcher conducts $m$ independent tests each at significance level $\alpha$, the family-wise error rate (FWER), the probability of making one or more false discoveries, is given by:
$$\text{FWER} = 1 - (1 - \alpha)^m$$
For example, performing 20 tests at $\alpha = 0.05$ yields an FWER of $1 - 0.95^{20} \approx 0.64$. Without correction, at least one false positive is more likely than not.
The simplest, and a highly conservative, method to control the FWER is the Bonferroni correction. To maintain a given family-wise $\alpha$, each individual test is evaluated at a newly adjusted significance level:
$$\alpha_{\text{adj}} = \frac{\alpha}{m}$$
If 20 tests are conducted and the desired global false positive rate is 5%, each individual p-value must be compared against $0.05 / 20 = 0.0025$.
While simple and guaranteed to bound the FWER under any form of dependence among tests, the Bonferroni correction reduces statistical power, and the resulting rise in Type II error rates becomes severe when the number of tests ($m$) is very large, as is common in genomics and machine learning.
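The arithmetic behind the FWER and the Bonferroni adjustment is easy to reproduce. This is a small sketch with hypothetical helper names (`fwer`, `bonferroni_alpha`); the FWER formula assumes independent tests, as in the text.

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m

m, alpha = 20, 0.05
uncorrected = fwer(alpha, m)
adjusted = bonferroni_alpha(alpha, m)
corrected = fwer(adjusted, m)

print(f"uncorrected FWER: {uncorrected:.4f}")
print(f"per-test level after Bonferroni: {adjusted:.4f}")
print(f"FWER after Bonferroni: {corrected:.4f}")
```

The corrected FWER comes out slightly below the nominal 5%, which illustrates the conservativeness of the method.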
Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Whereas probability theory deduces the behavior of a sample given known population parameters, statistical inference deduces the population parameters based on an observed sample.
Formally, we observe a sample $x = (x_1, \dots, x_n)$ which we assume is generated from a probability model belonging to a known family of distributions $\{f(x; \theta) : \theta \in \Theta\}$, where $\theta$ is an unknown parameter vector and $\Theta$ is the parameter space. The objective is to estimate $\theta$ or make decisions about it.
A point estimator is any statistic (a function of the data that does not depend on any unknown parameters) used to infer the value of an unknown parameter in a statistical model. We denote the estimator as $\hat{\theta}(X)$ and the estimate (the realized value for a specific sample $x$) as $\hat{\theta}(x)$.
How do we decide if an estimator is “good”? We evaluate its statistical properties across all possible samples of size $n$.
1. Unbiasedness: An estimator $\hat{\theta}$ is unbiased for $\theta$ if its expected value over all possible samples equals the true parameter value: $E[\hat{\theta}] = \theta$. The bias of an estimator is defined as $\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$. While unbiasedness is intuitively appealing, it is not always necessary, especially if allowing a small bias significantly reduces the estimation error.
2. Mean Squared Error (MSE): A common measure of the quality of an estimator is its Mean Squared Error: $\operatorname{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$. Using the definitions of variance and bias, the MSE can be decomposed into: $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2$. If an estimator is unbiased, its MSE is exactly its variance.
3. Consistency: An estimator $\hat{\theta}_n$ (the subscript emphasizes dependence on sample size) is consistent if it converges in probability to the true parameter value as the sample size $n \to \infty$: $\lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| > \varepsilon\right) = 0$ for every $\varepsilon > 0$. Consistency means that as data accumulate, the estimator concentrates ever more tightly around the true parameter.
The Method of Moments (MoM) is one of the oldest methods of deriving point estimators. It is based on equating the sample moments to the population moments, thereby obtaining a system of equations to solve for the unknown parameters.
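The $k$-th population moment is a function of the parameter vector $\theta$: $\mu_k(\theta) = E_{\theta}[X^k]$. The $k$-th sample moment is calculated from the data: $m_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k$.
If we have $p$ unknown parameters, $\theta = (\theta_1, \dots, \theta_p)$, we set up a system of $p$ equations: $\mu_k(\theta) = m_k$ for $k = 1, \dots, p$. Solving this system yields the Method of Moments estimator $\hat{\theta}_{MoM}$.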
Maximum Likelihood Estimation is a formal, unified approach to parameter estimation. It frames estimation as finding the parameter value that makes the observed data “most probable” or “most likely” to have occurred.
Let $f(x; \theta)$ be the probability density function (PDF) or probability mass function (PMF) of our distribution. Given an observed sample $x_1, \dots, x_n$ of independent and identically distributed (i.i.d.) random variables, the likelihood function is the joint density evaluated at the observed data, viewed as a function of the parameter $\theta$:
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
The Maximum Likelihood Estimator $\hat{\theta}_{MLE}$ is the value that maximizes $L(\theta)$. Because the natural logarithm is a strictly increasing function, it is computationally and analytically easier to maximize the log-likelihood function:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)$$
Assuming standard regularity conditions (e.g., differentiability with respect to $\theta$ and the support of the distribution not depending on $\theta$), the MLE can be found by solving the score equation $\frac{\partial \ell(\theta)}{\partial \theta} = 0$ and verifying that the second derivative is negative at the solution (a local maximum).
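Under mild regularity conditions, the MLE has remarkable asymptotic properties:
1. Consistency: $\hat{\theta}_{MLE}$ converges in probability to the true $\theta$.
2. Asymptotic normality: $\sqrt{n}\,(\hat{\theta}_{MLE} - \theta)$ converges in distribution to $\mathcal{N}\left(0, I(\theta)^{-1}\right)$, where $I(\theta)$ is the Fisher Information of a single observation.
3. Asymptotic efficiency: the asymptotic variance attains the Cramér-Rao Lower Bound.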
We have an i.i.d. sample $X_1, X_2, \dots, X_n$ from $U(0, \theta)$. We previously saw that the MoM estimator is $\hat{\theta}_{MoM} = 2\bar{X}$. Now, let's derive the MLE.
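One way to carry out the derivation is sketched below. The indicator notation $\mathbf{1}\{\cdot\}$ and the order-statistic symbol $x_{(n)}$ are notation assumed here for illustration.

```latex
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbf{1}\{0 \le x_i \le \theta\}
           = \theta^{-n}\,\mathbf{1}\{\theta \ge x_{(n)}\},
           \qquad x_{(n)} = \max_{1 \le i \le n} x_i \\
\hat{\theta}_{MLE} &= \arg\max_{\theta > 0} L(\theta) = X_{(n)}
\end{aligned}
```

The likelihood is zero for $\theta < x_{(n)}$ and strictly decreasing in $\theta$ for $\theta \ge x_{(n)}$, so it is maximized at the smallest admissible value, the sample maximum. The score equation cannot be used here because the support of the distribution depends on $\theta$. Note also that $E[X_{(n)}] = \frac{n}{n+1}\theta$, so this MLE is biased downward, although the bias vanishes as $n \to \infty$.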
A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of the sample $X$ given $T(X)$ does not depend on $\theta$. Intuitively, $T(X)$ contains all the information in the sample about $\theta$; no other function of the data can provide further information about the value of $\theta$.
Proving sufficiency directly via conditional probabilities can be tedious. Instead, we use the Fisher-Neyman Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if the joint PDF (or PMF) of the sample can be factored into two components: $f(x; \theta) = g(T(x); \theta)\,h(x)$, where $h$ is a non-negative function that depends only on the data, and $g$ is a non-negative function that depends on the parameter $\theta$ and on the data only through the statistic $T(x)$.
Sufficiency plays a vital role in optimal estimation. The Rao-Blackwell theorem formalizes this: if you have an unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$, the conditional expectation $\tilde{\theta} = E[\hat{\theta} \mid T]$ defines a new estimator that is also unbiased and has a variance less than or equal to the variance of the original estimator $\hat{\theta}$. Consequently, the search for optimal estimators can be restricted to functions of a sufficient statistic.
When developing estimators, mathematical statisticians want to know the absolute best possible variance an unbiased estimator can achieve. Does an absolute limit exist, beyond which no estimator can improve?
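Yes, under regularity conditions (primarily that the parameter space is an open interval and the support does not depend on $\theta$), the Cramér-Rao Lower Bound places a theoretical lower limit on the variance of any unbiased estimator $\hat{\theta}$ of a parameter $\theta$ based on $n$ i.i.d. observations: $\operatorname{Var}(\hat{\theta}) \ge \frac{1}{n\,I(\theta)}$, where $I(\theta)$ is the Fisher Information of a single observation, defined as:
$$I(\theta) = E\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^2\right] = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right]$$
If the variance of an unbiased estimator exactly equals the Cramér-Rao lower bound, it is deemed efficient (and is therefore the Uniformly Minimum Variance Unbiased Estimator, UMVUE). As noted earlier, Maximum Likelihood Estimators asymptotically achieve this lower bound, which helps explain their prevalence in modern statistics.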
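While point estimators output a single best guess for a parameter ($\hat{\theta}$), interval estimators yield a range of plausible values constructed such that the random interval covers the true parameter with a specified probability $1 - \alpha$, referred to as the confidence level.
Formally, a $1 - \alpha$ confidence interval for $\theta$ is defined by two random variables $L(X)$ and $U(X)$ such that:
$$P\left(L(X) \le \theta \le U(X)\right) = 1 - \alpha$$
The most common technique to systematically derive confidence intervals relies on finding a pivotal quantity (or “pivot”). A random variable $Q(X, \theta)$ is a pivot if:
1. It is a function of the data $X$ and the parameter $\theta$, and of no other unknown quantity.
2. Its probability distribution does not depend on $\theta$ or on any other unknown parameter.
If a pivot exists, constructing an interval estimator proceeds straightforwardly by finding constants $a$ and $b$ from the known distribution of $Q$ such that $P\left(a \le Q(X, \theta) \le b\right) = 1 - \alpha$. We then algebraically invert the inequalities inside the probability statement to isolate $\theta$ in the center: $P\left(L(X) \le \theta \le U(X)\right) = 1 - \alpha$.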
We have a sample $X_1, \dots, X_n$ from a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ where variance $\sigma^2$ is known and we must determine a $1-\alpha$ confidence interval for the mean $\mu$.
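For this setting the standard pivot is $Z = (\bar{X} - \mu) / (\sigma / \sqrt{n}) \sim \mathcal{N}(0, 1)$, and inverting $-z_{\alpha/2} \le Z \le z_{\alpha/2}$ gives $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. The sketch below implements that interval; the function name `z_interval` and the sample values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_interval(sample, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for a normal mean with known sigma."""
    n = len(sample)
    xbar = fmean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical measurements; sigma is treated as known.
data = [4.9, 5.3, 5.1, 4.7, 5.0, 5.2, 4.8, 5.4]
lo, hi = z_interval(data, sigma=0.25)
print(f"95% CI for mu: ({lo:.3f}, {hi:.3f})")
```

The interval is centered at the sample mean, and its width shrinks at rate $1/\sqrt{n}$, so quadrupling the sample size halves the width.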
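When finite-sample pivot methods are intractable, statisticians leverage the asymptotic distribution of the Maximum Likelihood Estimator to construct approximate confidence regions. Since $\hat{\theta}_{MLE}$ is approximately distributed as $\mathcal{N}\left(\theta, 1 / J_n(\hat{\theta})\right)$, where $J_n(\hat{\theta})$ is the observed Fisher Information evaluated at the MLE, we use the asymptotic standard error $\operatorname{SE}(\hat{\theta}) = 1 / \sqrt{J_n(\hat{\theta})}$. This yields the standard large-sample Wald confidence interval of the form:
$$\hat{\theta}_{MLE} \pm z_{\alpha/2}\,\operatorname{SE}(\hat{\theta})$$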
Modern inference requires balancing various optimization properties:
Statistical inference provides the comprehensive foundation for drawing meaningful, mathematically strict conclusions from randomized noisy data under uncertainty.
Frequentist statistics interprets probability strictly as the long-run expected frequency of repeatable events. Bayesian statistics interprets probability fundamentally differently: as a degree of belief or a quantification of uncertainty. The Bayesian paradigm provides a rigorous mathematical framework for evaluating and updating our state of knowledge as new data becomes available.
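The core operating principle of Bayesian inference is Bayes’ Theorem, a mathematical identity derived from the definition of conditional probability:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
Where:
1. $P(\theta \mid D)$ is the posterior: the updated belief about the parameter $\theta$ after observing the data $D$.
2. $P(D \mid \theta)$ is the likelihood: the probability of the data given a parameter value.
3. $P(\theta)$ is the prior: the belief about $\theta$ before observing the data.
4. $P(D)$ is the evidence (or marginal likelihood): the normalizing constant obtained by integrating the numerator over all values of $\theta$.
Because the denominator $P(D)$ does not depend on $\theta$, Bayes’ theorem is often written as a proportionality:
$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$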
The differences between the two schools of thought run deep, impacting how inference is conducted and interpreted.
The choice of the prior distribution is a critical and sometimes criticized aspect of Bayesian analysis. Priors encode expert knowledge and initial assumptions.
An informative prior asserts specific, strong beliefs about the parameter space. For example, if measuring human height, a prior tightly clustered around a typical adult height is highly informative. An uninformative (or diffuse) prior spreads probability mass across the parameter space, attempting to let the data “speak for itself.” A uniform distribution is a common example, though true non-informativeness is mathematically subtle.
A prior is conjugate to a specific likelihood function if the resulting posterior distribution belongs to the same probability family as the prior. Conjugacy provides immense mathematical convenience because the posterior can be derived algebraically without complex numerical integration.
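Examples of natural conjugate pairs include:
1. Beta prior with a Binomial (or Bernoulli) likelihood.
2. Gamma prior with a Poisson likelihood.
3. Normal prior on the mean with a Normal likelihood of known variance.
4. Dirichlet prior with a Multinomial likelihood.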
Consider the Beta-Binomial model. If the prior for the probability of success $\theta$ is $\text{Beta}(\alpha, \beta)$ and the newly observed data contains $s$ successes and $f$ failures, the posterior is simply:
$$\theta \mid \text{data} \sim \text{Beta}(\alpha + s,\ \beta + f)$$
When seeking an uninformative prior, a flat uniform distribution can be problematic because it is not invariant under parameter transformations (e.g., a uniform prior on the standard deviation $\sigma$ is not uniform on the variance $\sigma^2$). The Jeffreys Prior solves this by deriving the prior directly from the Fisher Information of the likelihood function:
$$p(\theta) \propto \sqrt{\det I(\theta)}$$
This makes the prior invariant under reparameterization: applying the same rule in any parameterization of the model yields equivalent prior beliefs.
Historically, the difficulty of computing the normalizing constant $P(D)$ analytically restricted Bayesian methods to conjugate models. The advent of modern computing and Markov Chain Monte Carlo (MCMC) algorithms revolutionized Bayesian statistics, making inference feasible for a far wider class of models.
MCMC algorithms do not attempt to calculate the posterior distribution analytically. Instead, they draw a large number of correlated samples whose distribution converges to the posterior. By analyzing these samples (e.g., taking the mean, variance, or percentiles of the samples), we can estimate the properties of the posterior distribution.
The algorithm constructs a Markov Chain—a sequence of states where the next state depends only on the current state—designed such that its stationary distribution is exactly the target posterior distribution.
A specialized and highly effective MCMC algorithm for multi-dimensional parameter spaces is Gibbs Sampling. Instead of trying to update all parameters simultaneously, Gibbs sampling updates one parameter at a time by sampling from its conditional distribution, keeping all other parameters fixed at their current values.
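Let $\theta = (\theta_1, \dots, \theta_d)$ and let $D$ denote the observed data. A Gibbs step from iteration $t$ to $t + 1$ involves:
1. Sample $\theta_1^{(t+1)} \sim p\left(\theta_1 \mid \theta_2^{(t)}, \dots, \theta_d^{(t)}, D\right)$.
2. Sample $\theta_2^{(t+1)} \sim p\left(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \dots, \theta_d^{(t)}, D\right)$.
3. Continue in the same way through $\theta_d^{(t+1)} \sim p\left(\theta_d \mid \theta_1^{(t+1)}, \dots, \theta_{d-1}^{(t+1)}, D\right)$.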
This iterative process vastly simplifies the sampling problem because the one-dimensional conditional distributions are often well-known and easy to sample from, even when the joint multidimensional posterior is impossibly complex.
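As a concrete illustration, the sketch below runs a Gibbs sampler on a toy target: a standard bivariate normal with correlation $\rho$, standing in for a two-parameter posterior. For this target each full conditional is itself normal, $x \mid y \sim \mathcal{N}(\rho y,\ 1 - \rho^2)$ and symmetrically for $y$. The target, the function names, and the settings (burn-in, seed, sample count) are assumptions made for the example.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5  # sd of each full conditional
    x, y = 0.0, 0.0                    # arbitrary starting state
    draws = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if t >= burn_in:                 # discard the burn-in portion
            draws.append((x, y))
    return draws

def sample_correlation(draws):
    """Pearson correlation of the sampled (x, y) pairs."""
    n = len(draws)
    mx = sum(x for x, _ in draws) / n
    my = sum(y for _, y in draws) / n
    sxy = sum((x - mx) * (y - my) for x, y in draws)
    sxx = sum((x - mx) ** 2 for x, _ in draws)
    syy = sum((y - my) ** 2 for _, y in draws)
    return sxy / (sxx * syy) ** 0.5

draws = gibbs_bivariate_normal(rho=0.6, n_samples=20000)
est = sample_correlation(draws)
print(f"estimated correlation: {est:.3f} (target 0.6)")
```

The sampler never evaluates the joint density; it only needs to draw from the two one-dimensional conditionals, which is the practical appeal described above.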
You are a doctor administering a test for a rare genetic marker present in 0.1% (p=0.001) of the population. The test's sensitivity (true positive rate) is 99% (P(Positive|Marker) = 0.99). The test's specificity (true negative rate) is 98%, meaning the false positive rate is 2% (P(Positive|No Marker) = 0.02). A patient receives a positive test result. The patient immediately asks: 'What is the probability I actually have the marker?'
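The patient's question can be answered by a direct application of Bayes' Theorem with the Law of Total Probability in the denominator. The following sketch uses the figures stated in the example; the function name and its parameter names are illustrative.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(marker | positive test) via Bayes' theorem."""
    # Law of total probability: overall chance of a positive result.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = posterior_given_positive(prior=0.001, sensitivity=0.99,
                             false_positive_rate=0.02)
print(f"P(marker | positive) = {p:.4f}")
```

The result is roughly 4.7%: even with a highly sensitive test, the low base rate means most positive results come from the much larger group of people without the marker.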
Below is an illustration using a Beta prior, which is conjugate to the binomial likelihood, to model the sequential updating of beliefs about a coin’s hidden fairness parameter. Observe how the posterior from one experiment becomes the prior for the next.
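A minimal sketch of such an illustration follows. It assumes a uniform $\text{Beta}(1, 1)$ starting prior and three hypothetical batches of coin flips; the helper names and the data are invented for the example.

```python
def update_beta(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

alpha, beta = 1.0, 1.0                    # assumed uniform prior on the bias
experiments = [(7, 3), (4, 6), (12, 8)]   # hypothetical (heads, tails) batches

for heads, tails in experiments:
    # The posterior from this batch becomes the prior for the next one.
    alpha, beta = update_beta(alpha, beta, heads, tails)
    print(f"after {heads}H/{tails}T: Beta({alpha:.0f}, {beta:.0f}), "
          f"posterior mean = {beta_mean(alpha, beta):.3f}")

# Processing all data in one batch gives the same posterior.
batch = update_beta(1.0, 1.0, sum(h for h, _ in experiments),
                    sum(t for _, t in experiments))
```

Sequential and batch updating agree because the conjugate update only adds counts, so the order in which the data arrive does not matter.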
A Markov Chain is a mathematical system that undergoes transitions from one state to another on a state space. It is a stochastic process characterized by the Markov property: the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it.
Formally, a stochastic process $\{X_n\}_{n \ge 0}$ is a Markov chain if, for all $n \ge 0$ and any sequence of states $i_0, \dots, i_{n-1}, i, j$, the following equality holds:
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$
This fundamental property states that the entire history of the process is encapsulated in its current state $X_n$. This drastically simplifies the study of complex systems, reducing dependence on the whole past to a single-step conditional probability. Discrete and continuous-time variants form the backbone of modern stochastic modeling, encompassing applications ranging from simple queuing systems to complex financial models and molecular dynamics.
A Discrete-Time Markov Chain operates with a discrete time parameter $n \in \{0, 1, 2, \dots\}$. The set of possible values for the random variables $X_n$ forms a countable set $S$, called the state space. The probability of moving from state $i$ to state $j$ in one time step is given by the transition probability $p_{ij}$, defined as:
$$p_{ij} = P(X_{n+1} = j \mid X_n = i)$$
When these transition probabilities are independent of the time step $n$, the Markov chain is said to be time-homogeneous. We will focus exclusively on time-homogeneous chains, as their structure permits robust long-term behavioral analysis.
For a state space containing a finite number of states (or countably infinite), the one-step transition probabilities are arranged in a matrix $\mathbf{P}$, called the transition matrix:
$$\mathbf{P} = (p_{ij})_{i, j \in S}$$
This matrix has two vital properties:
1. Non-negativity: $p_{ij} \ge 0$ for all $i, j \in S$.
2. Unit row sums: $\sum_{j \in S} p_{ij} = 1$ for every $i \in S$.
Every row describes a probability distribution, making $\mathbf{P}$ a stochastic matrix. If the initial distribution of the chain is a row vector $\pi^{(0)}$ (where $\pi^{(0)}_i = P(X_0 = i)$), the distribution after one step is $\pi^{(1)} = \pi^{(0)}\mathbf{P}$. By induction, the probability distribution of the state after $n$ steps is given by $\pi^{(n)} = \pi^{(0)}\mathbf{P}^n$. The matrix multiplication computes the sum over all possible paths of length $n$ between any two states, weighting each path by its probability.
The $n$-step transition probability $p_{ij}^{(n)}$ is the probability that a process currently in state $i$ will be in state $j$ exactly $n$ steps later:
$$p_{ij}^{(n)} = P(X_{m+n} = j \mid X_m = i)$$
For $n = 1$, $p_{ij}^{(1)} = p_{ij}$. For $n = 0$, $p_{ij}^{(0)}$ is $1$ if $i = j$ and $0$ otherwise.
The computation of $n$-step transition probabilities is governed by the Chapman-Kolmogorov equations. These equations provide a rigorous method for computing the probability of moving from state $i$ to state $j$ in $m + n$ steps by conditioning on the intermediate state $k$ attained after $m$ steps:
$$p_{ij}^{(m+n)} = \sum_{k \in S} p_{ik}^{(m)}\,p_{kj}^{(n)}$$
In matrix notation, this corresponds exactly to the multiplication of powers of the transition matrix: Let $\mathbf{P}^{(n)}$ be the matrix whose entries are $p_{ij}^{(n)}$. Then $\mathbf{P}^{(m+n)} = \mathbf{P}^{(m)}\mathbf{P}^{(n)}$. Consequently, $\mathbf{P}^{(n)} = \mathbf{P}^n$. The equation states that the transition matrix for $n$ steps is the $n$-th power of the 1-step transition matrix.
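The matrix form of the Chapman-Kolmogorov equations can be verified numerically. The sketch below uses a hypothetical two-state transition matrix and plain Python lists; the helper names `mat_mul` and `mat_pow` are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, n):
    """Return the n-th power of p, with p^0 equal to the identity."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

# Hypothetical two-state chain (values chosen only for illustration).
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2, P3, P5 = mat_pow(P, 2), mat_pow(P, 3), mat_pow(P, 5)
ck = mat_mul(P2, P3)  # Chapman-Kolmogorov: P^(2+3) = P^2 P^3

# Distribution after 5 steps when the chain starts in state 0.
pi0 = [1.0, 0.0]
pi5 = [sum(pi0[i] * P5[i][j] for i in range(2)) for j in range(2)]
print("P^2 =", P2)
print("distribution after 5 steps:", pi5)
```

Every power of a stochastic matrix is again stochastic, so the rows of `P5` and the vector `pi5` each sum to one.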
The long-term behavior of a Markov chain is heavily dependent on the communication structure and the topological arrangement of its state space.
State $j$ is accessible from state $i$ if $p_{ij}^{(n)} > 0$ for some $n \ge 0$, and two states communicate if each is accessible from the other. Communication is an equivalence relation (it is reflexive, symmetric, and transitive), which partitions the state space into disjoint communication classes. If a Markov chain has only one communication class, meaning every state is accessible from every other state, it is called irreducible.
Let $f_{ij}^{(n)}$ denote the probability that the first transition into state $j$ (starting from $i$) occurs exactly at step $n$:
$$f_{ij}^{(n)} = P(X_n = j,\ X_k \ne j \text{ for } 1 \le k < n \mid X_0 = i)$$
Let $f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)}$ be the probability of ever reaching state $j$ given that the chain started in state $i$. The parameter $f_{ii}$ is therefore the probability of ever returning to state $i$ given that the chain started in state $i$.
A state $i$ is recurrent if $f_{ii} = 1$ and transient if $f_{ii} < 1$. Equivalently, a state is recurrent if and only if the expected number of returns to that state is infinite: $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$. It is transient if and only if $\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$. Every finite Markov chain has at least one recurrent state, though an infinite state space may consist entirely of transient states (e.g., the simple symmetric random walk on $\mathbb{Z}^3$).
The period $d(i)$ of a state $i$ is defined as the greatest common divisor (GCD) of the set of numbers of steps for which a return to state $i$ is possible: $d(i) = \gcd\{n \ge 1 : p_{ii}^{(n)} > 0\}$. A state with $d(i) = 1$ is called aperiodic.
Periodicity is a class property: all states in the same communication class have the same period. In particular, all states of an irreducible chain share a single period.
A state $i$ is positive recurrent if it is recurrent and its expected return time $m_i$ is finite: $m_i = \sum_{n=1}^{\infty} n\,f_{ii}^{(n)} < \infty$. If a state is positive recurrent and aperiodic, it is classified as ergodic. A Markov chain is defined as ergodic if all its states are ergodic. Ergodicity is the property guaranteeing that a system will eventually “forget” its initial state and settle into a stable equilibrium.
When an ergodic Markov chain runs for a sufficiently long time, its distribution approaches a steady state, completely independent of the starting state. This limiting distribution is called the stationary distribution, denoted by a row vector $\pi$.
A probability distribution $\pi$ is a stationary distribution if:
$$\pi = \pi\mathbf{P}, \qquad \sum_{i \in S} \pi_i = 1, \qquad \pi_i \ge 0$$
The condition $\pi = \pi\mathbf{P}$ indicates that if you start the chain by drawing the initial state according to the distribution $\pi$, the state distribution at any subsequent step remains exactly $\pi$.
For an irreducible, aperiodic, and positive recurrent (i.e., ergodic) Markov chain, a unique stationary distribution exists, and the fundamental limit theorem applies:
$$\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j \quad \text{for all } i, j \in S$$
Furthermore, the stationary probability is the reciprocal of the expected return time: $\pi_j = 1 / m_j$, where $m_j$ is the expected number of steps to return to state $j$. This provides a profound link between the limits of transition probabilities and the stochastic temporal behavior of the chain.
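For a small ergodic chain, the stationary distribution can be approximated by repeatedly applying $\pi \leftarrow \pi\mathbf{P}$ until the vector stops changing. The sketch below applies this power iteration to a hypothetical two-state matrix; the function names, tolerance, and iteration cap are illustrative choices.

```python
def step(pi, P):
    """One step of the chain's distribution: returns pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, tol=1e-12, max_iter=10_000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # any starting distribution works for an ergodic chain
    for _ in range(max_iter):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")

# Hypothetical ergodic two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary(P)
return_times = [1.0 / p for p in pi]  # expected return times m_j = 1 / pi_j
print("stationary distribution:", pi)
print("expected return times:", return_times)
```

For this matrix the balance condition $0.1\,\pi_0 = 0.5\,\pi_1$ gives $\pi = (5/6, 1/6)$, so the chain returns to state 1 once every 6 steps on average.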
A gambler plays a fair game where they win \$1 with probability $0.5$ and lose \$1 with probability $0.5$ at each step. The gambler starts with $\$a$ and the game ends when their capital reaches $0$ (ruin) or a predetermined target value $\$N$ (success). This process can be modeled as a discrete-time Markov chain with state space $S = \{0, 1, 2, \dots, N\}$ where states $0$ and $N$ represent the termination of the game.
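Let $h(a)$ be the probability of reaching $N$ before $0$ when starting from $a$. Conditioning on the first step gives $h(a) = 0.5\,h(a+1) + 0.5\,h(a-1)$ with $h(0) = 0$ and $h(N) = 1$, whose solution for the fair game is $h(a) = a/N$. The sketch below solves these equations numerically by repeated sweeps and compares against that closed form; the function name, the choice $N = 10$, and the sweep count are assumptions for the example.

```python
def success_probabilities(N, p=0.5, sweeps=5000):
    """Solve h(i) = p*h(i+1) + (1-p)*h(i-1), h(0)=0, h(N)=1, by iteration."""
    h = [0.0] * (N + 1)
    h[N] = 1.0  # boundary conditions: ruin at 0, success at N
    for _ in range(sweeps):
        for i in range(1, N):
            # First-step analysis, updated in place (Gauss-Seidel sweep).
            h[i] = p * h[i + 1] + (1 - p) * h[i - 1]
    return h

N = 10
h = success_probabilities(N)
for a in (1, 3, 5, 9):
    print(f"start with ${a}: P(success) = {h[a]:.4f}, a/N = {a / N:.4f}")
```

The success probability grows linearly in the starting capital, so the ruin probability from $a$ is $1 - a/N$.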
While discrete-time Markov chains describe systems transitioning at fixed, discrete time steps, many real-world stochastic processes change state at random, continuously distributed times. Such processes are modeled as Continuous-Time Markov Chains (CTMC).
A stochastic process $\{X(t)\}_{t \ge 0}$ defined on a discrete state space $S$ is a CTMC if it satisfies the continuous-time Markov property: for all $s, t \ge 0$ and all states $i$, $j$, and $x(u)$,
$$P(X(t+s) = j \mid X(s) = i,\ X(u) = x(u),\ 0 \le u < s) = P(X(t+s) = j \mid X(s) = i)$$
For a time-homogeneous CTMC, the transition probability only depends on the length of the time interval $t$:
$$p_{ij}(t) = P(X(t+s) = j \mid X(s) = i)$$
When a CTMC enters a state $i$, the amount of time it spends in that state before making a transition, called the holding time or sojourn time, follows an exponential distribution with a rate parameter $v_i$ (often denoted $q_i$ or $\lambda_i$).
Why an exponential distribution? The exponential distribution is the only strictly continuous probability distribution possessing the memoryless property. The Markov assumption fundamentally requires that the time already spent in a state yields zero new information about the remaining time to be spent in that state.
When the process leaves state $i$, the probability it transitions specifically to state $j$ is independent of the holding time and is denoted by the transition probability $p_{ij}$, where $p_{ii} = 0$ and $\sum_{j \ne i} p_{ij} = 1$.
Equivalently, one specifies the unnormalized transition rates $q_{ij}$, defined as the rate at which the process transitions from state $i$ to state $j$, where $v_i$ is the exponential holding-time rate of state $i$:
$$q_{ij} = v_i\,p_{ij}, \qquad i \ne j$$
These transition rates are compactly arranged in the generator matrix (or infinitesimal generator) $\mathbf{Q}$, whose scalar elements are given by:
$$Q_{ij} = \begin{cases} q_{ij} & \text{if } i \ne j \\ -v_i & \text{if } i = j \end{cases}$$
Because $v_i = \sum_{j \ne i} q_{ij}$, the row sums of the generator matrix are $0$ across all rows:
$$\sum_{j \in S} Q_{ij} = 0 \quad \text{for every } i \in S$$
In discrete time, multi-step transition matrices are obtained as algebraic powers $\mathbf{P}^n$. In continuous time, the transition matrices $\mathbf{P}(t) = (p_{ij}(t))$ satisfy systems of coupled linear differential equations instead of algebraic relations, linking the finite-time transition probabilities to the instantaneous transition rates encoded in the matrix $\mathbf{Q}$.
Kolmogorov Backward Equations: $\mathbf{P}'(t) = \mathbf{Q}\,\mathbf{P}(t)$. Component-wise, this expands to $p_{ij}'(t) = \sum_{k \ne i} q_{ik}\,p_{kj}(t) - v_i\,p_{ij}(t)$, where $v_i = -Q_{ii}$ is the total rate of leaving state $i$. These differential equations calculate probabilities by conditioning on the first transition out of the initial starting state.
Kolmogorov Forward Equations: $\mathbf{P}'(t) = \mathbf{P}(t)\,\mathbf{Q}$. Component-wise, this equates to $p_{ij}'(t) = \sum_{k \ne j} p_{ik}(t)\,q_{kj} - v_j\,p_{ij}(t)$. The forward equations construct the probability distribution by conditioning on the final transition immediately preceding time $t$.
Provided sufficient regularity conditions (which automatically hold in all finite state spaces), the solution to these initial value problems (with initial condition $\mathbf{P}(0) = \mathbf{I}$, the identity matrix) is given by the matrix exponential function:
$$\mathbf{P}(t) = e^{\mathbf{Q}t} = \sum_{n=0}^{\infty} \frac{(\mathbf{Q}t)^n}{n!}$$
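For a two-state chain the matrix exponential has a known closed form, which makes it a convenient check. The sketch below approximates $e^{\mathbf{Q}t}$ by truncating the power series and compares one entry with $p_{00}(t) = \frac{b}{a+b} + \frac{a}{a+b}e^{-(a+b)t}$, where $a$ and $b$ are hypothetical rates for the transitions $0 \to 1$ and $1 \to 0$. The function names and the truncation length are illustrative; a truncated series is adequate only when the entries of $\mathbf{Q}t$ are small, and library routines such as `scipy.linalg.expm` are the usual choice in practice.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(Q, t, terms=40):
    """Approximate exp(Q t) by the truncated series sum_{k<terms} (Q t)^k / k!."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term (Q t)^k / k!, starts at k = 0
    for k in range(1, terms):
        term = mat_mul(term, Q)
        term = [[x * t / k for x in row] for row in term]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result

a, b = 2.0, 1.0           # hypothetical rates: 0 -> 1 at rate a, 1 -> 0 at rate b
Q = [[-a, a],
     [b, -b]]             # generator matrix; each row sums to zero
t = 0.5

Pt = expm_series(Q, t)
closed_p00 = b / (a + b) + a / (a + b) * math.exp(-(a + b) * t)
print(f"series p00(t) = {Pt[0][0]:.6f}, closed form = {closed_p00:.6f}")
```

As $t$ grows, $p_{00}(t)$ tends to $b/(a+b)$, the long-run fraction of time spent in state 0.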
As in DTMCs, an irreducible, positive-recurrent continuous-time Markov chain possesses a unique stationary distribution $\pi$, which gives the long-run proportion of time the process spends in each state.
However, the discrete-time condition $\pi P = \pi$ is replaced by a condition of zero net rate of probability flux:
$$\pi Q = 0.$$
Here, $\pi$ remains a normalized probability vector with $\sum_i \pi_i = 1$. The matrix equation is a set of global balance equations: for each state $j$, the total probability flux leaving $j$, $\pi_j \sum_{i \neq j} Q_{ji}$, equals the total probability flux entering $j$ from all other states, $\sum_{i \neq j} \pi_i Q_{ij}$.
This flux balance principle is foundational to queuing theory, stochastic chemical reaction networks, and biological population models, where it is used to evaluate long-run performance measures of continuous-time systems.
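The balance equations can be checked numerically on a small example. The sketch below builds the generator of a four-state birth-death chain (a truncated single-server queue) and verifies both $\pi Q = 0$ and the flux-out equals flux-in statement. The rates `lam` and `mu` and the number of states are hypothetical; the geometric form of `pi` is specific to birth-death chains and is not a general recipe.

```python
lam, mu = 1.0, 2.0      # hypothetical arrival (up) and service (down) rates
n_states = 4

# Generator of a birth-death chain on states 0, 1, 2, 3
Q = [[0.0] * n_states for _ in range(n_states)]
for i in range(n_states):
    if i + 1 < n_states:
        Q[i][i + 1] = lam
    if i - 1 >= 0:
        Q[i][i - 1] = mu
    Q[i][i] = -sum(Q[i])            # diagonal makes the row sum to zero

# For a birth-death chain, pi_{i+1} = pi_i * lam / mu solves the balance equations
weights = [(lam / mu) ** i for i in range(n_states)]
pi = [w / sum(weights) for w in weights]

# pi Q should be the zero vector
pi_Q = [sum(pi[i] * Q[i][j] for i in range(n_states)) for j in range(n_states)]

# Global balance for each state j: flux out of j equals flux into j
flux_out = [pi[j] * (-Q[j][j]) for j in range(n_states)]
flux_in = [sum(pi[i] * Q[i][j] for i in range(n_states) if i != j)
           for j in range(n_states)]
print(pi)
print(pi_Q)
```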
A stochastic process is a collection of random variables defined on a common probability space $(\Omega, \mathcal{F}, P)$ and indexed by a totally ordered set $T$ (usually representing time). Formally, a stochastic process is written as $\{X_t\}_{t \in T}$, where for each $t \in T$, $X_t$ is an $\mathcal{F}$-measurable function $X_t : \Omega \to S$ into a measurable state space $(S, \mathcal{S})$.
When $T = \mathbb{N}$ or $T = \mathbb{Z}$, the process is a discrete-time stochastic process. If $T = [0, \infty)$ or $T = \mathbb{R}$, it is a continuous-time stochastic process. The state space $S$ determines whether the process is discrete-state (e.g., integer values) or continuous-state (e.g., real-valued).
To rigorously describe the evolution of a stochastic process, it is essential to capture the accumulation of information over time. This is formalized by a filtration $\{\mathcal{F}_t\}_{t \in T}$, which is an increasing family of sub-$\sigma$-algebras of $\mathcal{F}$. That is, $\mathcal{F}_s \subseteq \mathcal{F}_t \subseteq \mathcal{F}$ for all $s \le t$.
The intuitive interpretation of $\mathcal{F}_t$ is the “history” or the “available information” up to time $t$. A stochastic process $\{X_t\}$ is said to be adapted to the filtration $\{\mathcal{F}_t\}$ if, for every $t \in T$, the random variable $X_t$ is $\mathcal{F}_t$-measurable. This means that once the information in $\mathcal{F}_t$ has been observed, the value of $X_t$ is known.
Martingales constitute one of the most fundamental classes of stochastic processes, generalizing the concept of a “fair game” where knowledge of past events never helps predict expected future winnings.
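Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P)$ be a filtered probability space. A real-valued stochastic process $\{X_t\}$ is a martingale with respect to the filtration $\{\mathcal{F}_t\}$ and probability measure $P$ if it satisfies the following three conditions: (1) $\{X_t\}$ is adapted to $\{\mathcal{F}_t\}$; (2) $E[|X_t|] < \infty$ for all $t$; (3) $E[X_t \mid \mathcal{F}_s] = X_s$ for all $s \le t$.
If the equality in the third condition is replaced with $\le$ (or $\ge$), the process is termed a supermartingale (or submartingale). In a supermartingale, the expected future value is less than or equal to the current value (a losing game), whereas in a submartingale, it is greater than or equal to the current value (a winning game).
Consider a simple symmetric random walk $S_n = \sum_{i=1}^{n} \xi_i$ (with $S_0 = 0$), where the increments $\xi_i$ are independent, identically distributed (i.i.d.) random variables with $P(\xi_i = 1) = 1/2$ and $P(\xi_i = -1) = 1/2$. Let $\mathcal{F}_n = \sigma(\xi_1, \dots, \xi_n)$ be the natural filtration. Check that $S_n$ is a martingale:
$$E[S_{n+1} \mid \mathcal{F}_n] = E[S_n + \xi_{n+1} \mid \mathcal{F}_n] = E[S_n \mid \mathcal{F}_n] + E[\xi_{n+1} \mid \mathcal{F}_n].$$
Since $S_n$ is $\mathcal{F}_n$-measurable, $E[S_n \mid \mathcal{F}_n] = S_n$. Since $\xi_{n+1}$ is independent of $\mathcal{F}_n$, $E[\xi_{n+1} \mid \mathcal{F}_n] = E[\xi_{n+1}] = 0$. Thus, $E[S_{n+1} \mid \mathcal{F}_n] = S_n$, proving $S_n$ is a discrete-time martingale (adaptedness and integrability are immediate, since $|S_n| \le n$).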
In many practical and theoretical contexts, we are interested in evaluating models at random times (e.g., the time a stock hits a certain price or the time a gambler goes bankrupt). This gives rise to the concept of a stopping time.
A random variable $\tau : \Omega \to T \cup \{\infty\}$ is a stopping time (or Markov time) with respect to a filtration $\{\mathcal{F}_t\}$ if, for every $t \in T$, the event $\{\tau \le t\} \in \mathcal{F}_t$. Intuitively, at any given time $t$, one can determine whether the stopping time has occurred based solely on the information available up to time $t$. A stopping time cannot look into the future.
For a stochastic process $\{X_t\}$, the first hitting time of a Borel set $B$ is defined as:
$$\tau_B = \inf\{t \ge 0 : X_t \in B\}.$$
When the process is adapted with continuous paths and $B$ is a closed set, $\tau_B$ is guaranteed to be a stopping time; for paths that are only right-continuous, additional conditions on the filtration are needed.
Does evaluating a martingale at a stopping time preserve its expected value? In general, it might not. However, Doob’s Optional Stopping Theorem establishes conditions under which the expected value at the stopping time equals the initial expected value, i.e., $E[X_\tau] = E[X_0]$.
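Let $\{X_n\}$ be a discrete-time martingale and $\tau$ be a stopping time with respect to the filtration $\{\mathcal{F}_n\}$. Then $E[X_\tau] = E[X_0]$ holds if any of the following conditions is satisfied: (1) $\tau$ is almost surely bounded, i.e., $\tau \le c$ for some constant $c$; (2) $E[\tau] < \infty$ and the increments are bounded, $|X_{n+1} - X_n| \le c$ almost surely; (3) $\tau$ is almost surely finite and the stopped process is bounded, $|X_{n \wedge \tau}| \le c$ for all $n$.
This theorem shows that no betting strategy can produce a systematic expected gain in a fair game under bounded resources, which is why the classical “Martingale betting strategy” fails in practice.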
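A quick simulation can make optional stopping concrete. For a symmetric random walk started at 0 and stopped when it first reaches $-a$ or $b$, the stopped walk is bounded, so $E[S_\tau] = 0$, which forces $P(S_\tau = b) = a/(a+b)$. The sketch below estimates that probability by Monte Carlo; the barrier levels, trial count, and seed are hypothetical choices.

```python
import random

def hits_upper_first(a, b, rng):
    """Run a symmetric +/-1 walk from 0 until it reaches -a or b."""
    s = 0
    while -a < s < b:
        s += 1 if rng.random() < 0.5 else -1
    return s == b

rng = random.Random(0)          # fixed seed for reproducibility
a, b = 3, 7                     # hypothetical lower and upper barriers
trials = 20000

wins = sum(hits_upper_first(a, b, rng) for _ in range(trials))
estimate = wins / trials
predicted = a / (a + b)         # value implied by E[S_tau] = 0

# Sample mean of the stopped value; should be close to E[S_0] = 0
mean_stopped_value = (wins * b - (trials - wins) * a) / trials
print(estimate, predicted, mean_stopped_value)
```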
The Wiener process (or standard Brownian motion) is the fundamental continuous-time analog of the random walk. It drives modern financial theory, statistical mechanics, and continuous-state probability.
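A standard one-dimensional Wiener process $\{W_t\}_{t \ge 0}$ is a stochastic process characterized by the following properties: (1) $W_0 = 0$ almost surely; (2) increments over disjoint time intervals are independent; (3) $W_t - W_s \sim N(0, t - s)$ for $0 \le s < t$; (4) the paths $t \mapsto W_t$ are continuous almost surely.
Despite being continuous everywhere, the path of a Brownian motion is almost surely differentiable nowhere. Its quadratic variation over the interval $[0, t]$ is exactly $t$. That is, $[W, W]_t = t$. This non-zero quadratic variation is the reason ordinary (Newton-Leibniz) calculus fails for such processes and a distinct stochastic calculus is needed.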
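The quadratic variation property can be seen in a discretized simulation. The sketch below draws Gaussian increments on a fine grid over $[0, T]$ and sums their squares, which should be close to $T$, while the sum of absolute increments grows without bound as the grid is refined. The horizon, grid size, and seed are hypothetical choices, and a finite grid only approximates the continuous-time statement.

```python
import math
import random

rng = random.Random(42)         # fixed seed for reproducibility
T = 1.0                         # time horizon
n = 100_000                     # number of grid steps
dt = T / n

# Brownian increments W_{t+dt} - W_t ~ N(0, dt)
increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]

quadratic_variation = sum(dw * dw for dw in increments)   # approaches T
total_variation = sum(abs(dw) for dw in increments)       # grows like sqrt(n)
print(quadratic_variation, total_variation)
```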
Because Brownian motion has non-zero quadratic variation, the standard chain rule of differential calculus does not hold. Instead, we use Itô’s Calculus, anchored by Itô’s Lemma.
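Let $X_t$ be an Itô drift-diffusion process satisfying the stochastic differential equation:
$$dX_t = \mu_t \, dt + \sigma_t \, dW_t,$$
where $W_t$ is a standard Wiener process, and $\mu_t$ and $\sigma_t$ are adapted processes. Let $f(t, x)$ be a scalar function that is twice continuously differentiable in $x$ and once in $t$ (i.e., $f \in C^{1,2}$).
By Itô’s Lemma, the process $Y_t = f(t, X_t)$ is also an Itô process whose differential is given by:
$$df(t, X_t) = \left( \frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{1}{2} \sigma_t^2 \frac{\partial^2 f}{\partial x^2} \right) dt + \sigma_t \frac{\partial f}{\partial x} \, dW_t.$$
The extra term $\frac{1}{2} \sigma_t^2 \, \partial^2 f / \partial x^2$ reflects the quadratic variation of $W_t$, often formalized by the heuristic multiplication rules:
$$(dt)^2 = 0, \qquad dt \, dW_t = 0, \qquad (dW_t)^2 = dt.$$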
In quantitative finance, the standard model for a stock price $S_t$ assumes the proportional return $dS_t / S_t$ undergoes constant drift and volatility, modeled by the stochastic differential equation: $dS_t = \mu S_t dt + \sigma S_t dW_t$. To find the distribution of $S_t$, we need to solve this. Applying standard ODE techniques fails because of the $dW_t$ term. We must use Itô's lemma to transform the equation, commonly via the natural logarithm function.
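The transformation can be sketched as follows, taking $f(S) = \ln S$ so that $f'(S) = 1/S$ and $f''(S) = -1/S^2$, and using the heuristic rule $(dW_t)^2 = dt$. The initial value $S_0 > 0$ is assumed known.

$$d(\ln S_t) = \frac{1}{S_t}\, dS_t - \frac{1}{2 S_t^2}\, (dS_t)^2 = \left(\mu - \tfrac{1}{2}\sigma^2\right) dt + \sigma \, dW_t$$

$$\ln S_t = \ln S_0 + \left(\mu - \tfrac{1}{2}\sigma^2\right) t + \sigma W_t \quad \Longrightarrow \quad S_t = S_0 \exp\!\left( \left(\mu - \tfrac{1}{2}\sigma^2\right) t + \sigma W_t \right)$$

Since $W_t \sim N(0, t)$, it follows that $\ln S_t$ is normally distributed with mean $\ln S_0 + (\mu - \sigma^2/2)\,t$ and variance $\sigma^2 t$, so $S_t$ is log-normal.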
A Stochastic Differential Equation relates the continuous-time dynamics of a stochastic process to a deterministic drift part and a stochastic diffusion part. The general form is:
$$dX_t = \mu(X_t, t) \, dt + \sigma(X_t, t) \, dW_t.$$
This equation is shorthand for the integral equation:
$$X_t = X_0 + \int_0^t \mu(X_s, s) \, ds + \int_0^t \sigma(X_s, s) \, dW_s,$$
where the first integral is an ordinary Lebesgue (or Riemann) integral and the second is an Itô stochastic integral.
Much like Picard–Lindelöf for deterministic ODEs, there are conditions for the strong existence and uniqueness of solutions to SDEs. Under Lipschitz continuity and linear growth conditions:
$$|\mu(x, t) - \mu(y, t)| + |\sigma(x, t) - \sigma(y, t)| \le K |x - y|, \qquad |\mu(x, t)| + |\sigma(x, t)| \le C (1 + |x|),$$
for some constants $K$, $C$ and all $x, y \in \mathbb{R}$, $t \in [0, T]$, there exists a unique strong solution to the SDE.
The analysis, simulation, and integration of SDEs form the bedrock of continuously evolving systems subject to noise across physics, mathematical biology, and finance.
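As a minimal sketch of SDE simulation, the code below applies the Euler-Maruyama scheme (a standard discretization, not one named in the text above) to geometric Brownian motion $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$ and compares the endpoint with the exact solution $S_0 \exp((\mu - \sigma^2/2)T + \sigma W_T)$ driven by the same noise. All parameter values, the step count, and the function name are hypothetical.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, T, n, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW on [0, T]; also return W_T."""
    dt = T / n
    x, w = x0, 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment over one step
        x += drift(x) * dt + diffusion(x) * dw
        w += dw                                  # accumulate the driving path
    return x, w

rng = random.Random(1)                           # fixed seed for reproducibility
m, s, s0, T = 0.05, 0.2, 100.0, 1.0              # hypothetical drift, volatility, start, horizon

x_T, w_T = euler_maruyama(lambda x: m * x, lambda x: s * x, s0, T, 10_000, rng)

# Exact GBM value for the same Brownian endpoint
exact = s0 * math.exp((m - 0.5 * s * s) * T + s * w_T)
relative_error = abs(x_T - exact) / exact
print(x_T, exact, relative_error)
```

Refining the grid (larger `n`) should shrink the gap between the numerical and exact endpoints on the same path.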
Regression analysis is a statistical method for estimating the relationships among variables. It focuses primarily on the relationship between a dependent variable (often called the response or outcome variable) and one or more independent variables (often called predictors, covariates, or explanatory variables). The objective is to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
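The most fundamental form of regression analysis is simple linear regression, which models the relationship between a single independent variable $X$ and a dependent variable $Y$. The true relationship is postulated to be a linear function of $X$ plus a stochastic error term.
The population model is defined as:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$
where $Y_i$ is the response for the $i$-th observation, $X_i$ is the corresponding predictor value, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_i$ is the random error term.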
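For the standard estimation techniques to be valid and possess desirable statistical properties, certain assumptions regarding the error term $\epsilon_i$ must hold: the errors have zero mean conditional on $X$, have constant variance $\sigma^2$ (homoscedasticity), are uncorrelated across observations, and, for exact finite-sample inference, are normally distributed.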
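The most common method for estimating the unknown parameters $\beta_0$ and $\beta_1$ is Ordinary Least Squares (OLS). The OLS method chooses the estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of the squared residuals (SSR).
The residual for the $i$-th observation is the difference between the observed $Y_i$ and the predicted value $\hat{Y}_i$:
$$e_i = Y_i - \hat{Y}_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i.$$
The Sum of Squared Residuals (SSR) is:
$$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2.$$
To minimize $SSR$, we take the partial derivatives with respect to $\hat\beta_0$ and $\hat\beta_1$ and set them to zero:
$$\frac{\partial SSR}{\partial \hat\beta_0} = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0, \qquad \frac{\partial SSR}{\partial \hat\beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0.$$
Solving these normal equations yields the OLS estimators:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X},$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively.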
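The closed-form OLS estimators for simple linear regression (slope as the ratio of the centered cross-product sum to the centered sum of squares, intercept from the sample means) can be sketched in a few lines. The data below are hypothetical, and the function name is illustrative. The last two lines check the two normal equations: the residuals sum to zero and are orthogonal to the predictor.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = ols_simple(xs, ys)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(xs, ys)]

sum_resid = sum(residuals)                                  # first normal equation
sum_x_resid = sum(xi * e for xi, e in zip(xs, residuals))   # second normal equation
print(intercept, slope, sum_resid, sum_x_resid)
```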
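Multiple linear regression extends the simple linear model to include two or more independent variables. The model with $k$ predictors is written as:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \epsilon_i.$$
Because writing out summations becomes unwieldy, multiple regression is almost universally represented using matrix algebra.
Let $\mathbf{y}$ be an $n \times 1$ vector of observations of the dependent variable, $\mathbf{X}$ be an $n \times (k+1)$ matrix (the design matrix) where the first column is typically all 1s (for the intercept), $\boldsymbol\beta$ be a $(k+1) \times 1$ vector of parameters, and $\boldsymbol\epsilon$ be an $n \times 1$ vector of errors, so that $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$.
The OLS estimator vector $\hat{\boldsymbol\beta}$ minimizes $(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top (\mathbf{y} - \mathbf{X}\boldsymbol\beta)$. Expanding this and taking the derivative with respect to the vector $\boldsymbol\beta$ yields the matrix formulation of the normal equations:
$$\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{X}^\top \mathbf{y}.$$
Assuming $\mathbf{X}^\top \mathbf{X}$ is invertible (which requires no perfect multicollinearity among the predictors), the OLS estimator is:
$$\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$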
The Gauss-Markov theorem justifies the use of the OLS estimator. It states that under the classical linear regression model assumptions (linearity, strict exogeneity/independence, no perfect multicollinearity, and homoscedasticity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
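The variance-covariance matrix of the OLS estimator is:
$$\text{Var}(\hat{\boldsymbol\beta} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1},$$
where $\sigma^2$ is the variance of the error term, typically estimated by $\hat\sigma^2 = \mathbf{e}^\top \mathbf{e} / (n - k - 1)$, with $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}$ being the vector of residuals and $k$ the number of predictors.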
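To assess how well the model fits the data, we decompose the total variation in the dependent variable into explained and unexplained components: the total sum of squares $TSS = \sum_i (Y_i - \bar{Y})^2$, the explained sum of squares $ESS = \sum_i (\hat{Y}_i - \bar{Y})^2$, and the sum of squared residuals $SSR = \sum_i (Y_i - \hat{Y}_i)^2$.
The relationship is $TSS = ESS + SSR$ (for a model that includes an intercept).
The $R^2$ statistic, $R^2 = ESS / TSS = 1 - SSR / TSS$, represents the proportion of variance in the dependent variable explained by the independent variables in the model.
While $0 \le R^2 \le 1$, adding more predictors to a model will mechanically never decrease $R^2$, even if the predictors are irrelevant. To account for this, the Adjusted $R^2$ penalizes models for adding variables that do not sufficiently improve the fit. With $n$ observations and $k$ predictors:
$$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}.$$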
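Under the assumption that $\boldsymbol\epsilon \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, the OLS estimators are normally distributed:
$$\hat{\boldsymbol\beta} \sim N\!\left(\boldsymbol\beta, \; \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\right).$$
To test the hypothesis that a single independent variable $X_j$ has no effect on $Y$ (i.e., $H_0: \beta_j = 0$), a t-statistic is used:
$$t = \frac{\hat\beta_j}{SE(\hat\beta_j)},$$
where $SE(\hat\beta_j)$ is the standard error of the estimate, found from the square root of the $j$-th diagonal element of the estimated variance-covariance matrix $\hat\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$. Under the null hypothesis, this statistic follows a Student’s t-distribution with $n - k - 1$ degrees of freedom, where $k$ is the number of predictors.
To test the joint hypothesis that all slope coefficients (excluding the intercept) are simultaneously zero ($H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$), an F-statistic is constructed from the explained sum of squares ($ESS$) and the sum of squared residuals ($SSR$):
$$F = \frac{ESS / k}{SSR / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}.$$
Under the null hypothesis, this follows an F-distribution with $(k, \, n - k - 1)$ degrees of freedom. A large F-statistic provides evidence against the null hypothesis, indicating that at least one predictor variable is significantly related to the response variable.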
A data scientist constructs a multiple linear regression model to predict the price of houses ($Y$, in thousands of dollars) based on square footage ($X_1$), age of the house ($X_2$, in years), and distance to the city center ($X_3$, in miles). The estimated model is $\hat{Y} = 150 + 0.2X_1 - 1.5X_2 - 5.0X_3$. The $R^2$ is 0.75, the Adjusted $R^2$ is 0.74, and the sample size is $n=100$. The standard error for $\hat\beta_2$ is $0.5$.
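The quantities implied by this example can be worked out directly from the reported numbers. The sketch below computes the t-statistic for the age coefficient, the residual degrees of freedom, the adjusted $R^2$, and the overall F-statistic via its $R^2$ form, and evaluates the fitted equation for one house. The house characteristics passed to `predict` are hypothetical, and the comparison value of about 1.98 for the two-sided 5% t critical value is an approximation.

```python
n, k = 100, 3                    # sample size and number of predictors
r2 = 0.75
beta2_hat, se_beta2 = -1.5, 0.5  # age coefficient and its standard error

t_stat = beta2_hat / se_beta2                       # test of H0: beta_2 = 0
df_resid = n - k - 1                                # residual degrees of freedom
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid          # adjusted R^2
f_stat = (r2 / k) / ((1 - r2) / df_resid)           # overall F-test, R^2 form

def predict(sqft, age, miles):
    """Fitted price in thousands of dollars from the estimated equation."""
    return 150 + 0.2 * sqft - 1.5 * age - 5.0 * miles

price = predict(2000, 10, 5)     # hypothetical house: 2000 sq ft, 10 years, 5 miles

print(t_stat, df_resid, adj_r2, f_stat, price)
# |t| = 3 exceeds roughly 1.98, so age is significant at the 5% level
```

The computed adjusted $R^2$ rounds to the reported 0.74, which is a useful consistency check on the example's numbers.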
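Estimation is only part of the process; structural validation ensures the model assumptions hold. Analyzing the residuals ($e_i = Y_i - \hat{Y}_i$) is the primary tool for diagnostics.
One such diagnostic is Cook’s distance, $D_i = \sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2 / \left((k+1)\,\hat\sigma^2\right)$, where $\hat{Y}_{j(i)}$ is the predicted value of the $j$-th observation when the model is refitted without the $i$-th observation. A high Cook’s distance indicates a highly influential data point.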
Regression analysis serves as the foundational mathematical bedrock for predictive modeling and causal inference, bridging classical statistics to modern machine learning applications.
Analysis of Variance (ANOVA) is a collection of statistical models and their associated estimation procedures used to analyze the differences among group means in a sample. ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the $t$-test beyond two means.
While the $t$-test is limited to comparing two groups, applying multiple pairwise $t$-tests across several groups inflates the overall (family-wise) Type I error rate, since each additional test adds another chance of a false positive. ANOVA controls this error rate by evaluating the entire set of groups simultaneously, partitioning the observed variance in a particular variable into components attributable to different sources of variation.
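The fundamental mechanism of ANOVA is the partitioning of total variance into two primary components: between-group variance (variation of the group means around the overall mean) and within-group variance (variation of observations around their own group mean).
If the between-group variance is significantly larger than the within-group variance, it indicates that the independent variable has a significant effect on the dependent variable.
The validity of ANOVA relies on three core assumptions: independence of observations, normality of the residuals within each group, and homogeneity of variances across groups.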
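A One-Way ANOVA involves a single independent variable (factor) with two or more categorical levels (in practice usually three or more, since two levels reduce to a $t$-test). The model for an observation $Y_{ij}$ (the $j$-th observation in the $i$-th group) is given by:
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij},$$
where $\mu$ is the overall mean, $\tau_i$ is the effect of the $i$-th group (treatment), and $\epsilon_{ij} \sim N(0, \sigma^2)$ is the random error.
The null hypothesis ($H_0$) states that all group population means are equal (or equivalently, all treatment effects are zero):
$$H_0: \mu_1 = \mu_2 = \dots = \mu_k \quad (\tau_i = 0 \text{ for all } i).$$
The alternative hypothesis ($H_a$) states that at least one population mean is different:
$$H_a: \mu_i \neq \mu_j \text{ for at least one pair } (i, j).$$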
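The Total Sum of Squares ($SST$) is partitioned into the Sum of Squares Between ($SSB$) and the Sum of Squares Within ($SSW$, also known as Error Sum of Squares, $SSE$): $SST = SSB + SSW$.
Total Sum of Squares (SST) measures the total variation in the data:
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2,$$
where $\bar{Y}_{\cdot\cdot}$ is the grand mean.
Sum of Squares Between (SSB) measures the variation of group means around the grand mean:
$$SSB = \sum_{i=1}^{k} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2,$$
where $\bar{Y}_{i\cdot}$ is the mean of the $i$-th group and $n_i$ is the number of observations in the $i$-th group.
Sum of Squares Within (SSW) measures the variation of individual observations around their respective group means:
$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2.$$
Degrees of freedom ($df$) are required to convert sums of squares into variances (mean squares). Let $N$ be the total sample size and $k$ be the number of groups. Then $df_{\text{between}} = k - 1$, $df_{\text{within}} = N - k$, and $df_{\text{total}} = N - 1$.
The Mean Squares ($MS$) are calculated by dividing the Sum of Squares by their respective degrees of freedom:
$$MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k}.$$
The test statistic for ANOVA is the ratio of the Mean Square Between to the Mean Square Within, $F = MSB / MSW$. Under the null hypothesis, both $MSB$ and $MSW$ are independent estimates of the population variance $\sigma^2$, so their ratio follows an $F$-distribution with $k - 1$ and $N - k$ degrees of freedom.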
If the -statistic is significantly larger than 1 (specifically, greater than the critical value from the -distribution for a given alpha level), the null hypothesis is rejected.
A university aims to determine if three different teaching methods (Standard Lecture, Flipped Classroom, Problem-Based Learning) result in different final exam scores. 90 students are randomly assigned to the three methods (30 per method). The resulting Sum of Squares Between (SSB) is calculated as 450, and the Sum of Squares Within (SSW) is 2610.
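The F-statistic for this example follows mechanically from the sums of squares and degrees of freedom. The sketch below carries out the arithmetic; the comparison value of roughly 3.1 for the 5% critical value of $F(2, 87)$ is an approximation quoted from standard tables, not computed here.

```python
k, N = 3, 90                   # number of groups and total sample size
ssb, ssw = 450.0, 2610.0       # sums of squares from the example

df_between = k - 1             # 2
df_within = N - k              # 87
msb = ssb / df_between         # mean square between
msw = ssw / df_within          # mean square within
f_stat = msb / msw

eta_sq = ssb / (ssb + ssw)     # share of total variation between groups

print(df_between, df_within, msb, msw, f_stat)   # 2 87 225.0 30.0 7.5
print(round(eta_sq, 3))
# F = 7.5 is well above roughly 3.1, so H0 of equal means is rejected at the 5% level
```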
A Two-Way ANOVA analyzes the effect of two independent categorical variables (factors) on a continuous dependent variable. It fundamentally differs from running two independent One-Way ANOVAs because it evaluates the interaction effect between the two variables.
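The statistical model for a Two-Way ANOVA with factors $A$ and $B$, fixed effects, and with replication ($n$ observations per cell) is:
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk},$$
where $\mu$ is the overall mean, $\alpha_i$ is the main effect of the $i$-th level of factor $A$, $\beta_j$ is the main effect of the $j$-th level of factor $B$, $(\alpha\beta)_{ij}$ is the interaction effect, and $\epsilon_{ijk} \sim N(0, \sigma^2)$ is the random error for the $k$-th replicate in cell $(i, j)$.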
An interaction effect occurs when the effect of one independent variable on the dependent variable changes depending on the level of the other independent variable. Graphically, this is observed when the lines representing the means across levels of factors are not parallel (they may cross or diverge).
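If the interaction effect is significant, the main effects (the individual effects of factor $A$ and factor $B$) must be interpreted with care, as they no longer fully describe the relationship.
In a balanced design (equal sample sizes in all cells), the total variance is partitioned into four orthogonal components:
$$SST = SSA + SSB + SSAB + SSE,$$
where $SSA$ and $SSB$ are the sums of squares for the main effects of factors $A$ and $B$, $SSAB$ is the interaction sum of squares, and $SSE$ is the error (within-cell) sum of squares.
Degrees of freedom are similarly partitioned. Let $a$ be the number of levels of Factor A, $b$ be the number of levels of Factor B, and $n$ the number of replicates per cell. Total observations $N = abn$. Then $df_A = a - 1$, $df_B = b - 1$, $df_{AB} = (a - 1)(b - 1)$, $df_E = ab(n - 1)$, and $df_{\text{total}} = abn - 1$.
Three distinct $F$-tests are performed by dividing the corresponding Mean Square ($MS$) by the Mean Square Error ($MSE$):
$$F_A = \frac{MSA}{MSE}, \qquad F_B = \frac{MSB}{MSE}, \qquad F_{AB} = \frac{MSAB}{MSE}.$$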
A significant ANOVA only tells you that at least two means differ, not which means differ. To identify specific pairwise differences, post-hoc tests are required. Conducting multiple standard $t$-tests inflates the family-wise error rate (the probability of making at least one Type I error across all tests). For independent tests each run at level $\alpha$:
$$FWER = 1 - (1 - \alpha)^m,$$
where $m$ is the number of comparisons. For 5 groups, there are $\binom{5}{2} = 10$ comparisons. If $\alpha = 0.05$ per test, the family-wise error rate jumps to $1 - 0.95^{10} \approx 0.40$ (assuming independence, which is an oversimplification but illustrates the inflation).
The $p$-value from an $F$-test indicates statistical significance but not practical significance. Effect size metrics quantify the magnitude of the differences between groups.
Eta-squared ($\eta^2$) represents the proportion of total variance in the dependent variable that is associated with membership in the different groups defined by the independent variable: $\eta^2 = SS_{\text{effect}} / SS_{\text{total}}$.
While intuitive, $\eta^2$ is an upwardly biased estimator of the population effect size (it tends to overestimate).
In multi-factor designs (like Two-Way ANOVA), $\eta^2$ can be misleading because the effects of one factor reduce the variance available to be explained by another. Partial eta-squared ($\eta_p^2$) isolates the variance explained by a specific factor relative to the unexplained variance (error) and the variance of that specific factor: $\eta_p^2 = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})$.
Omega-squared ($\omega^2$) is a more complex but less biased estimator of the population variance explained. It corrects for the bias present in $\eta^2$ by incorporating degrees of freedom and Mean Square terms: $\omega^2 = (SS_{\text{effect}} - df_{\text{effect}} \cdot MS_{\text{error}}) / (SS_{\text{total}} + MS_{\text{error}})$.
A researcher conducts a Two-Way ANOVA assessing the impact of Drug Dosage (A) and Therapy (B) on symptom reduction. The output yields the following sums of squares: SSA = 400, SSB = 100, SSAB = 50, SSE = 450. Total SST = 1000.
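The difference between eta-squared and partial eta-squared can be read off these sums of squares. The sketch below computes both for each effect, using the usual definitions (effect sum of squares over the total, and effect sum of squares over effect plus error). The dictionary layout and names are illustrative only.

```python
ss = {"A": 400.0, "B": 100.0, "AB": 50.0}   # effect sums of squares from the example
sse = 450.0                                 # error sum of squares
sst = sum(ss.values()) + sse                # 1000, matching the stated total

# Eta-squared: share of total variation attributed to each effect
eta_sq = {name: value / sst for name, value in ss.items()}

# Partial eta-squared: effect relative to effect plus error only
partial_eta_sq = {name: value / (value + sse) for name, value in ss.items()}

for name in ss:
    print(name, round(eta_sq[name], 3), round(partial_eta_sq[name], 3))
```

Partial eta-squared is never smaller than eta-squared for the same effect, because its denominator omits the variation attributed to the other effects.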
Repeated Measures ANOVA is the equivalent of the one-way ANOVA, but for related, not independent groups. It is the extension of the dependent (paired) -test. Examples include measuring the same participants across multiple time points (e.g., Blood pressure at baseline, week 1, and week 2) or exposing the same participants to all conditions in an experiment.
The key advantage of Repeated Measures ANOVA is that it removes variance attributable to individual differences from the Error Sum of Squares. This typically makes the analysis much more powerful (higher probability of detecting a true effect) than a standard independent-samples ANOVA.
Repeated measures designs require the assumption of Sphericity. Sphericity requires that the variances of the differences between all pairs of related groups are equal. It is evaluated using Mauchly’s Test of Sphericity.
If the assumption of sphericity is violated (Mauchly’s Test $p < 0.05$), the Type I error rate inflates. To correct this, the degrees of freedom are adjusted downwards by a factor $\varepsilon \le 1$. Common corrections include the Greenhouse-Geisser correction (more conservative) and the Huynh-Feldt correction (less conservative), which differ in how they estimate $\varepsilon$.
If $\varepsilon$ equals 1, the sphericity assumption holds exactly; the further $\varepsilon$ falls below 1, the more severe the violation. The corrections effectively increase the critical $F$-value required for significance by multiplying both degrees of freedom by $\varepsilon$.
A time series is a sequence of data points indexed in time order. Formally, a time series is a stochastic process $\{X_t\}$ for $t \in T$, where $T$ is an index set, typically $\mathbb{Z}$ or $\mathbb{N}$ for discrete-time time series. Analysis of time series involves understanding the underlying structure and function that produced the data, often for the purpose of forecasting future values.
The foundational assumption in many time series models is stationarity. A time series is strictly stationary if the joint distribution of $(X_{t_1}, \dots, X_{t_k})$ is identical to that of $(X_{t_1 + h}, \dots, X_{t_k + h})$ for all $k$, all time points $t_1, \dots, t_k$, and all shifts $h$.
In practice, strict stationarity is often too restrictive. Weak stationarity (or wide-sense stationarity) requires only that the first two moments are invariant with respect to time translation: (1) $E[X_t] = \mu$ for all $t$; (2) $\text{Var}(X_t) = \sigma^2 < \infty$ for all $t$; (3) $\text{Cov}(X_t, X_{t+h}) = \gamma(h)$ depends only on the lag $h$.
A sequence of uncorrelated random variables with mean zero and finite, constant variance $\sigma^2$ is termed a white noise process, denoted $\{\epsilon_t\} \sim WN(0, \sigma^2)$. The autocovariance function for white noise is given by $\gamma(h) = \sigma^2$ if $h = 0$, and $\gamma(h) = 0$ otherwise. When the process consists of independent and identically distributed (i.i.d.) random variables, it is termed strictly white noise. Gaussian white noise assumes $\epsilon_t \sim N(0, \sigma^2)$ i.i.d.
A random walk is defined by the process $X_t = X_{t-1} + \epsilon_t$, where $\{\epsilon_t\} \sim WN(0, \sigma^2)$. Expanding this equation yields $X_t = \sum_{i=1}^{t} \epsilon_i$ (assuming $X_0 = 0$). The expected value is $E[X_t] = 0$, but the variance is $\text{Var}(X_t) = t\sigma^2$. Because the variance depends on $t$, a random walk is non-stationary. The covariance between $X_s$ and $X_t$ (where $s < t$) is $\text{Cov}(X_s, X_t) = s\sigma^2$.
Linear time series models capture the linear dependencies between observations.
An autoregressive model of order $p$, denoted AR($p$), models the current value as a linear combination of its $p$ previous values plus a white noise term:
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t.$$
Using the backshift operator $B$, where $B X_t = X_{t-1}$, the AR($p$) model implies $\phi(B) X_t = \epsilon_t$, where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$ is the autoregressive polynomial. For an AR($p$) process to be stationary, all roots of the characteristic equation $\phi(z) = 0$ must lie outside the unit circle in the complex plane ($|z| > 1$). For an AR(1) process $X_t = \phi X_{t-1} + \epsilon_t$, the condition simplifies to $|\phi| < 1$, yielding ACF $\rho(h) = \phi^{|h|}$.
A moving average model of order $q$, denoted MA($q$), expresses $X_t$ as a linear combination of the current and $q$ previous white noise terms:
$$X_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}.$$
Using the moving average polynomial $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$, this is written as $X_t = \theta(B) \epsilon_t$. Every finite-order MA process is stationary because it is a finite linear combination of stationary white noise terms. The autocovariance $\gamma(h) = 0$ for $|h| > q$, so the ACF cuts off after lag $q$.
Invertibility of an MA process ensures that it can be uniquely expressed as an infinite-order AR process. An MA($q$) model is invertible if all roots of $\theta(z) = 0$ lie outside the unit circle.
Combining AR and MA concepts forms the Autoregressive Moving Average model, ARMA($p, q$):
$$\phi(B) X_t = \theta(B) \epsilon_t.$$
Stationarity and invertibility of the ARMA process depend on the roots of $\phi(z)$ and $\theta(z)$ respectively. Time series exhibiting non-stationarity in the mean, such as trends, require differencing. First-order differencing removes linear trends; second-order removes quadratic trends. Applying $d$ differences produces an Autoregressive Integrated Moving Average model, ARIMA($p, d, q$):
$$\phi(B) (1 - B)^d X_t = \theta(B) \epsilon_t.$$
You are building a time series model for daily foreign exchange rates between USD and EUR. The daily log prices $\log P_t$ exhibit wandering behavior resembling a random walk. When you plot the differences $X_t = \log P_t - \log P_{t-1}$, the resulting series mean-reverts to zero. The ACF of $X_t$ shows significant spikes at lags 1 and 2 but is negligible afterwards. The Partial Autocorrelation Function (PACF) gradually decays toward zero.
While the ACF measures the linear dependence between $X_t$ and $X_{t+h}$ inclusive of intermediate effects, the Partial Autocorrelation Function (PACF) isolates the direct correlation. The PACF at lag $h$, denoted $\phi_{hh}$, represents the correlation between $X_t$ and $X_{t+h}$ after removing the linear dependence of both variables on the intermediate values $X_{t+1}, \dots, X_{t+h-1}$.
For an AR($p$) process, the PACF cuts off after lag $p$ ($\phi_{hh} = 0$ for $h > p$). Conversely, for an MA($q$) process, the PACF tails off gradually. This complementary behavior provides the foundation for the Box-Jenkins model identification methodology.
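Applied to the exchange-rate example above, these rules point to an ACF that cuts off after lag 2 with a tailing PACF, which is the signature of an MA(2) for the differenced series, i.e. a candidate ARIMA(0, 1, 2) for the log prices. The sketch below simulates an MA(2) process and computes its sample ACF to show the cut-off; the coefficients `theta1` and `theta2`, the sample size, and the seed are hypothetical.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    return [
        sum((x[t] - mean) * (x[t + h] - mean) for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]

rng = random.Random(7)                  # fixed seed for reproducibility
theta1, theta2 = 0.6, 0.3               # hypothetical MA(2) coefficients
n = 5000

eps = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]
x = [eps[t] + theta1 * eps[t - 1] + theta2 * eps[t - 2] for t in range(2, n + 2)]

acf = sample_acf(x, 5)

# Theoretical MA(2) autocorrelations at lags 1 and 2; zero beyond lag 2
denom = 1 + theta1 ** 2 + theta2 ** 2
rho1 = (theta1 + theta1 * theta2) / denom
rho2 = theta2 / denom
print([round(r, 3) for r in acf], round(rho1, 3), round(rho2, 3))
```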
Time domain analysis emphasizes serial correlations over time lags. Spectral analysis (frequency domain analysis) decomposes the variance of a time series over a continuous spectrum of angular frequencies $\omega \in [-\pi, \pi]$. For a stationary process with autocovariance function $\gamma(h)$, the spectral density function $f(\omega)$ is the Fourier transform of the autocovariance sequence:
$$f(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h) e^{-i \omega h}.$$
The total variance of the process corresponds to the integral over the frequency band:
$$\gamma(0) = \int_{-\pi}^{\pi} f(\omega) \, d\omega.$$
A peak at a specific frequency $\omega_0$ in the spectral density plot implies periodic behavior with cycle length $2\pi / \omega_0$. For white noise, $\gamma(h)$ is zero at all $h \neq 0$, rendering the spectral density perfectly flat: $f(\omega) = \sigma^2 / (2\pi)$.
Filtering operations in the frequency domain allow straightforward manipulation of time series signals. An LTI (Linear Time-Invariant) filter defined by a sequence $\{a_j\}$ applies the convolution $Y_t = \sum_j a_j X_{t-j}$. The frequency response function of the filter is $A(\omega) = \sum_j a_j e^{-i \omega j}$. The spectral density of the filtered output is:
$$f_Y(\omega) = |A(\omega)|^2 f_X(\omega).$$
When assessing joint dynamics of multiple interrelated time series $\mathbf{X}_t = (X_{1t}, \dots, X_{kt})^\top$, univariate ARIMA models are insufficient. The Vector Autoregressive model of order $p$, VAR($p$), generalizes the AR structure to dimension $k$:
$$\mathbf{X}_t = A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \dots + A_p \mathbf{X}_{t-p} + \boldsymbol\epsilon_t,$$
where $A_1, \dots, A_p$ are $k \times k$ coefficient matrices and $\boldsymbol\epsilon_t$ is a $k$-dimensional zero-mean multivariate white noise vector with covariance matrix $\Sigma$.
Stationarity in a VAR system requires that the roots of the determinant equation $\det(I - A_1 z - \dots - A_p z^p) = 0$ fall strictly outside the complex unit circle. VAR models naturally represent Granger causality: a series $X$ Granger-causes a series $Y$ if past observations of $X$ improve the prediction of $Y$ compared with using the past of $Y$ alone.
A more general framework is provided by State-Space Modeling. A state-space model describes the observations $\mathbf{Y}_t$ through an underlying, unobserved state sequence $\mathbf{X}_t$. The model consists of an observation equation and a state (transition) equation:
$$\mathbf{Y}_t = H \mathbf{X}_t + \mathbf{v}_t, \qquad \mathbf{X}_t = F \mathbf{X}_{t-1} + \mathbf{w}_t.$$
Here, $\mathbf{v}_t$ is the observation (measurement) noise and $\mathbf{w}_t$ is the state transition disturbance. The matrices $H$ and $F$ determine how the state maps to the observations and how it evolves over time.
The Kalman filter supplies a recursive mechanism for computing the minimum mean-squared error (MMSE) estimator of the state vector $\mathbf{X}_t$ given the observations up to time $t$, $\mathbf{Y}_1, \dots, \mathbf{Y}_t$. Each iteration alternates between a prediction step and an update (correction) step, in which the Kalman gain adjusts the prediction according to the observed innovation (one-step prediction error).
Standard parametric assumptions often fail for long macroeconomic series because the data-generating mechanism itself shifts over time. A structural break model allows the parameters governing the dynamics to change at one or more breakpoints. Testing for breaks amounts to partitioning the sample at candidate dates (for example, around known systemic shocks) and comparing ARMA models fitted separately on each segment.
Alternatively, ARCH/GARCH frameworks directly model time-varying conditional variance (conditional heteroskedasticity). The Generalized Autoregressive Conditional Heteroskedasticity model, GARCH($p, q$), specifies the conditional variance $\sigma_t^2$ of the innovation $\epsilon_t$ as:
$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2.$$
The GARCH formulation captures volatility clustering, which is central to financial risk modeling.
More advanced approaches include threshold autoregressive (TAR) models, which capture non-linear dynamics, and fractionally integrated (ARFIMA) models, designed for processes with long-range dependence, in which the ACF decays hyperbolically rather than exponentially.
Statistical inference often relies on parametric assumptions, specifically that the population from which the sample is drawn follows a known probability distribution, typically the normal distribution, characterized by a set of parameters (e.g., mean $\mu$ and variance $\sigma^2$). Non-parametric statistics, in contrast, provide procedures for inferring properties of populations that do not rely on restrictive assumptions regarding the underlying parameterized probability distributions.
These methods are essential when sample sizes are small, data are ordinal or nominal, or severe departures from normality are evident. While non-parametric tests are more robust to distributional violations, they generally possess less statistical power compared to their parametric counterparts when the parametric assumptions are actually met.
The sign test is one of the simplest non-parametric tests, used to assess whether the median of a continuous distribution equals a hypothesized value $m_0$. It is the non-parametric alternative to the one-sample t-test.
Let $X_1, \dots, X_n$ be a random sample from a continuous distribution with median $m$. We wish to test the null hypothesis $H_0: m = m_0$.
The test statistic $S$ is defined as the number of sample observations strictly greater than $m_0$. Under $H_0$, each observation has a 0.5 probability of being greater than $m_0$, assuming continuity. Thus, $S$ follows a binomial distribution:
$$S \sim \text{Binomial}(n, 0.5),$$
where $n$ is the effective sample size, discarding any ties where $X_i = m_0$.
For large $n$ (a common rule of thumb is $n$ above roughly 20 to 25), a normal approximation can be used:
$$Z = \frac{S - n/2}{\sqrt{n}/2}.$$
A continuity correction of $\pm 0.5$ is often applied to $S$ for greater accuracy.
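For small samples the binomial null distribution can be used exactly rather than through the normal approximation. The sketch below computes a two-sided exact p-value by doubling the smaller binomial tail (one common convention, capped at 1). The data and hypothesized median are hypothetical, and the function name is illustrative.

```python
from math import comb

def sign_test(data, m0):
    """Return (S, n, p) for a two-sided exact sign test of median = m0."""
    diffs = [x - m0 for x in data if x != m0]   # discard ties with m0
    n = len(diffs)
    s = sum(d > 0 for d in diffs)               # observations above m0
    tail = min(s, n - s)
    # P(Binomial(n, 0.5) <= tail), doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return s, n, min(1.0, p)

# Hypothetical measurements; two of them tie with m0 = 10.0
data = [12.1, 9.8, 11.4, 13.0, 10.0, 12.7, 11.9, 12.3, 10.0, 13.5]
s, n, p_value = sign_test(data, 10.0)
print(s, n, p_value)   # 7 8 0.0703125
```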
The sign test ignores the magnitude of the differences between the observations and the hypothesized median. The Wilcoxon signed-rank test incorporates this magnitude, requiring the assumption that the underlying continuous distribution is symmetric about its median. It serves as a more powerful non-parametric alternative to the paired Student’s t-test or the one-sample t-test.
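Given $n$ pairs of observations $(X_i, Y_i)$ for $i = 1, \dots, n$, compute the differences $D_i = X_i - Y_i$. Discard zero differences, rank the absolute values $|D_i|$ from smallest to largest, and let $W^+$ be the sum of the ranks attached to the positive differences.
Under $H_0$ (symmetric distribution about 0), the expected value and variance of $W^+$ are:
$$E[W^+] = \frac{n(n+1)}{4}, \qquad \text{Var}(W^+) = \frac{n(n+1)(2n+1)}{24}.$$
For large $n$, $W^+$ is approximately normally distributed, permitting the use of a $z$-test.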
When comparing two independent samples to determine if they originate from the same population, the Mann-Whitney U test (or Wilcoxon rank-sum test) offers a non-parametric alternative to the independent two-sample t-test. It assumes the two distributions are identical in shape but potentially shifted in location.
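Let $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$ be independent samples. Pool and rank all $N = n_1 + n_2$ observations, let $R_1$ be the sum of the ranks of the first sample, and define $U = R_1 - n_1(n_1 + 1)/2$.
Under the null hypothesis that $X$ and $Y$ have the same distribution, the expectation and variance of $U$ are:
$$E[U] = \frac{n_1 n_2}{2}, \qquad \text{Var}(U) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}.$$
Ties in the data require an adjustment to the variance formula:
$$\text{Var}(U) = \frac{n_1 n_2}{12} \left[ (N + 1) - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{N(N - 1)} \right],$$
where $g$ is the number of tied groups and $t_j$ is the number of observations in the $j$-th tied group.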
The Kruskal-Wallis H test extends the Mann-Whitney U test to more than two independent groups. It is the non-parametric equivalent of the one-way ANOVA, testing whether independent samples originate from the same distribution.
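Given $k$ groups with sample sizes $n_1, \dots, n_k$ and total observations $N = \sum_i n_i$: rank all $N$ observations together, let $R_i$ be the sum of the ranks in group $i$, and compute
$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N + 1).$$
If the null hypothesis is true (all samples come from the same population) and the sample sizes are sufficiently large (typically $n_i \ge 5$), $H$ is approximately distributed as a chi-square distribution with $k - 1$ degrees of freedom: $H \sim \chi^2_{k-1}$. If the null hypothesis is rejected, post-hoc procedures like Dunn’s test are used for pairwise comparisons to identify which groups differ.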
Evaluating the strength and direction of association between two continuous or ordinal variables without assuming linearity relies on Spearman’s rank correlation coefficient ($\rho$ or $r_s$). It evaluates the monotonic relationship between two variables, contrasting with Pearson’s correlation which evaluates linear relationships.
For $n$ pairs of observations $(X_i, Y_i)$, convert the raw scores to ranks $R(X_i)$ and $R(Y_i)$. Spearman’s $r_s$ is computed analogously to Pearson’s correlation coefficient, but applied to the ranks. When there are no ties, this simplifies to:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$
where $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of corresponding variables.
If there are identical values (ties), the simplified formula utilizing $d_i$ becomes inaccurate, and the standard Pearson correlation formula must be applied directly to the ranked variables.
Values of $r_s$ vary from $-1$ to $+1$, indicating perfect negative or positive monotonic associations, respectively.
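The two routes to Spearman's coefficient can be compared directly. The sketch below assigns average ranks (so ties are handled), applies Pearson's formula to the ranks, and, for tie-free data, checks the result against the simplified formula based on squared rank differences. The data and helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [10, 20, 30, 40, 50]                 # hypothetical tie-free data
y = [2, 1, 4, 3, 5]
r_s = spearman(x, y)

# Simplified formula, valid here because there are no ties
n = len(x)
d_sq = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
r_s_simple = 1 - 6 * d_sq / (n * (n ** 2 - 1))
print(r_s, r_s_simple)
```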
Modern computational power enables simulation-based non-parametric approaches, most notably bootstrapping. Introduced by Bradley Efron, bootstrapping relies on random sampling with replacement from the original dataset.
If we possess a sample $\mathbf{X} = (X_1, \dots, X_n)$ drawn from an unknown distribution $F$, we construct an empirical distribution function $\hat{F}_n$. By drawing $B$ repeated samples of size $n$, with replacement, from $\hat{F}_n$, we generate bootstrap samples $\mathbf{X}^{*1}, \dots, \mathbf{X}^{*B}$.
For a sample statistic $\hat\theta = s(\mathbf{X})$ estimating a parameter $\theta$, we compute the statistic for each bootstrap sample: $\hat\theta^*_b = s(\mathbf{X}^{*b})$. The distribution of $\hat\theta^*_b$ approximates the sampling distribution of $\hat\theta$, enabling the construction of confidence intervals and hypothesis tests without assuming a parametric form.
The bootstrap standard error is the standard deviation of the bootstrap replicates:
$$\widehat{SE}_{\text{boot}} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat\theta^*_b - \bar\theta^* \right)^2},$$
where $\bar\theta^*$ is the mean of the bootstrap estimates. Resampling procedures reduce reliance on asymptotic normality assumptions, making them particularly suitable for complex estimators or for small samples where conventional asymptotic theory is unreliable.
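A minimal sketch of the bootstrap standard error, assuming a small hypothetical sample and the standard library only. It resamples with replacement, recomputes the statistic on each resample, and takes the standard deviation of the replicates. For the sample mean the result can be checked against the plug-in value (the population-style standard deviation of the data divided by $\sqrt{n}$); for the median no such simple formula is available, which is where the bootstrap is most useful.

```python
import math
import random
import statistics

def bootstrap_se(data, stat, n_boot, rng):
    """Bootstrap standard error of stat(data) from n_boot resamples."""
    n = len(data)
    replicates = [
        stat([data[rng.randrange(n)] for _ in range(n)])   # resample with replacement
        for _ in range(n_boot)
    ]
    return statistics.stdev(replicates)                     # uses the 1/(B-1) divisor

rng = random.Random(3)                                      # fixed seed for reproducibility
data = [rng.gauss(50.0, 10.0) for _ in range(30)]           # hypothetical sample

se_mean = bootstrap_se(data, statistics.mean, 2000, rng)
se_median = bootstrap_se(data, statistics.median, 2000, rng)

# Plug-in standard error of the mean, for comparison with se_mean
plug_in = statistics.pstdev(data) / math.sqrt(len(data))
print(se_mean, plug_in, se_median)
```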
Kernel Density Estimation (KDE) establishes a non-parametric perspective on estimating the probability density function of a continuous random variable. Parametric estimation fits a predetermined shape (e.g., normal, gamma) parameterized by equations. KDE estimates the density entirely from data.
Let $X_1, \dots, X_n$ be independent and identically distributed samples drawn from some distribution with an unknown density $f$. The kernel density estimator is:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right),$$
where $K$ is the kernel (a non-negative function integrating to one) and $h > 0$ is a smoothing parameter known as the bandwidth. The bandwidth heavily influences the estimator. Small $h$ induces undersmoothing, yielding high variance (spurious fluctuations), whereas large $h$ causes oversmoothing, yielding high bias (obscuring structural features of the distribution). Standard choices for $K$ include the Gaussian, Epanechnikov, and uniform kernels.
Histograms and KDEs both attempt to model data density non-parametrically. Consider a dataset of highly clustered continuous physical measurements. A histogram forces boundaries at arbitrary bin edges. A KDE smooths out data without fixed bins.
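The effect of the bandwidth can be seen on a small two-cluster dataset. The sketch below implements a Gaussian-kernel density estimate, evaluates it with a narrow and a wide bandwidth, and checks numerically that each estimate integrates to approximately one. The data, bandwidths, and integration grid are hypothetical choices.

```python
import math

def gaussian_kde(data, h):
    """Return the Gaussian-kernel density estimate with bandwidth h as a function."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return f_hat

# Hypothetical measurements forming two clusters, near 1 and near 5
data = [1.0, 1.2, 1.3, 1.9, 4.8, 5.0, 5.1, 5.3]

f_narrow = gaussian_kde(data, 0.2)     # small h: keeps the clusters separate
f_wide = gaussian_kde(data, 2.0)       # large h: blurs the gap between them

# Riemann-sum check that each estimate integrates to about 1
step = 0.01
grid = [-10 + step * i for i in range(2601)]      # covers [-10, 16]
area_narrow = sum(f_narrow(x) for x in grid) * step
area_wide = sum(f_wide(x) for x in grid) * step

print(f_narrow(3.0), f_wide(3.0))      # density in the gap between clusters
print(area_narrow, area_wide)
```

With the narrow bandwidth the estimated density in the gap near 3 is almost zero, while the wide bandwidth assigns it substantial mass, which is the oversmoothing described above.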