Overview

Section Detail

Probability Theory

Probability theory is the mathematical framework for quantifying uncertainty. In its modern formulation, established by Andrey Kolmogorov in 1933, probability is rooted in measure theory, providing a rigorous foundation for statistical inference, stochastic processes, and information theory.

The Probability Space

A formal probability model is defined by a triplet $(\Omega, \mathcal{F}, P)$ , known as a probability space. Each component of this triplet serves a distinct mathematical purpose in capturing the structure of random phenomena.

The sample space $\Omega$ is a non-empty set containing all possible outcomes of an experiment. An element $\omega \in \Omega$ represents a single, highly specific outcome.

The event space $\mathcal{F}$ is a $\sigma$ -algebra on $\Omega$ . A collection of subsets $\mathcal{F} \subseteq 2^\Omega$ is a $\sigma$ -algebra if it satisfies three conditions:

$\Omega \in \mathcal{F}$ .
If $A \in \mathcal{F}$ , then its complement $A^c \in \mathcal{F}$ .
If $A_1, A_2, \dots \in \mathcal{F}$ , then their countable union $\bigcup_{i=1}^\infty A_i \in \mathcal{F}$ .

Elements of $\mathcal{F}$ are called events. The restriction to a $\sigma$ -algebra (rather than the entire power set $2^\Omega$ ) is mathematically necessary when dealing with uncountably infinite sample spaces, such as the real line $\mathbb{R}$ , because not every subset can be assigned a probability consistently: non-measurable sets exist (e.g., Vitali sets on $\mathbb{R}$ ; the Banach-Tarski paradox is the analogous phenomenon in three dimensions).
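
On a finite sample space the three conditions can be checked mechanically. The following is a minimal sketch, assuming a four-element $\Omega$ chosen purely for illustration; the helper name `is_sigma_algebra` is hypothetical. On a finite space, closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the three sigma-algebra conditions on a finite sample space.

    On a finite space, closure under countable unions reduces to closure
    under pairwise unions, so checking pairs is sufficient.
    """
    omega = frozenset(omega)
    family = {frozenset(s) for s in collection}
    if omega not in family:                              # condition 1: Omega is an event
        return False
    if any((omega - a) not in family for a in family):   # condition 2: complements
        return False
    # condition 3: unions (pairwise is enough on a finite family)
    return all((a | b) in family for a, b in combinations(family, 2))

omega = {1, 2, 3, 4}
coarse = [set(), {1, 2}, {3, 4}, omega]        # generated by the partition {1,2}, {3,4}
broken = [set(), {1}, {2, 3, 4}, {2}, omega]   # the complement of {2} is missing

print(is_sigma_algebra(omega, coarse))   # True
print(is_sigma_algebra(omega, broken))   # False
```

The second collection fails because it contains $\{2\}$ but not its complement $\{1, 3, 4\}$ .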

The probability measure $P$ is a function $P: \mathcal{F} \to [0, 1]$ satisfying Kolmogorov’s axioms:

Non-negativity: $P(A) \ge 0$ for all $A \in \mathcal{F}$ .
Unit measure: $P(\Omega) = 1$ .
Countable additivity: For any countable sequence of pairwise disjoint events $A_1, A_2, \dots$ (where $A_i \cap A_j = \emptyset$ for $i \neq j$ ), $P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)$

From these axioms, further properties follow directly. For example, the probability of the empty set must be $0$ . Apply countable additivity to the pairwise disjoint sequence $A_1 = \Omega$ , $A_2 = A_3 = \dots = \emptyset$ , whose union is $\Omega$ : $P(\Omega) = P(\Omega) + \sum_{i=2}^\infty P(\emptyset)$ . Since $P(\Omega) = 1$ is finite, the series on the right can only converge if $P(\emptyset) = 0$ .

Which of the following is NOT required for a collection of subsets to form a $\sigma$ -algebra?

Independence and Conditional Probability

Two events $A$ and $B$ are independent if the occurrence of one does not alter the probability of the other. Mathematically, this is defined as: $P(A \cap B) = P(A)P(B)$

When events are not independent, partial information changes our uncertainty. The conditional probability of an event $A$ given that event $B$ has occurred (with $P(B) > 0$ ) is defined as: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Rearranging this definition yields the multiplication rule $P(A \cap B) = P(A \mid B)P(B)$ . This straightforward algebraic manipulation leads to Bayes’ Theorem, a foundational result tying forward and inverse probabilities: $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$

The denominator $P(B)$ is often expanded using the Law of Total Probability. For a partition $A_1, A_2, \dots, A_n$ of the sample space $\Omega$ , we have: $P(B) = \sum_{i=1}^n P(B \mid A_i)P(A_i)$
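
The two formulas combine naturally in code. The sketch below uses a hypothetical three-machine factory (the machine shares and defect rates are invented for illustration) and exact fractions, so the Law of Total Probability and Bayes' Theorem can be checked without rounding.

```python
from fractions import Fraction

def posterior(prior, likelihoods):
    """Posterior P(A_i | B) for a partition A_1..A_n.

    prior[i]       = P(A_i), must sum to 1
    likelihoods[i] = P(B | A_i)
    """
    # Law of Total Probability: P(B) = sum_i P(B | A_i) P(A_i)
    p_b = sum(l * p for l, p in zip(likelihoods, prior))
    # Bayes' Theorem applied to each cell of the partition
    return [l * p / p_b for l, p in zip(likelihoods, prior)]

# Hypothetical numbers: three machines produce 50%, 30%, 20% of all parts,
# with defect rates 1%, 2%, 5%. B = "a randomly chosen part is defective".
prior = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]
defect_rate = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]

post = posterior(prior, defect_rate)
print(post)   # posterior share of each machine among defective parts
```

Although the third machine makes only a fifth of the parts, it accounts for the largest share of the defective ones, because its defect rate dominates the numerator.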

Medical Testing Accuracy

A disease affects 1% of a population. A diagnostic test correctly identifies the disease 99% of the time when a patient is infected (true positive). However, it also incorrectly indicates disease 5% of the time for healthy patients (false positive). A randomly selected individual tests positive.

Using Bayes' Theorem, what is the exact probability that the individual actually has the disease?

Random Variables and Distributions

A random variable is not a variable, nor is it inherently random. It is a deterministic function $X: \Omega \to \mathbb{R}$ that maps outcomes to real numbers. Crucially, $X$ must be a measurable function. This means that for any Borel set $B \subseteq \mathbb{R}$ , its preimage must be an event in our $\sigma$ -algebra: $X^{-1}(B) = \{ \omega \in \Omega : X(\omega) \in B \} \in \mathcal{F}$

The probability distribution of $X$ is completely determined by its Cumulative Distribution Function (CDF), $F_X(x)$ , defined as: $F_X(x) = P(X \le x) = P(\{ \omega \in \Omega : X(\omega) \le x \})$ . Every valid CDF is right-continuous and monotonically non-decreasing, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$ .

Discrete vs. Continuous Distributions

A random variable is discrete if it takes values in a countable set. It is described by a Probability Mass Function (PMF) $p_X(x) = P(X = x)$ . A random variable is continuous if there exists a non-negative Lebesgue-integrable function $f_X(x)$ , called the Probability Density Function (PDF), such that: $F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$ . For continuous variables, the probability of any single point is zero: $P(X=x) = 0$ . Positive probability is carried only by sets of positive length, such as intervals: $P(a \le X \le b) = \int_a^b f_X(t) \, dt$ .
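
For a discrete variable the CDF is a step function built by accumulating the PMF. A minimal sketch, assuming a fair six-sided die as the example distribution:

```python
from fractions import Fraction

# PMF of a fair six-sided die (an assumed example).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x): sum the PMF over all support points not exceeding x."""
    return sum(p for k, p in pmf.items() if k <= x)

# The CDF is flat between support points and jumps by P(X = k) exactly at k.
print(cdf(0.5), cdf(3), cdf(3.999), cdf(4), cdf(10))
print(cdf(4) - cdf(3.999) == pmf[4])
```

The jump at $x = 4$ equals $P(X = 4)$ , and the value at the jump point belongs to the upper step, which is what right-continuity means for a discrete CDF.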

Which of the following statements about the Cumulative Distribution Function (CDF) is always mathematically accurate for any random variable?

Expected Value: The Lebesgue Perspective

The expected value $\mathbb{E}[X]$ of a random variable is the probability-weighted average of all its possible values. In an elementary context, it is formulated as a sum for discrete variables $\sum x_i p(x_i)$ and a Riemann integral for continuous variables $\int x f(x) dx$ .

A more unified, rigorous approach utilizes the Lebesgue integral over the probability space: $\mathbb{E}[X] = \int_{\Omega} X(\omega) \, dP(\omega)$ This single definition naturally covers discrete, continuous, and mixed random variables, treating probability distributions simply as specific measures.

The expected value possesses the critical property of linearity. For any random variables $X$ and $Y$ with finite expectations, and constants $a, b \in \mathbb{R}$ : $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ . Linearity holds whether $X$ and $Y$ are independent or heavily correlated.
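
Linearity can be verified exactly on a small example where the two variables are as dependent as possible. The sketch below assumes $X$ is a fair die roll and $Y = X^2$ ; the constants $a = 2$ , $b = -3$ are arbitrary.

```python
from fractions import Fraction

# X is a fair die roll and Y = X**2, so Y is completely determined by X.
outcomes = range(1, 7)
p = Fraction(1, 6)

def expect(g):
    """E[g(X)] as a probability-weighted sum over the six outcomes."""
    return sum(p * g(x) for x in outcomes)

e_x = expect(lambda x: x)            # 7/2
e_y = expect(lambda x: x**2)         # 91/6
a, b = 2, -3

lhs = expect(lambda x: a * x + b * x**2)   # E[aX + bY] computed directly
rhs = a * e_x + b * e_y                    # a E[X] + b E[Y]
print(lhs, rhs, lhs == rhs)

# Dependence shows up in products, not sums: E[XY] differs from E[X]E[Y].
print(expect(lambda x: x * x**2), e_x * e_y)
```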

Variance and Moments

To quantify the dispersion or spread of a probability distribution around its center, we examine the second central moment, the variance: $\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ The variance strictly requires that $\mathbb{E}[X^2]$ (the second moment) is finite. Unlike expectation, variance is not a linear operator. For constants $a, b$ : $\text{Var}(aX + b) = a^2 \text{Var}(X)$ For the sum of two random variables, the variance is given by: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$ If $X$ and $Y$ are independent, their covariance $\text{Cov}(X, Y)$ is zero, rendering the variance strictly additive.
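
The variance identities above can be confirmed by exact enumeration. This sketch assumes a hypothetical joint PMF on four points in which $X$ and $Y$ are positively correlated, so the covariance term is visibly non-zero.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y) on four points; X and Y are positively correlated.
joint = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(4, 10),
}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def var(g):
    """Var(g(X, Y)) = E[g^2] - (E[g])^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

var_sum = var(lambda x, y: x + y)
print(var_sum, var_x + var_y + 2 * cov_xy)        # the two sides of the sum formula
print(var(lambda x, y: 3 * x + 7), 9 * var_x)     # Var(aX + b) = a^2 Var(X)
```

Dropping the covariance term here would understate the variance of the sum, which is why additivity needs uncorrelated (for instance, independent) variables.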

Linear Transformations of Portfolios

A quantitative analyst models the daily return of two technology stocks, A and B. Both stocks have an expected daily return of 2% and a standard deviation of 4%. The returns are uncorrelated. The analyst constructs a portfolio that heavily weights stock A: they hold 3 dollars' worth of Stock A and a short position of 1 dollar in Stock B to hedge.

What is the variance of the daily return of this portfolio $P = 3A - B$ ?

Limits and Asymptotic Theorems

These tools become far more useful when we consider sequences of random variables $X_1, X_2, \dots$ . Often, we are concerned with sums of independent and identically distributed (i.i.d.) random variables.

Two foundational theorems act as the bedrock for modern statistics.

Law of Large Numbers (LLN): Let $X_1, X_2, \dots$ be an i.i.d. sequence of random variables with finite expectation $\mu$ . The sample average $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ converges to the expected value $\mu$ as $n \to \infty$ . The Strong Law ensures almost sure convergence ( $P(\lim_{n \to \infty} \bar{X}_n = \mu) = 1$ ), whereas the Weak Law guarantees convergence in probability.
Central Limit Theorem (CLT): If the sequence also possesses a finite variance $\sigma^2 > 0$ , the standardized sample average converges in distribution to the standard normal distribution $\mathcal{N}(0,1)$ : $\lim_{n \to \infty} P \left( \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \le z \right) = \Phi(z)$ where $\Phi(z)$ is the CDF of the standard normal distribution.

The power of the CLT stems from how few distributional assumptions it makes: beyond independence, identical distribution, and finite variance, the shape of the original variable $X$ does not matter. Whether $X$ is discrete, highly skewed, or uniform, the standardized sample average is approximately normal for large $n$ , which underpins much large-scale modeling and many parametric tests.
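
A small simulation makes the CLT statement concrete. The sketch below assumes exponentially distributed data (a strongly right-skewed choice) with $n = 200$ and 5000 replications; these settings and the seed are arbitrary. It compares the empirical CDF of the standardized sample averages with $\Phi(z)$ at a few points.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is reproducible

def standardized_mean(n, rate=1.0):
    """Draw n i.i.d. Exponential(rate) values and standardize their average."""
    mu = sigma = 1.0 / rate            # an exponential has mean = sd = 1/rate
    xbar = sum(random.expovariate(rate) for _ in range(n)) / n
    return (xbar - mu) / (sigma / n ** 0.5)

n, reps = 200, 5000
z_values = [standardized_mean(n) for _ in range(reps)]

# Compare the empirical CDF of the standardized means with Phi at a few points.
for z in (-1.0, 0.0, 1.0, 1.96):
    empirical = sum(v <= z for v in z_values) / reps
    print(f"z={z:5.2f}  empirical={empirical:.3f}  Phi={NormalDist().cdf(z):.3f}")
```

The agreement is approximate, not exact: for finite $n$ a residual skewness remains, and it shrinks as $n$ grows.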

What is the primary condition required by the Central Limit Theorem for the sample average sequence to converge to a normal distribution?

Hypothesis Testing

Hypothesis testing is a formal mathematical framework for making inferential decisions about population parameters based on sample data. It provides a structured methodology to evaluate whether observed data yields sufficient evidence to reject a predefined baseline assumption.

The Null and Alternative Hypotheses

The foundation of any statistical test consists of two mutually exclusive statements about a population parameter: the null hypothesis ( $H_0$ ) and the alternative hypothesis ( $H_a$ or $H_1$ ).

The null hypothesis ( $H_0$ ) typically represents a state of no effect, no difference, or the historical baseline. It is the hypothesis that is assumed true until statistical evidence indicates otherwise.

The alternative hypothesis ( $H_a$ ) represents the claim or theory that the researcher asserts is true, provided the sample data provides sufficient evidence to reject $H_0$ .

For a population mean $\mu$ evaluated against a hypothesized value $\mu_0$ , tests are formulated in one of three ways:

Two-tailed test: $H_0: \mu = \mu_0 \quad \text{vs.} \quad H_a: \mu \neq \mu_0$
Right-tailed test (Upper-tailed): $H_0: \mu \le \mu_0 \quad \text{vs.} \quad H_a: \mu > \mu_0$
Left-tailed test (Lower-tailed): $H_0: \mu \ge \mu_0 \quad \text{vs.} \quad H_a: \mu < \mu_0$

The objective of the testing procedure is not to “prove” $H_0$ , but rather to determine whether there is enough evidence to reject it in favor of $H_a$ .

Decision Errors in Inference

Because hypothesis testing relies on sample data rather than an exhaustive population census, inferential decisions are subject to probabilistic errors.

Type I Error ( $\alpha$ )

A Type I Error occurs when the null hypothesis is rejected when it is, in fact, true in the population. This is equivalent to a false positive. The probability of committing a Type I error is denoted by $\alpha$ , which is also strictly defined as the significance level of the test.

$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$

Type II Error ( $\beta$ )

A Type II Error occurs when the null hypothesis is not rejected when the alternative hypothesis is true. This is a false negative. The probability of a Type II error is denoted by $\beta$ .

$\beta = P(\text{Fail to reject } H_0 \mid H_a \text{ is true})$

In a criminal trial setting where $H_0$ is 'the defendant is innocent', what is the consequence of a Type I error?

Statistical Power

The power of a statistical test is the probability of correctly rejecting a false null hypothesis. It is the complement of the Type II error rate.

$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_a \text{ is true})$

Power depends on several factors: the significance level $\alpha$ , the sample size $n$ , the true effect size (the magnitude of the difference between the true parameter and $\mu_0$ ), and the population variance $\sigma^2$ . Increasing sample size generally increases the power of a test.
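
The dependence of power on these factors can be made explicit in one special case. The sketch below assumes a right-tailed test on a normal mean with known $\sigma$ , using the standardized statistic $Z = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$ . If the true mean is $\mu_0 + \delta$ , then $Z \sim N(\delta\sqrt{n}/\sigma, 1)$ and the power is $1 - \Phi(z_\alpha - \delta\sqrt{n}/\sigma)$ . The function name and the numbers are illustrative.

```python
from statistics import NormalDist

def power_right_tailed_z(delta, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test when the true mean is mu_0 + delta.

    Under this alternative Z ~ N(delta*sqrt(n)/sigma, 1), and H_0 is rejected
    when Z exceeds the critical value z_alpha.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    shift = delta * n ** 0.5 / sigma
    return 1 - std_normal.cdf(z_alpha - shift)

# Hypothetical effect size 0.5 and sigma 2: power rises with the sample size.
for n in (10, 25, 50, 100):
    print(n, round(power_right_tailed_z(delta=0.5, sigma=2.0, n=n), 3))
```

With $\delta = 0$ the formula returns $\alpha$ , which is the Type I error rate, as expected.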

Test Statistics and the Z-Test

A test statistic is a standardized value calculated from sample data during a hypothesis test. It measures the degree of agreement between the sample data and the null hypothesis.

Consider testing the mean of a normally distributed population with a known variance $\sigma^2$ . Let $X_1, X_2, \dots, X_n$ be an independent and identically distributed (i.i.d.) random sample from $N(\mu, \sigma^2)$ . The sample mean $\bar{X}$ follows a normal distribution:

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

Under the null hypothesis $H_0: \mu = \mu_0$ , the test statistic $Z$ is constructed by standardizing $\bar{X}$ :

$Z = \frac{\bar{X} - \mu_0}{\frac{\sigma}{\sqrt{n}}}$

If $H_0$ is true, the test statistic $Z$ follows a standard normal distribution, $Z \sim N(0, 1)$ . This known reference distribution tells us how probable any given range of $Z$ values is when $H_0$ holds.

The Rejection Region (Critical Value Approach)

The rejection region is the set of values for the test statistic that leads to the rejection of $H_0$ . Its boundaries are determined by the critical values, which depend on the pre-specified significance level $\alpha$ and the directionality of the test.

For a two-tailed test at significance level $\alpha$ , the critical values are $\pm z_{\alpha/2}$ . The decision rule is: Reject $H_0$ if $|Z| > z_{\alpha/2}$ .

For instance, when $\alpha = 0.05$ , $z_{0.025} \approx 1.96$ . Therefore, if the calculated $Z$ falls outside the interval $[-1.96, 1.96]$ , $H_0$ is rejected.
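
The critical value approach is short enough to sketch directly. The function name and the data below are hypothetical; the critical value $z_{\alpha/2}$ comes from the standard library's inverse normal CDF.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (Z, critical value, reject?) for H_0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    return z, z_crit, abs(z) > z_crit

# Hypothetical data: n = 36 observations, sample mean 52.5, H_0: mu = 50, sigma = 6.
z, z_crit, reject = two_tailed_z_test(xbar=52.5, mu0=50.0, sigma=6.0, n=36)
print(round(z, 3), round(z_crit, 3), reject)
```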

Manufacturing Quality Control

A factory produces steel cables with a specified mean breaking strength of $10,000$ N and a known standard deviation of $400$ N. A quality control engineer suspects the machinery needs calibration and takes a random sample of $n = 50$ cables. The sample mean breaking strength is $9,880$ N. The engineer runs a two-tailed hypothesis test with $\alpha = 0.05$.

Based on the sample data, what is the value of the test statistic $Z$, and does the engineer reject the null hypothesis?

The P-Value Approach

Modern statistical software generally reports the p-value, an alternative to the critical value approach that provides more granular information regarding the strength of the evidence against $H_0$ .

The p-value is defined as the probability, calculated under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed.

For the standard normal test statistic $Z_{obs}$ :

Two-tailed test: $p = 2 \cdot P(Z \ge |Z_{obs}|)$
Right-tailed test: $p = P(Z \ge Z_{obs})$
Left-tailed test: $p = P(Z \le Z_{obs})$

Decision Rule:

If $p \leq \alpha$ , reject $H_0$ .
If $p > \alpha$ , fail to reject $H_0$ .

A smaller p-value constitutes stronger evidence against the null hypothesis. It is crucial to note that the p-value is not the probability that the null hypothesis is true ( $P(H_0 \mid \text{data})$ ). It is a statement about the data under the null hypothesis: the probability, computed assuming $H_0$ , of a test statistic at least as extreme as the one observed.
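
The three p-value formulas and the decision rule translate directly into code. A minimal sketch, with a hypothetical helper `z_p_value` and invented observed statistics:

```python
from statistics import NormalDist

def z_p_value(z_obs, tail="two"):
    """p-value of an observed z statistic under H_0 (Z ~ N(0, 1))."""
    phi = NormalDist().cdf
    if tail == "two":
        return 2 * (1 - phi(abs(z_obs)))
    if tail == "right":
        return 1 - phi(z_obs)
    if tail == "left":
        return phi(z_obs)
    raise ValueError("tail must be 'two', 'right', or 'left'")

alpha = 0.05
for z_obs in (1.5, 2.5):          # hypothetical observed statistics
    p = z_p_value(z_obs)
    decision = "reject H_0" if p <= alpha else "fail to reject H_0"
    print(f"z={z_obs}: p={p:.4f} -> {decision}")
```

The p-value rule and the critical value rule always agree: $p \le \alpha$ exactly when the statistic falls in the rejection region.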

A researcher conducts a hypothesis test and obtains a p-value of 0.034. Does this mean there is a 3.4% chance that the null hypothesis is true?

The Student’s t-Test

In practical applications, the population variance $\sigma^2$ is almost always unknown. Replacing the population standard deviation $\sigma$ with the sample standard deviation $s$ changes the distribution of the test statistic.

When $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ but $\sigma$ is unknown, the test statistic follows a Student’s t-distribution with $n - 1$ degrees of freedom ( $df$ ):

$t = \frac{\bar{X} - \mu_0}{\frac{s}{\sqrt{n}}} \sim t_{n-1}$

The t-distribution is symmetric and bell-shaped like the standard normal distribution but possesses heavier tails. These heavier tails place more probability in the extremes, reflecting the additional uncertainty incurred by estimating the variance from a finite sample. As $n \to \infty$ , the t-distribution converges to the standard normal distribution $N(0,1)$ .
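
The practical effect of the heavier tails can be seen by simulation. The sketch below assumes samples of size $n = 5$ drawn under $H_0$ and counts how often $|t|$ exceeds the normal critical value $1.96$ ; the sample size, replication count, and seed are arbitrary. Using normal critical values with a t statistic rejects a true $H_0$ far more often than the nominal 5%.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed for reproducibility

def t_statistic(sample, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

n, reps = 5, 10000
z_crit = NormalDist().inv_cdf(0.975)      # about 1.96

# Simulate under H_0: samples from N(0, 1), testing mu_0 = 0.
exceed = sum(
    abs(t_statistic([random.gauss(0, 1) for _ in range(n)], 0.0)) > z_crit
    for _ in range(reps)
)
tail_rate = exceed / reps
print(f"P(|t| > {z_crit:.2f}) with n={n}: about {tail_rate:.3f} (normal tails give 0.050)")
```

This is why the t-test uses critical values from $t_{n-1}$ , which are larger than $\pm 1.96$ for small $n$ .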

Multiple Hypothesis Testing

When conducting multiple hypothesis tests simultaneously on a single dataset, the probability of committing at least one Type I error compounds. If a researcher conducts $m$ independent tests each at significance level $\alpha$ , the family-wise error rate (FWER)—the probability of making one or more false discoveries—is given by:

$\text{FWER} = 1 - (1 - \alpha)^m$

For example, performing 20 tests at $\alpha = 0.05$ yields an FWER of $\approx 0.64$ . Without correction, false positives are extremely likely.

The Bonferroni Correction

The simplest method to control the FWER is the Bonferroni correction, which is also among the most conservative. To maintain a given family-wise $\alpha_{FWER}$ , each individual test is evaluated at a newly adjusted significance level:

$\alpha_{individual} = \frac{\alpha_{FWER}}{m}$

If 20 tests are conducted and the desired global false positive rate is 5%, each individual p-value must be compared against $\alpha_{individual} = 0.05 / 20 = 0.0025$ .

While mathematically rigorous and guaranteed to bound the FWER under all forms of dependence among tests, the Bonferroni correction reduces statistical power: the per-test threshold shrinks in proportion to $1/m$ , so Type II error rates rise sharply when the number of tests ( $m$ ) is large, as is common in genomics and machine learning.
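
The FWER formula and the Bonferroni threshold can be checked numerically. The sketch below reuses the 20-test, 5% setting from the text; the list of p-values is invented for illustration.

```python
def fwer(alpha, m):
    """FWER for m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m

m, target = 20, 0.05
alpha_individual = target / m            # Bonferroni-adjusted level

print(round(fwer(target, m), 4))             # uncorrected family-wise error rate
print(alpha_individual)                      # per-test threshold
print(round(fwer(alpha_individual, m), 4))   # after correction, stays below the target

# Hypothetical p-values from m = 20 tests (only five are non-trivial).
p_values = [0.001, 0.004, 0.012, 0.03, 0.2] + [0.5] * 15
rejected = [i for i, p in enumerate(p_values) if p <= alpha_individual]
print(rejected)   # indices of hypotheses rejected after correction
```

Four of the hypothetical p-values fall below the uncorrected 0.05 level, but only one survives the corrected threshold, which illustrates the loss of power.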

Statistics

Statistical Inference

Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution. Whereas probability theory deduces the behavior of a sample given known population parameters, statistical inference reasons in the opposite direction, from an observed sample to the population parameters.

Formally, we observe a sample $\mathbf{X} = (X_1, X_2, \dots, X_n)$ which we assume is generated from a probability model belonging to a known family of distributions $\mathcal{P} = \lbrace P_\theta : \theta \in \Theta \rbrace$ , where $\theta$ is an unknown parameter vector and $\Theta$ is the parameter space. The objective is to estimate $\theta$ or make decisions about it.

Point Estimation

A point estimator $\hat{\theta}$ is any statistic (a function of the data $\mathbf{X}$ that does not depend on any unknown parameters) used to infer the value of an unknown parameter $\theta$ in a statistical model. We denote the estimator as $\hat{\theta}(\mathbf{X})$ and the estimate (the realized value for a specific sample $\mathbf{x}$ ) as $\hat{\theta}(\mathbf{x})$ .

Desirable Properties of Point Estimators

How do we decide if an estimator $\hat{\theta}$ is “good”? We evaluate its statistical properties across all possible samples of size $n$ .

1. Unbiasedness: An estimator $\hat{\theta}$ is unbiased for $\theta$ if its expected value over all possible samples equals the true parameter value: $\mathbb{E}_\theta[\hat{\theta}] = \theta \quad \forall \theta \in \Theta$ The bias of an estimator is defined as $\text{Bias}(\hat{\theta}) = \mathbb{E}_\theta[\hat{\theta}] - \theta$ . While unbiasedness is intuitively appealing, it is not always strictly necessary, especially if allowing a small bias significantly reduces the estimation error.

2. Mean Squared Error (MSE): A common measure of the quality of an estimator is its Mean Squared Error: $\text{MSE}(\hat{\theta}) = \mathbb{E}_\theta[(\hat{\theta} - \theta)^2]$ Using the definitions of variance and bias, the MSE can be decomposed into: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + (\text{Bias}(\hat{\theta}))^2$ If an estimator is unbiased, its MSE is exactly its variance.

3. Consistency: An estimator $\hat{\theta}_n$ (subscript $n$ emphasizes dependence on sample size) is consistent if it converges in probability to the true parameter value as the sample size $n \to \infty$ : $\forall \epsilon > 0, \lim_{n \to \infty} P_\theta(|\hat{\theta}_n - \theta| > \epsilon) = 0$ . Consistency means that as the amount of data grows, the estimator concentrates arbitrarily tightly around the true parameter.
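
The bias-variance decomposition of the MSE, and the remark that a small bias can pay for itself, can both be illustrated by simulation. The sketch below assumes normal data with true variance 4 and $n = 10$ , and compares two estimators of the variance computed from the same simulated samples: dividing the sum of squared deviations by $n - 1$ (unbiased) and by $n$ (biased). All settings are illustrative.

```python
import random
from statistics import mean, pvariance

random.seed(2)  # fixed seed for reproducibility

true_var, n, reps = 4.0, 10, 5000   # data are N(0, true_var); all values assumed

# One sum of squared deviations per simulated sample.
ss = []
for _ in range(reps):
    x = [random.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss.append(sum((xi - xbar) ** 2 for xi in x))

def summarize(estimates):
    """Empirical bias, variance, and MSE of a list of estimates of true_var."""
    bias = mean(estimates) - true_var
    var = pvariance(estimates)
    mse = mean([(e - true_var) ** 2 for e in estimates])
    return bias, var, mse

unbiased = summarize([s / (n - 1) for s in ss])
biased = summarize([s / n for s in ss])

for label, (bias, var, mse) in (("divide by n-1", unbiased), ("divide by n", biased)):
    print(f"{label}: bias={bias:+.3f} var={var:.3f} mse={mse:.3f} var+bias^2={var + bias**2:.3f}")
```

In this setting the biased estimator has the smaller MSE: its reduction in variance outweighs the squared bias it introduces.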

Method of Moments

The Method of Moments (MoM) is one of the oldest methods of deriving point estimators. It is based on equating the sample moments to the population moments, thereby obtaining a system of equations to solve for the unknown parameters.

The $k$ -th population moment is a function of the parameter vector $\theta$ : $\mu_k(\theta) = \mathbb{E}_\theta[X^k]$ The $k$ -th sample moment is calculated from the data: $m_k = \frac{1}{n} \sum_{i=1}^n X_i^k$

If we have $p$ unknown parameters, $\theta = (\theta_1, \theta_2, \dots, \theta_p)$ , we set up a system of $p$ equations: $\mu_j(\theta_1, \dots, \theta_p) = m_j \quad \text{for } j = 1, 2, \dots, p$ Solving this system yields the Method of Moments estimator $\hat{\theta}_{MoM}$ .
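
As a two-parameter illustration, consider a $N(\mu, \sigma^2)$ model, where $\mu_1 = \mu$ and $\mu_2 = \sigma^2 + \mu^2$ . Equating these to $m_1$ and $m_2$ and solving gives $\hat{\mu} = m_1$ and $\hat{\sigma}^2 = m_2 - m_1^2$ . The sketch below applies this to simulated data with assumed true values $\mu = 3$ and $\sigma^2 = 4$ ; function names are hypothetical.

```python
import random

random.seed(3)  # fixed seed for reproducibility

def sample_moment(data, k):
    """k-th sample moment m_k = (1/n) * sum(x_i ** k)."""
    return sum(x ** k for x in data) / len(data)

def mom_normal(data):
    """Method of Moments for N(mu, sigma^2).

    Population moments: mu_1 = mu and mu_2 = sigma^2 + mu^2.
    Setting mu_1 = m_1 and mu_2 = m_2 and solving gives the estimators below.
    """
    m1, m2 = sample_moment(data, 1), sample_moment(data, 2)
    return m1, m2 - m1 ** 2

# Simulated data with assumed true values mu = 3 and sigma^2 = 4.
data = [random.gauss(3.0, 2.0) for _ in range(10000)]
mu_hat, sigma2_hat = mom_normal(data)
print(round(mu_hat, 3), round(sigma2_hat, 3))
```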

Consider a sample $X_1, \dots, X_n$ from a continuous Uniform $(0, \theta)$ distribution. What is the expected value $\mathbb{E}[X]$ and the corresponding Method of Moments estimator for $\theta$?

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a formal, unified approach to parameter estimation. It frames estimation as finding the parameter value that makes the observed data “most probable” or “most likely” to have occurred.

Let $f(x \mid \theta)$ be the probability density function (PDF) or probability mass function (PMF) of our distribution. Given an observed sample $\mathbf{x} = (x_1, \dots, x_n)$ of independent and identically distributed (i.i.d.) random variables, the likelihood function is the joint density evaluated at the observed data, viewed as a function of the parameter $\theta$ : $L(\theta \mid \mathbf{x}) = \prod_{i=1}^n f(x_i \mid \theta)$

The Maximum Likelihood Estimator $\hat{\theta}_{MLE}$ is the value $\theta \in \Theta$ that maximizes $L(\theta \mid \mathbf{x})$ . Because the natural logarithm is a strictly increasing function, it is computationally and analytically easier to maximize the log-likelihood function: $\ell(\theta \mid \mathbf{x}) = \ln L(\theta \mid \mathbf{x}) = \sum_{i=1}^n \ln f(x_i \mid \theta)$

Assuming standard regularity conditions (e.g., differentiability with respect to $\theta$ and the support of the distribution not depending on $\theta$ ), the MLE can be found by solving the score equation: $\frac{\partial}{\partial \theta} \ell(\theta \mid \mathbf{x}) = 0$ and verifying that the second derivative is negative (Concavity).
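
As a worked instance, take an Exponential model with rate $\lambda$ , where $\ell(\lambda \mid \mathbf{x}) = n \ln \lambda - \lambda \sum x_i$ . The score equation $n/\lambda - \sum x_i = 0$ gives $\hat{\lambda}_{MLE} = 1/\bar{x}$ , and $\ell''(\lambda) = -n/\lambda^2 < 0$ confirms a maximum. The sketch below checks the closed form against a numerical maximizer; the ternary search is just one simple choice for a concave function, and the true rate 2.5 is an assumption of the simulation.

```python
import random
from math import log

random.seed(4)  # fixed seed for reproducibility

def log_likelihood(rate, data):
    """Exponential log-likelihood: n * ln(rate) - rate * sum(x_i)."""
    return len(data) * log(rate) - rate * sum(data)

def maximize_concave(f, lo, hi, tol=1e-9):
    """Ternary search for the maximizer of a strictly concave function on [lo, hi]."""
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1      # the maximizer lies to the right of m1
        else:
            hi = m2      # the maximizer lies to the left of m2
    return (lo + hi) / 2

# Simulated data with an assumed true rate of 2.5.
data = [random.expovariate(2.5) for _ in range(2000)]

numeric_mle = maximize_concave(lambda r: log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)     # root of the score equation n/rate - sum(x) = 0
print(round(numeric_mle, 4), round(closed_form, 4))
```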

Properties of the MLE

Under mild regularity conditions, the MLE has remarkable asymptotic properties:

Consistency: $\hat{\theta}_{MLE} \xrightarrow{p} \theta$ .
Equivariance: If $g(\theta)$ is a function of $\theta$ , then the MLE of $g(\theta)$ is $g(\hat{\theta}_{MLE})$ .
Asymptotic Normality and Efficiency: The distribution of the MLE approaches a Normal distribution as $n \to \infty$ , and its asymptotic variance attains the Cramér-Rao lower bound, the smallest achievable among regular estimators: $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ , where $I(\theta)$ is the Fisher Information of a single observation.

MLE vs MoM on the Uniform Distribution

We have an i.i.d. sample $X_1, X_2, \dots, X_n$ from $U(0, \theta)$. We previously saw that the MoM estimator is $\hat{\theta}_{MoM} = 2\bar{X}$. Now, let's derive the MLE.

Determine the MLE $\hat{\theta}_{MLE}$ and compare it to the MoM estimator.

Sufficiency

A statistic $T(\mathbf{X})$ is sufficient for $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\theta$ . Intuitively, $T(\mathbf{X})$ contains all the information in the sample about $\theta$ ; no other function of the data can provide further insights regarding the value of $\theta$ .

Proving sufficiency directly via conditional probabilities can be tedious. Instead, we use the Fisher-Neyman Factorization Theorem: A statistic $T(\mathbf{X})$ is sufficient for $\theta$ if and only if the joint PDF (or PMF) of the sample can be factored into two components: $f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}) \mid \theta) \cdot h(\mathbf{x})$ where $h(\mathbf{x})$ is a non-negative function that depends only on the data, and $g(T(\mathbf{x}) \mid \theta)$ is a non-negative function that depends on the parameter $\theta$ and the data $\mathbf{x}$ strictly through the statistic $T(\mathbf{x})$ .

The Rao-Blackwell Theorem

Sufficiency plays a vital role in optimal estimation. The Rao-Blackwell theorem formalizes this: if you have an unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$ , the conditional expectation $\mathbb{E}[\hat{\theta} \mid T]$ defines a new estimator that is also unbiased and has a variance less than or equal to the variance of the original estimator $\hat{\theta}$ . Consequently, the search for optimal unbiased estimators can be restricted to functions of a sufficient statistic.

The Cramér-Rao Lower Bound (CRLB)

When developing estimators, mathematical statisticians want to know the absolute best possible variance an unbiased estimator can achieve. Does an absolute limit exist, beyond which no estimator can improve?

Yes, under regularity conditions (primarily that the parameter space is an open interval and the support does not depend on $\theta$ ), the Cramér-Rao Lower Bound places a theoretical lower limit on the variance of any unbiased estimator $W(\mathbf{X})$ of a parameter $\tau(\theta)$ : $\text{Var}_\theta(W(\mathbf{X})) \ge \frac{[\tau'(\theta)]^2}{n I(\theta)}$ where $I(\theta)$ is the Fisher Information defined as: $I(\theta) = \mathbb{E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \ln f(X \mid \theta) \right)^2 \right] = -\mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta^2} \ln f(X \mid \theta) \right]$

If the variance of an unbiased estimator exactly equals the Cramér-Rao lower bound, it is deemed efficient (simultaneously proving it is the Uniformly Minimum Variance Unbiased Estimator - UMVUE). As noted earlier, Maximum Likelihood Estimators asymptotically achieve this lower bound, which helps explain their prevalence in modern statistics.
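
A Bernoulli model gives a case where the bound is attained exactly. The sketch below computes $I(p)$ by enumerating the two outcomes, then compares the bound $1/(n I(p))$ for $\tau(p) = p$ with the variance of the sample mean, $p(1-p)/n$ . The values $p = 3/10$ and $n = 50$ are arbitrary, and exact fractions are used so that equality can be tested directly.

```python
from fractions import Fraction

def bernoulli_fisher_information(p):
    """I(p) = E[(d/dp ln f(X | p))^2] for X ~ Bernoulli(p), by direct enumeration."""
    score_at_1 = 1 / p            # d/dp ln p
    score_at_0 = -1 / (1 - p)     # d/dp ln (1 - p)
    return p * score_at_1 ** 2 + (1 - p) * score_at_0 ** 2

p, n = Fraction(3, 10), 50          # assumed parameter value and sample size
info = bernoulli_fisher_information(p)
crlb = 1 / (n * info)               # tau(p) = p, so tau'(p) = 1
var_sample_mean = p * (1 - p) / n   # variance of the unbiased estimator X-bar

print(info, crlb, var_sample_mean, crlb == var_sample_mean)
```

Since the unbiased estimator $\bar{X}$ has variance equal to the bound, it is efficient for $p$ in this model.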

Confidence Interval Construction

While point estimators output a single best guess for a parameter ( $\hat{\theta}$ ), interval estimators yield a range of plausible values constructed such that the random interval covers the true parameter $\theta$ with a specified probability $1-\alpha$ , referred to as the confidence level.

Formally, a $1-\alpha$ confidence interval for $\theta$ is defined by two random variables $L(\mathbf{X})$ and $U(\mathbf{X})$ such that: $P_\theta\left( L(\mathbf{X}) \le \theta \le U(\mathbf{X}) \right) \ge 1 - \alpha \quad \forall \theta \in \Theta$

The Pivot Method

The most common technique to systematically derive confidence intervals relies on finding a pivotal quantity (or “pivot”). A random variable $Q(\mathbf{X}; \theta)$ is a pivot if:

It is a function of the sample $\mathbf{X}$ and the unknown parameter $\theta$ .
The probability distribution of $Q(\mathbf{X}; \theta)$ does not depend on $\theta$ or on any other unknown parameter.

If a pivot exists, constructing an interval estimator proceeds straightforwardly by finding constants $q_{\alpha/2}$ and $q_{1-\alpha/2}$ from the known distribution of $Q$ such that: $P\left( q_{\alpha/2} \le Q(\mathbf{X}; \theta) \le q_{1-\alpha/2} \right) = 1 - \alpha$ . We then algebraically invert the inequalities inside the probability statement to isolate $\theta$ in the center: $P\left( L(\mathbf{X}) \le \theta \le U(\mathbf{X}) \right) = 1 - \alpha$

Deriving a Normal Confidence Interval via a Pivot

We have a sample $X_1, \dots, X_n$ from a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ where variance $\sigma^2$ is known and we must determine a $1-\alpha$ confidence interval for the mean $\mu$.

Define a suitable pivotal quantity and construct the confidence interval.

Asymptotic Confidence Intervals

When finite-sample pivot methods are intractable, statisticians use the asymptotic distribution of the Maximum Likelihood Estimator to construct approximate confidence regions. Since $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ and the unknown $I(\theta)$ can be consistently estimated by the plug-in value $I(\hat{\theta}_{MLE})$ (the Fisher Information evaluated at the MLE), we use the asymptotic standard error: $SE(\hat{\theta}_{MLE}) \approx \frac{1}{\sqrt{n \cdot I(\hat{\theta}_{MLE})}}$ . This yields the standard large-sample Wald confidence interval of the form: $\hat{\theta}_{MLE} \pm z_{\alpha/2} \cdot SE(\hat{\theta}_{MLE})$
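
As an illustration of the Wald recipe, consider a Poisson rate $\lambda$ . The MLE is the sample mean, the per-observation Fisher Information is $I(\lambda) = 1/\lambda$ , and so $SE(\hat{\lambda}) \approx \sqrt{\hat{\lambda}/n}$ . The Poisson model, the function name, and the counts below are assumptions chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

def poisson_wald_interval(counts, alpha=0.05):
    """Wald interval for a Poisson rate.

    MLE: lambda_hat = sample mean. Per-observation Fisher information is
    I(lambda) = 1/lambda, so SE = 1/sqrt(n * I(lambda_hat)) = sqrt(lambda_hat / n).
    """
    n = len(counts)
    lam_hat = sum(counts) / n
    se = sqrt(lam_hat / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return lam_hat - z * se, lam_hat + z * se

# Hypothetical counts (e.g. events per hour over 20 hours).
counts = [3, 5, 2, 4, 6, 3, 4, 2, 5, 4, 3, 7, 4, 3, 5, 2, 4, 6, 3, 5]
low, high = poisson_wald_interval(counts)
print(round(low, 3), round(high, 3))
```

Because the interval rests on a large-sample approximation, its actual coverage can differ from $1 - \alpha$ when $n$ is small or $\lambda$ is close to zero.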

Why can it be problematic to state 'There is a 95% probability that the true parameter lies between 4.2 and 5.8' after substituting data to calculate a 95% confidence interval [4.2, 5.8] from a given dataset?

Summary of Estimator Selection

Modern inference requires balancing several competing considerations:

Can we find an exact pivot for an interval, or must we rely on large sample sizes and Wald intervals?
Is the finite-sample bias of the MLE negligible at the available sample size?
For complicated distributions where the MLE has no closed-form solution, the Method of Moments estimate can serve as a starting value for numerical maximization of the likelihood function.

Statistical inference provides the foundation for drawing rigorous conclusions from noisy data under uncertainty.

Bayesian Statistics

Frequentist statistics interprets probability strictly as the long-run expected frequency of repeatable events. Bayesian statistics interprets probability fundamentally differently: as a degree of belief or a quantification of uncertainty. The Bayesian paradigm provides a rigorous mathematical framework for evaluating and updating our state of knowledge as new data becomes available.

The Foundation: Bayes’ Theorem

The core operating principle of Bayesian inference is Bayes’ Theorem, a mathematical identity derived from the definition of conditional probability:

$P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}$

Where:

$P(H|D)$ (Posterior): The probability of the hypothesis $H$ after observing data $D$ . This represents the updated state of belief.
$P(D|H)$ (Likelihood): The probability of observing the data $D$ assuming the hypothesis $H$ is true. This quantifies the evidence generated by the data.
$P(H)$ (Prior): The initial degree of belief in the hypothesis $H$ before observing the data $D$ .
$P(D)$ (Evidence or Marginal Likelihood): The total probability of observing the data across all possible hypotheses. It acts as a normalizing constant to ensure the posterior is a valid probability distribution: $P(D) = \int P(D|H)P(H)dH$ .

Because the denominator $P(D)$ does not depend on $H$ , Bayes’ theorem is often written as a proportionality:

$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$

Frequentist vs. Bayesian Comparison

The differences between the two schools of thought run deep, impacting how inference is conducted and interpreted.

Parameters: In frequentist statistics, parameters (like the true mean $\mu$ of a population) are fixed but unknown constants. In Bayesian statistics, parameters are treated as random variables described by probability distributions.
Data: Frequentists view the observed data as one possible realization from an infinite sequence of hypothetical repetitions. Bayesians treat the observed data as fixed and use it to calculate the probability of the parameter taking on various values.
Confidence Intervals vs. Credible Intervals: A frequentist 95% confidence interval means that if the experiment were repeated infinitely, 95% of the constructed intervals would contain the fixed parameter. A Bayesian 95% credible interval directly means there is a 95% probability that the parameter lies within that interval, given the observed data and prior belief.

The Role and Selection of Priors

The choice of the prior distribution $P(H)$ is a critical and sometimes criticized aspect of Bayesian analysis. Priors encode expert knowledge and initial assumptions.

Informative vs. Uninformative Priors

An informative prior asserts specific, strong beliefs about the parameter space. For example, if measuring human height, a prior tightly clustered around $1.7$ meters is highly informative. An uninformative (or diffuse) prior spreads probability mass across the parameter space, attempting to let the data “speak for itself.” A uniform distribution is a common example, though true non-informativeness is mathematically subtle.

Conjugate Priors

A prior is conjugate to a specific likelihood function if the resulting posterior distribution belongs to the same probability family as the prior. Conjugacy provides immense mathematical convenience because the posterior can be derived algebraically without complex numerical integration.

Examples of natural conjugate pairs include:

Beta Prior & Binomial Likelihood $\rightarrow$ Beta Posterior. (Used for probabilities and proportions).
Normal Prior & Normal Likelihood (known variance) $\rightarrow$ Normal Posterior. (Used for continuous mean estimation).
Gamma Prior & Poisson Likelihood $\rightarrow$ Gamma Posterior. (Used for rate parameter estimation).

Consider the Beta-Binomial model. If the prior for the probability of success $\theta$ is $\text{Beta}(\alpha, \beta)$ and the newly observed data $D$ contains $y$ successes and $n-y$ failures, the posterior is simply:

$\theta \mid y \sim \text{Beta}(\alpha + y, \beta + n - y)$

Jeffreys Prior

When seeking an uninformative prior, a flat uniform distribution can be problematic because it is not invariant under parameter transformations (e.g., a uniform prior on the standard deviation $\sigma$ is not uniform on the variance $\sigma^2$ ). The Jeffreys Prior solves this by deriving the prior directly from the Fisher Information $I(\theta)$ of the likelihood function:

$P(\theta) \propto \sqrt{\det(I(\theta))}$

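This makes the prior invariant under reparameterization: applying the same rule after any smooth one-to-one change of variables yields the same distribution, so the prior does not depend on how the parameter is expressed mathematically.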
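
As a brief worked example (a standard textbook case, not taken from the text above), consider a single Bernoulli trial $y \in \{0, 1\}$ with success probability $\theta$ . The log-likelihood and its Fisher Information are:

$\log p(y \mid \theta) = y \log \theta + (1 - y) \log(1 - \theta)$

$I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log p(y \mid \theta) \right] = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)}$

so the Jeffreys Prior is $P(\theta) \propto \theta^{-1/2} (1 - \theta)^{-1/2}$ , which is the kernel of a $\text{Beta}(1/2, 1/2)$ distribution rather than the flat $\text{Beta}(1, 1)$ .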

Computational Bayesian Inference: MCMC and Gibbs Sampling

Historically, the difficulty of computing the normalizing constant $P(D)$ analytically restricted Bayesian methods to conjugate models. The advent of modern computing and Markov Chain Monte Carlo (MCMC) algorithms revolutionized Bayesian statistics, allowing inference on virtually any model.

Markov Chain Monte Carlo

MCMC algorithms do not attempt to calculate the posterior distribution analytically. Instead, they generate a large number of correlated samples whose distribution converges to the posterior. By analyzing these samples (e.g., taking their mean, variance, or percentiles), we can estimate the properties of the posterior distribution.

The algorithm constructs a Markov Chain—a sequence of states where the next state depends only on the current state—designed such that its stationary distribution is exactly the target posterior distribution.

Gibbs Sampling

A specialized and highly effective MCMC algorithm for multi-dimensional parameter spaces is Gibbs Sampling. Instead of trying to update all parameters $\theta_1, \theta_2, \ldots, \theta_k$ simultaneously, Gibbs sampling updates one parameter at a time by sampling from its conditional distribution, keeping all other parameters fixed at their current values.

Let $\theta = (\theta_1, \theta_2, \theta_3)$ . A Gibbs step involves:

Sample $\theta_1^{(i+1)}$ from $P(\theta_1 | \theta_2^{(i)}, \theta_3^{(i)}, D)$
Sample $\theta_2^{(i+1)}$ from $P(\theta_2 | \theta_1^{(i+1)}, \theta_3^{(i)}, D)$
Sample $\theta_3^{(i+1)}$ from $P(\theta_3 | \theta_1^{(i+1)}, \theta_2^{(i+1)}, D)$

This iterative process vastly simplifies the sampling problem because the one-dimensional conditional distributions are often well-known and easy to sample from, even when the joint multidimensional posterior is impossibly complex.
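
The alternating conditional updates can be sketched on a toy target. The example below is a minimal illustration, assuming a two-parameter target that is a standard bivariate normal with correlation `rho` (a stand-in for a posterior, chosen because its conditionals are known: each coordinate given the other is normal with mean `rho` times the other coordinate and variance `1 - rho**2`). The function name, burn-in length, and seed are illustrative choices.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5  # std. dev. of each full conditional
    x, y = 0.0, 0.0                    # arbitrary starting state
    samples = []
    for i in range(burn_in + n_samples):
        # Update one coordinate at a time, conditioning on the latest value
        # of the other coordinate.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:
            samples.append((x, y))
    return samples

def sample_correlation(xs, ys):
    """Plain sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
xs = [s[0] for s in samples]
ys = [s[1] for s in samples]
mean_x = sum(xs) / len(xs)
corr = sample_correlation(xs, ys)
print(f"mean of x: {mean_x:.3f}, sample correlation: {corr:.3f}")
```

The sample mean should be close to 0 and the sample correlation close to `rho`, although successive draws are correlated, so the effective sample size is smaller than the number of draws.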

The Medical Test Paradox

You are a doctor administering a test for a rare genetic marker present in 0.1% (p=0.001) of the population. The test's sensitivity (true positive rate) is 99% (P(Positive|Marker) = 0.99). The test's specificity (true negative rate) is 98%, meaning the false positive rate is 2% (P(Positive|No Marker) = 0.02). A patient receives a positive test result. The patient immediately asks: 'What is the probability I actually have the marker?'

Calculate the Posterior probability that the patient has the genetic marker given the positive result.
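
One way to check the calculation is to apply Bayes' theorem directly, with the evidence expanded over the two hypotheses (marker, no marker). The variable names below are illustrative; the numbers are those stated in the problem.

```python
prior = 0.001                 # P(Marker)
sensitivity = 0.99            # P(Positive | Marker)
false_positive_rate = 0.02    # P(Positive | No Marker)

# Evidence: total probability of a positive result across both hypotheses.
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = sensitivity * prior / evidence
print(f"P(Marker | Positive) = {posterior:.4f}")
```

Despite the accurate test, the posterior is only about 4.7%, because false positives from the large marker-free population outnumber true positives.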

Implementation: Bayesian Continuous Updating

The Beta prior, which is conjugate to the binomial likelihood, models the continuous updating of beliefs about a coin’s hidden fairness parameter: the posterior from one experiment becomes the prior for the next.
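
A minimal sketch of this updating loop follows. It assumes a uniform $\text{Beta}(1, 1)$ starting prior, a hypothetical true bias of 0.7, and five experiments of 20 flips each; `update_beta` is an illustrative helper, not a library function.

```python
import random

def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior."""
    return alpha + successes, beta + failures

rng = random.Random(42)
true_theta = 0.7            # hypothetical hidden bias, used only to simulate data
alpha, beta = 1.0, 1.0      # Beta(1, 1): uniform prior
total_heads = 0

for experiment in range(1, 6):
    flips = [rng.random() < true_theta for _ in range(20)]
    heads = sum(flips)
    total_heads += heads
    # The posterior from the previous experiment is the prior for this one.
    alpha, beta = update_beta(alpha, beta, heads, len(flips) - heads)
    mean = alpha / (alpha + beta)
    print(f"experiment {experiment}: Beta({alpha:.0f}, {beta:.0f}), "
          f"posterior mean = {mean:.3f}")
```

Updating in five batches gives the same posterior as a single update with all 100 flips, which is a direct consequence of the additive form of the Beta-Binomial rule.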

Exercises

In the context of Bayesian statistics, what is the defining characteristic of a Conjugate Prior?

How does Gibbs Sampling simplify the process of evaluating a complex, high-dimensional posterior distribution?

Which interpretation correctly identifies a key difference between frequentist Confidence Intervals and Bayesian Credible Intervals?

Markov Chains

A Markov Chain is a mathematical system that undergoes transitions from one state to another on a state space. It is a stochastic process characterized by the Markov property: the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it.

Formally, a stochastic process $\{X_n : n \in \mathbb{N}_0\}$ is a Markov chain if, for all $n \ge 0$ and any sequence of states $i_0, i_1, \dots, i_{n-1}, i, j$ , the following equality holds:

$\mathbb{P}(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = \mathbb{P}(X_{n+1} = j \mid X_n = i)$

This fundamental property states that the entire history of the process is encapsulated in its current state $X_n=i$ . This drastically simplifies the study of complex systems, reducing an infinite-dimensional dependency into a single-step conditional probability. Discrete and continuous-time variants form the backbone of modern stochastic modeling, encompassing applications ranging from simple queuing systems to complex financial models and molecular dynamics.

Discrete-Time Markov Chains (DTMC)

A Discrete-Time Markov Chain operates with a discrete time parameter $n \in \{0, 1, 2, \dots\}$ . The set of possible values for the random variables $X_n$ forms a countable set $S$ , called the state space. The probability of moving from state $i$ to state $j$ in one time step is given by the transition probability $p_{ij}$ , defined as:

$p_{ij} = \mathbb{P}(X_{n+1} = j \mid X_n = i)$

When these transition probabilities are independent of the time step $n$ , the Markov chain is said to be time-homogeneous. We will strictly focus on time-homogeneous chains, as their structure permits robust long-term behavioral analysis.

Transition Matrices

For a state space containing a finite number of states (or countably infinite), the one-step transition probabilities $p_{ij}$ are arranged in a matrix $P$ , called the transition matrix:

$P = \begin{pmatrix} p_{00} & p_{01} & p_{02} & \cdots \\ p_{10} & p_{11} & p_{12} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$

This matrix has two vital properties:

$p_{ij} \ge 0$ for all $i, j \in S$ .
$\sum_{j \in S} p_{ij} = 1$ for all $i \in S$ .

Every row describes a probability distribution, making $P$ a stochastic matrix. If the initial distribution of the chain is a row vector $\pi^{(0)}$ (where $\pi_i^{(0)} = \mathbb{P}(X_0 = i)$ ), the distribution after one step is $\pi^{(1)} = \pi^{(0)} P$ . By induction, the probability distribution of the state after $n$ steps is given by $\pi^{(n)} = \pi^{(0)} P^n$ . Each entry of $P^n$ is the sum over all paths of length $n$ between the two states, with each path weighted by its probability.
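
The relation $\pi^{(n)} = \pi^{(0)} P^n$ can be checked numerically. The sketch below assumes a hypothetical 3-state transition matrix and uses NumPy; it propagates the distribution one step at a time and compares the result with a single multiplication by the matrix power.

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
pi0 = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty

# Step-by-step propagation: pi^(k+1) = pi^(k) P.
pi_step = pi0.copy()
for _ in range(5):
    pi_step = pi_step @ P

# Direct computation: pi^(5) = pi^(0) P^5.
pi_power = pi0 @ np.linalg.matrix_power(P, 5)

print(pi_step)
print(pi_power)
```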

$n$ -Step Transition Probabilities

The $n$ -step transition probability is the probability that a process currently in state $i$ will be in state $j$ exactly $n$ steps later:

$p_{ij}^{(n)} = \mathbb{P}(X_{n+k} = j \mid X_k = i)$

For $n=1$ , $p_{ij}^{(1)} = p_{ij}$ . For $n=0$ , $p_{ij}^{(0)}$ is $1$ if $i=j$ and $0$ otherwise.

If the transition matrix $P$ of a 3-state Markov chain has row sums of 1, what must be true about the row sums of $P^2$?

Chapman-Kolmogorov Equations

The computation of $n$ -step transition probabilities is fundamentally governed by the Chapman-Kolmogorov equations. These equations provide a rigorous method for computing the probability of moving from state $i$ to state $j$ in $n+m$ steps by conditioning on the intermediate state $k$ attained after $n$ steps:

$p_{ij}^{(n+m)} = \sum_{k \in S} p_{ik}^{(n)} p_{kj}^{(m)}$

In matrix notation, this corresponds exactly to the multiplication of powers of the transition matrix: Let $P^{(n)}$ be the matrix whose entries are $p_{ij}^{(n)}$ . Then $P^{(n+m)} = P^{(n)} P^{(m)}$ . Consequently, $P^{(n)} = P^n$ . The equation elegantly states that the transition matrix for $n$ steps is the $n$ -th power of the 1-step transition matrix.

Classification of States

The long-term behavior of a Markov chain is heavily dependent on the communication structure and the topological arrangement of its state space.

Accessibility and Communication

State $j$ is accessible from state $i$ (denoted $i \to j$ ) if there exists an integer $n \ge 0$ such that $p_{ij}^{(n)} > 0$ . Simply put, there is a path of non-zero probability from $i$ to $j$ .
States $i$ and $j$ communicate (denoted $i \leftrightarrow j$ ) if $i \to j$ and $j \to i$ .

Communication is an equivalence relation (it is reflexive, symmetric, and transitive), which partitions the state space into disjoint communication classes. If a Markov chain has only one communication class—meaning every state is accessible from every other state—it is called irreducible.

Recurrent and Transient States

Let $f_{ij}^{(n)}$ denote the probability that the first transition into state $j$ (starting from $i$ ) occurs exactly at step $n$ : $f_{ij}^{(n)} = \mathbb{P}(X_n = j, X_k \neq j \text{ for } k = 1, \dots, n-1 \mid X_0 = i)$

Let $f_{ij} = \sum_{n=1}^\infty f_{ij}^{(n)}$ be the probability of ever reaching state $j$ given that the chain started in state $i$ . The parameter $f_{ii}$ is therefore the probability of ever returning to state $i$ given that the chain started in state $i$ .

A state $i$ is recurrent if $f_{ii} = 1$ . A recurrent state will be visited infinitely many times with probability $1$ .
A state $i$ is transient if $f_{ii} < 1$ . A transient state will be visited only a finite number of times with probability $1$ .

A state is recurrent if and only if the expected number of returns to that state is infinite: $\sum_{n=1}^\infty p_{ii}^{(n)} = \infty$ . It is transient if and only if $\sum_{n=1}^\infty p_{ii}^{(n)} < \infty$ . Every finite Markov chain has at least one recurrent state, though an infinite state space may consist entirely of transient states (e.g., a simple random walk on $\mathbb{Z}^3$ ).

Periodicity

The period $d(i)$ of a state $i$ is defined as the greatest common divisor (GCD) of the set of numbers of steps $n$ for which a return to state $i$ is possible: $d(i) = \gcd \{ n \ge 1 : p_{ii}^{(n)} > 0 \}$

If $d(i) = 1$ , the state is aperiodic. Returns can occur at irregular intervals without a fixed rigid period.
If $d(i) > 1$ , the state is periodic with period $d$ .

Periodicity is a class property: all states in the same communication class have the same period. In particular, every state of an irreducible chain has the same period.

Ergodic States

A state $i$ is positive recurrent if it is recurrent and its expected return time $m_i$ is finite: $m_i = \sum_{n=1}^\infty n f_{ii}^{(n)} < \infty$ . If a state is positive recurrent and aperiodic, it is classified as ergodic. An irreducible Markov chain is called ergodic if all its states are ergodic. Ergodicity guarantees that the chain eventually “forgets” its initial state and settles into a stable equilibrium distribution.

Stationary distributions

When an ergodic Markov chain runs for a sufficiently long time, its distribution approaches a steady state that does not depend on the starting state. This limiting distribution coincides with the chain's stationary distribution, denoted by a row vector $\pi$ .

A probability distribution $\pi$ is a stationary distribution if:

$\pi_j \ge 0$ for all $j \in S$ .
$\sum_{j \in S} \pi_j = 1$ .
$\pi P = \pi$ .

The condition $\pi = \pi P$ indicates that if you start the chain randomly by picking the initial state according to the distribution $\pi$ , the state distribution at any subsequent step remains exactly $\pi$ .

For an irreducible, aperiodic, and positive recurrent (i.e., ergodic) Markov chain, a unique stationary distribution $\pi$ exists, and the fundamental limit theorem applies:

$\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j \quad \text{for all } i, j \in S$

Furthermore, the stationary probability equals the reciprocal of the expected return time: $\pi_j = 1/m_j$ . This links the long-run limits of the transition probabilities to the return-time behavior of the chain.
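
A small numerical sketch of these statements, assuming a hypothetical irreducible and aperiodic 3-state chain: the stationary distribution is obtained by solving $\pi P = \pi$ together with the normalization $\sum_j \pi_j = 1$ as a least-squares system, and the rows of a high power of $P$ are compared with it.

```python
import numpy as np

# Hypothetical ergodic chain (irreducible, with self-loops so it is aperiodic).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi^T = 0; append the constraint sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Every row of P^n approaches pi as n grows (the limit theorem).
P_limit = np.linalg.matrix_power(P, 200)

# Expected return times from pi_j = 1 / m_j.
mean_return_times = 1.0 / pi

print("stationary distribution:", pi)
print("mean return times:", mean_return_times)
```

For this particular matrix the solution is $\pi = (0.6, 0.3, 0.1)$ , so the chain returns to state 2 on average once every 10 steps.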

The Gambler's Ruin

A gambler plays a fair game in which, at each step, they win \$1 with probability $0.5$ and lose \$1 with probability $0.5$ . The gambler starts with $\$a$ and the game ends when their capital reaches $0$ (ruin) or a predetermined target value $\$N$ (success). This process can be modeled as a discrete-time Markov chain with state space $S = \{0, 1, 2, \dots, N\}$ where states $0$ and $N$ represent the termination of the game.

In terms of the classification of states: are the transient states guaranteed to be left forever, and how are states $0$ and $N$ classified?
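
To experiment with this chain, its transition matrix can be built explicitly. The sketch below assumes $N = 5$ and treats the two terminal states as staying put once reached; it then computes, for each starting capital $a$ , the probability of reaching $N$ before $0$ by solving the first-step equations $h_a = 0.5\, h_{a-1} + 0.5\, h_{a+1}$ with $h_0 = 0$ and $h_N = 1$ . The variable names are illustrative.

```python
import numpy as np

N = 5
P = np.zeros((N + 1, N + 1))
P[0, 0] = 1.0            # game over at 0: the chain stays there
P[N, N] = 1.0            # game over at N: the chain stays there
for a in range(1, N):
    P[a, a - 1] = 0.5    # lose $1
    P[a, a + 1] = 0.5    # win $1

interior = list(range(1, N))
T = P[np.ix_(interior, interior)]   # moves among the interior states 1..N-1
r = P[interior, N]                  # one-step probability of jumping to N

# First-step analysis in matrix form: h = T h + r  =>  (I - T) h = r.
h = np.linalg.solve(np.eye(N - 1) - T, r)

for a, prob in zip(interior, h):
    print(f"start with ${a}: P(reach ${N} before $0) = {prob:.2f}")
```

For the fair game the result is $h_a = a/N$ , so the success probability grows linearly with the starting capital.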

Deep Dive into Continuous-Time Markov Chains (CTMC)

While discrete-time Markov chains describe systems that transition at fixed time steps, many real-world stochastic processes change state at random times on the continuous axis $t \in [0, \infty)$ . Such processes are modeled as Continuous-Time Markov Chains (CTMC).

A stochastic process $\{X(t) : t \ge 0\}$ defined on a discrete state space $S$ is a CTMC if it satisfies the strict continuous-time Markov property: $\mathbb{P}(X(t+s) = j \mid X(s) = i, X(u) \text{ for } 0 \le u < s) = \mathbb{P}(X(t+s) = j \mid X(s) = i)$

For a time-homogeneous CTMC, the transition probability only depends on the length of the time interval $t$ : $p_{ij}(t) = \mathbb{P}(X(s+t) = j \mid X(s) = i)$

Holding Times and Transition Rates

When a CTMC enters a state $i$ , the amount of time it spends in that state before making a sudden transition—called the holding time or sojourn time—strictly follows an exponential distribution with a rate parameter $q_i$ (often denoted $v_i$ or $\lambda_i$ ).

Why an exponential distribution? The exponential distribution is the only continuous probability distribution with the memoryless property. The Markov assumption requires that the time already spent in a state carry no information about the remaining time to be spent in that state.

When the process inevitably leaves state $i$ , the probability it transitions specifically to state $j$ is independent of the holding time and is denoted by the transition probability $p_{ij}$ , where $\sum_{j \neq i} p_{ij} = 1$ and $p_{ii} = 0$ .

Equivalently, one specifies the unnormalized transition rates $q_{ij}$ , defined precisely as the rate at which the continuous process transitions from state $i$ to state $j$ : $q_{ij} = q_i p_{ij} \quad \text{for } i \neq j$

These transition rates are compactly arranged in the generator matrix (or infinitesimal generator) $Q$ , whose scalar elements are given by:

$Q_{ij} = q_{ij}$ for $i \neq j$
$Q_{ii} = -q_i = -\sum_{j \neq i} q_{ij}$

Because each diagonal entry is the negative of the sum of the off-diagonal entries in its row, every row of the generator matrix $Q$ sums to $0$ : $\sum_{j \in S} Q_{ij} = 0$

The Kolmogorov Forward and Backward Equations

In discrete time, matrices multiply simply via algebraic powers $P^{(n)} = P^n$ . In continuous time, the transition matrices $P(t) = \{p_{ij}(t)\}$ satisfy systems of coupled linear differential equations instead of algebraic relations, linking the finite time transition probabilities to the instantaneous transition rates mathematically encoded in the matrix $Q$ .

Kolmogorov Backward Equations: $\frac{d}{dt} P(t) = Q P(t)$ . Component-wise, this expands to $\frac{d}{dt} p_{ij}(t) = \sum_k Q_{ik} p_{kj}(t)$ , where the sum includes the diagonal entry $Q_{ii} = -q_i$ . These differential equations are obtained by conditioning on the first transition out of the initial state.

Kolmogorov Forward Equations: $\frac{d}{dt} P(t) = P(t) Q$ . Component-wise, this reads $\frac{d}{dt} p_{ij}(t) = \sum_k p_{ik}(t) Q_{kj}$ . The forward equations are obtained by conditioning on the final transition immediately preceding time $t$ .

Under suitable regularity conditions (which always hold for finite state spaces), the solution to these initial value problems (with initial condition $P(0) = I$ , the identity matrix) is the matrix exponential: $P(t) = e^{Qt} = \sum_{n=0}^\infty \frac{(Qt)^n}{n!}$
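
The matrix exponential can be evaluated for a small example. The sketch below assumes a hypothetical two-state chain with rate `a` from state 0 to 1 and rate `b` from state 1 to 0, and sums a truncated version of the power series above (adequate for small matrices and moderate $t$ ; production code would normally use a dedicated routine). The result is compared with the known closed form $p_{00}(t) = \frac{b}{a+b} + \frac{a}{a+b} e^{-(a+b)t}$ for this two-state case.

```python
import numpy as np

def expm_series(A, terms=60):
    """Truncated power series sum_{n < terms} A^n / n!."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n      # builds A^n / n! from A^(n-1) / (n-1)!
        result = result + term
    return result

a, b = 2.0, 1.0                  # hypothetical transition rates
Q = np.array([[-a, a],
              [b, -b]])          # generator: each row sums to 0
t = 0.5

P_t = expm_series(Q * t)
p00_closed = b / (a + b) + a / (a + b) * np.exp(-(a + b) * t)

print(P_t)
print("closed-form p_00(t):", p00_closed)
```

Each row of `P_t` is a probability distribution, as expected of a transition matrix.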

Stationary Distributions in CTMCs

As in DTMCs, under suitable irreducibility and positive-recurrence assumptions, a continuous-time Markov chain possesses a unique stationary distribution $\pi$ that gives the long-run proportion of time the process spends in each state.

However, the condition $\pi = \pi P$ is replaced by an equilibrium condition expressing zero net rate of probability flux: $\pi Q = 0$

Here, $\pi$ remains a normalized probability vector with $\sum \pi_i = 1$ . The matrix equation $\pi Q = 0$ is equivalent to the global balance equations, which state that the total probability flux leaving state $j$ equals the total probability flux entering state $j$ from all other states:

$\pi_j q_j = \sum_{i \neq j} \pi_i q_{ij}$
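
The balance equations can be solved numerically for a small generator. The sketch below assumes a hypothetical 3-state birth-death generator and solves $\pi Q = 0$ with the normalization constraint as a least-squares system, then checks that outflow equals inflow for every state.

```python
import numpy as np

# Hypothetical generator matrix (off-diagonal rates, rows sum to 0).
Q = np.array([[-1.0,  1.0,  0.0],
              [ 2.0, -3.0,  1.0],
              [ 0.0,  2.0, -2.0]])
n = Q.shape[0]

# pi Q = 0  <=>  Q^T pi^T = 0; append the constraint sum(pi) = 1.
A = np.vstack([Q.T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Global balance: pi_j q_j (flux out of j) vs. sum_{i != j} pi_i q_ij (flux in).
outflow = pi * (-np.diag(Q))
inflow = pi @ Q - pi * np.diag(Q)

print("stationary distribution:", pi)
print("outflow:", outflow)
print("inflow: ", inflow)
```

For this generator the solution is $\pi = (4/7, 2/7, 1/7)$ .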

This flux balance principle is foundational to queuing theory, stochastic chemical reaction networks, and biological population models, where it is used to compute long-run performance measures of continuous-time systems.

Stochastic Processes

A stochastic process is a collection of random variables defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$ , indexed by a totally ordered set $T$ (usually representing time). Formally, a stochastic process is written as $X = \{X_t : t \in T\}$ , where for each $t \in T$ , $X_t$ is an $\mathcal{F}$ -measurable function mapping $\Omega \to S$ for a measurable state space $(S, \mathcal{S})$ .

When $T = \mathbb{N}$ or $\mathbb{Z}$ , the process is cast as a discrete-time stochastic process. If $T = [0, \infty)$ or $T \subset \mathbb{R}$ , it represents a continuous-time stochastic process. The state space $S$ determines whether the process is discrete-state (e.g., integer values) or continuous-state (e.g., real-valued).

Filtrations and Information

To rigorously describe the evolution of a stochastic process, it is essential to capture the accumulation of information over time. This is formalized by a filtration $\mathbb{F} = \{\mathcal{F}_t\}_{t \in T}$ , which is an increasing family of sub- $\sigma$ -algebras of $\mathcal{F}$ . That is, $\mathcal{F}_s \subseteq \mathcal{F}_t \subseteq \mathcal{F}$ for all $s \leq t$ .

The intuitive interpretation of $\mathcal{F}_t$ is the “history” or the “available information” up to time $t$ . A stochastic process $\{X_t\}_{t \in T}$ is said to be adapted to the filtration $\mathbb{F}$ if, for every $t \in T$ , the random variable $X_t$ is $\mathcal{F}_t$ -measurable. This implies that if one observes the state of the universe up to time $t$ , the value of $X_t$ is completely known.

Martingales

Martingales constitute one of the most fundamental classes of stochastic processes, generalizing the concept of a “fair game” where knowledge of past events never helps predict expected future winnings.

Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, \mathbb{P})$ be a filtered probability space. A real-valued stochastic process $\{M_t\}_{t \in T}$ is a martingale with respect to the filtration $\{\mathcal{F}_t\}$ and probability measure $\mathbb{P}$ if it satisfies the following three conditions:

Adaptedness: $M_t$ is $\mathcal{F}_t$ -measurable for all $t$ .
Integrability: $\mathbb{E}[|M_t|] < \infty$ for all $t$ (i.e., $M_t \in L^1(\mathbb{P})$ ).
Martingale Property: For all $s \leq t$ , the conditional expectation satisfies: $\mathbb{E}[M_t \mid \mathcal{F}_s] = M_s \quad \text{almost surely (a.s.)}$

If the equality in the third condition is replaced with $\leq$ (or $\geq$ ), the process is termed a supermartingale (or submartingale). In a supermartingale, the expected future value is less than or equal to the current value (a losing game), whereas in a submartingale, it is greater than or equal to the current value (a winning game).

Discrete-Time Martingales

Consider a simple symmetric random walk $S_n = \sum_{i=1}^n X_i$ , where the increments $X_i$ are independent, identically distributed (i.i.d.) random variables with $\mathbb{P}(X_i = 1) = 1/2$ and $\mathbb{P}(X_i = -1) = 1/2$ . Let $\mathcal{F}_n = \sigma(X_1, \dots, X_n)$ be the natural filtration. Check that $S_n$ is a martingale:

$\mathbb{E}[S_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[S_n + X_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[S_n \mid \mathcal{F}_n] + \mathbb{E}[X_{n+1} \mid \mathcal{F}_n]$

Since $S_n$ is $\mathcal{F}_n$ -measurable, $\mathbb{E}[S_n \mid \mathcal{F}_n] = S_n$ . Since $X_{n+1}$ is independent of $\mathcal{F}_n$ , $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[X_{n+1}] = 0$ . Thus, $\mathbb{E}[S_{n+1} \mid \mathcal{F}_n] = S_n$ , proving $S_n$ is a discrete-time martingale.

Let $M_t$ be a martingale. Which of the following statements strictly describes its conditional expectation characteristic?

Stopping Times

In many practical and theoretical contexts, we are interested in evaluating models at random times (e.g., the time a stock hits a certain price or the time a gambler goes bankrupt). This gives rise to the concept of a stopping time.

A random variable $\tau: \Omega \to T \cup \{\infty\}$ is a stopping time (or Markov time) with respect to a filtration $\{\mathcal{F}_t\}$ if, for every $t \in T$ , the event $\{\tau \leq t\} \in \mathcal{F}_t$ . Intuitively, at any given time $t$ , one can determine whether the stopping time has occurred strictly based on the information available up to time $t$ . A stopping time cannot look into the future.

For a stochastic process $\{X_t\}$ , the first hitting time of a Borel set $B \in \mathcal{B}(\mathbb{R})$ is defined as: $\tau_B = \inf \{ t \geq 0 : X_t \in B \}$ . When the process is adapted with continuous paths and $B$ is a closed set, $\tau_B$ is guaranteed to be a stopping time; for paths that are only right-continuous, additional conditions on the filtration are needed.

Optional Stopping Theorem

Does evaluating a martingale at a stopping time $\tau$ preserve its expected value? In general, it might not. However, Doob’s Optional Stopping Theorem establishes the conditions under which the expected value at the stopping time equals the initial expected value, i.e., $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ .

Let $(M_n)_{n \geq 0}$ be a discrete-time martingale and $\tau$ be a stopping time with respect to the filtration $(\mathcal{F}_n)$ . Then $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ holds if any of the following conditions is satisfied:

The stopping time is bounded almost surely: $\mathbb{P}(\tau \leq N) = 1$ for some deterministic integer $N$ .
The stopping time has a finite expectation $\mathbb{E}[\tau] < \infty$ , and the increments are conditionally bounded: there exists $c > 0$ such that $\mathbb{E}[|M_{n+1} - M_n| \mid \mathcal{F}_n] \leq c$ a.s. on $\{\tau > n\}$ .
There exists a constant $C$ such that $|M_{n \wedge \tau}| \leq C$ almost surely for all $n$ , and $\tau < \infty$ almost surely.

This theorem highlights the impossibility of devising a systematic winning strategy in a fair game under bounded resources, which is why the classical “Martingale betting strategy” of doubling the stake after each loss cannot succeed with finite capital.

Brownian Motion (Wiener Process)

The Wiener process (or standard Brownian motion) is the fundamental continuous-time analog of the random walk. It drives modern financial theory, statistical mechanics, and continuous-state probability.

A standard one-dimensional Wiener process $W = \{W_t\}_{t \ge 0}$ is a stochastic process characterized by the following properties:

$W_0 = 0$ almost surely.
$W$ has independent increments: For any $0 \leq t_1 < t_2 < \dots < t_k$ , the random variables $W_{t_1}, W_{t_2} - W_{t_1}, \dots, W_{t_k} - W_{t_{k-1}}$ are independent.
$W$ has stationary normally distributed increments: For any $0 \leq s < t$ , the increment $W_t - W_s$ follows a normal distribution: $W_t - W_s \sim \mathcal{N}(0, t - s)$
The paths $t \mapsto W_t$ are almost surely continuous.

Despite being continuous everywhere, the path of a Brownian motion is almost surely nowhere differentiable. Its quadratic variation over the interval $[0,t]$ is exactly $t$ . That is, $\lim_{||\Pi|| \to 0} \sum_{i=0}^{n-1} (W_{t_{i+1}} - W_{t_i})^2 = t$ , where $\Pi = \{0 = t_0 < t_1 < \dots < t_n = t\}$ is a partition of $[0,t]$ with mesh $||\Pi||$ and the limit is taken in probability. This non-zero quadratic variation is the reason ordinary (Newton-Leibniz) calculus fails for such processes and necessitates a distinct stochastic calculus.
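
A minimal simulation sketch, assuming a uniform grid on $[0, 1]$ and NumPy's random generator: the path is built by summing independent $\mathcal{N}(0, \Delta t)$ increments, and the sum of squared increments is compared with $t = 1$ . The grid size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dt = T / n

# Independent increments W_{t+dt} - W_t ~ N(0, dt); the path starts at W_0 = 0.
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))

# Sum of squared increments approximates the quadratic variation (close to T).
quad_var = np.sum(np.diff(W) ** 2)

# Sum of absolute increments grows without bound as the grid is refined,
# reflecting how rough the path is.
first_var = np.sum(np.abs(np.diff(W)))

print(f"W_T = {W[-1]:.4f}")
print(f"sum of squared increments  = {quad_var:.4f}  (T = {T})")
print(f"sum of absolute increments = {first_var:.1f}")
```

Refining the grid (larger `n`) leaves the squared sum near `T` while the absolute sum keeps growing, which is the numerical signature of a continuous but nowhere differentiable path.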

Itô’s Lemma

Because Brownian motion has non-zero quadratic variation, the standard chain rule of differential calculus does not hold. Instead, we use Itô’s Calculus, anchored by Itô’s Lemma.

Let $X_t$ be an Itô drift-diffusion process satisfying the stochastic differential equation: $dX_t = \mu_t dt + \sigma_t dW_t$ where $W_t$ is a standard Wiener process, and $\mu_t, \sigma_t$ are adapted processes. Let $f(t,x)$ be a scalar function that is twice continuously differentiable in $x$ and once in $t$ (i.e., $f \in C^{1,2}([0, \infty) \times \mathbb{R})$ ).

By Itô’s Lemma, the process $Y_t = f(t, X_t)$ is also an Itô process whose differential is given by: $df(t, X_t) = \left( \frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{1}{2} \sigma_t^2 \frac{\partial^2 f}{\partial x^2} \right) dt + \sigma_t \frac{\partial f}{\partial x} dW_t$

The extra term $\frac{1}{2} \sigma_t^2 \frac{\partial^2 f}{\partial x^2} dt$ reflects the quadratic variation of $X_t$ , often formalized by the heuristic multiplication rules: $dt \cdot dt = 0, \quad dt \cdot dW_t = 0, \quad (dW_t)^2 = dt$
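
A simple case makes the correction term concrete. Taking $X_t = W_t$ (so $\mu_t = 0$ , $\sigma_t = 1$ ) and $f(t, x) = x^2$ , Itô’s Lemma gives $d(W_t^2) = dt + 2 W_t \, dW_t$ , and hence $\int_0^T W_s \, dW_s = \frac{1}{2}(W_T^2 - T)$ rather than the classical $\frac{1}{2} W_T^2$ . The sketch below checks this numerically on an assumed uniform grid, approximating the Itô integral by a left-endpoint sum.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))

# Itô integral of W dW: integrand evaluated at the LEFT endpoint of each step.
ito_integral = np.sum(W[:-1] * dW)

ito_prediction = 0.5 * (W[-1] ** 2 - T)   # from Itô's Lemma
classical_guess = 0.5 * W[-1] ** 2        # what the ordinary chain rule would give

print(f"left-endpoint sum : {ito_integral:.4f}")
print(f"Ito prediction    : {ito_prediction:.4f}")
print(f"classical guess   : {classical_guess:.4f}")
```

The left-endpoint sum tracks the Itô prediction and differs from the classical guess by about $T/2$ , which is exactly the contribution of the quadratic variation.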

Geometric Brownian Motion & Itô's Lemma

In quantitative finance, the standard model for a stock price $S_t$ assumes the proportional return $dS_t / S_t$ undergoes constant drift and volatility, modeled by the stochastic differential equation: $dS_t = \mu S_t dt + \sigma S_t dW_t$. To find the distribution of $S_t$, we need to solve this. Applying standard ODE techniques fails because of the $dW_t$ term. We must use Itô's lemma to transform the equation, commonly via the natural logarithm function.

Apply Itô's Lemma to the function $f(t, S_t) = \ln(S_t)$ where $dS_t = \mu S_t dt + \sigma S_t dW_t$. What is the resulting stochastic differential equation for $d(\ln S_t)$?

Stochastic Differential Equations (SDEs)

A Stochastic Differential Equation relates the continuous-time dynamics of a stochastic process to a deterministic drift part and a stochastic diffusion part. The general form is: $dX_t = b(t, X_t) dt + \sigma(t, X_t) dW_t$ . This equation is symbolic shorthand for the integral equation $X_t = X_0 + \int_0^t b(s, X_s) ds + \int_0^t \sigma(s, X_s) dW_s$ , where the first integral is a standard Lebesgue/Riemann integral and the second is an Itô stochastic integral.

Existence and Uniqueness

Much like Picard–Lindelöf for deterministic ODEs, there are conditions for the strong existence and uniqueness of solutions to SDEs. Under Lipschitz continuity and linear growth bounding conditions:

Lipschitz Condition: $|b(t, x) - b(t, y)| + |\sigma(t, x) - \sigma(t, y)| \leq K|x - y|$
Linear Growth: $|b(t, x)|^2 + |\sigma(t, x)|^2 \leq C(1 + |x|^2)$

for some constants $K, C > 0$ and all $t, x, y$ , there exists a unique strong solution $X_t$ to the SDE.
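
When both conditions hold, solutions can be approximated numerically. The sketch below uses the Euler-Maruyama scheme, a standard discretization that is not described in the text above: each step adds $b \, \Delta t$ plus $\sigma$ times a $\mathcal{N}(0, \Delta t)$ increment. As an assumed test case it uses the Ornstein-Uhlenbeck equation $dX_t = -\theta X_t \, dt + \sigma \, dW_t$ , whose coefficients satisfy both conditions and whose mean and variance at time $T$ are known in closed form. The function name and parameter values are illustrative.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n_steps, n_paths, rng):
    """Simulate X_T for dX = b(t, X) dt + sigma(t, X) dW on a uniform grid."""
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    t = 0.0
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x = x + b(t, x) * dt + sigma(t, x) * dW
        t += dt
    return x

theta, vol = 1.5, 0.4            # hypothetical mean-reversion speed and volatility
rng = np.random.default_rng(2)

X_T = euler_maruyama(
    b=lambda t, x: -theta * x,
    sigma=lambda t, x: vol * np.ones_like(x),
    x0=1.0, T=1.0, n_steps=500, n_paths=20_000, rng=rng,
)

# Known closed-form moments of the Ornstein-Uhlenbeck process at T = 1.
exact_mean = np.exp(-theta * 1.0)
exact_var = vol ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * 1.0))

print(f"simulated mean {X_T.mean():.4f} vs exact {exact_mean:.4f}")
print(f"simulated var  {X_T.var():.4f} vs exact {exact_var:.4f}")
```

The simulated moments agree with the exact ones up to discretization and Monte Carlo error; both shrink as `n_steps` and `n_paths` increase.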

The analysis, simulation, and integration of SDEs form the bedrock of continuously evolving systems subject to noise across physics, mathematical biology, and finance.

In the SDE framework, what does the term 'diffusion coefficient' refer to?

Regression Analysis

Regression analysis is a statistical method for estimating the relationships among variables. It focuses primarily on the relationship between a dependent variable (often called the response or outcome variable) and one or more independent variables (often called predictors, covariates, or explanatory variables). The objective is to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Simple Linear Regression

The most fundamental form of regression analysis is simple linear regression, which models the relationship between a single independent variable $X$ and a dependent variable $Y$ . The true relationship is postulated to be a linear function of $X$ plus a stochastic error term.

The population model is defined as: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

where:

$Y_i$ is the $i$ -th observation of the dependent variable.
$X_i$ is the $i$ -th observation of the independent variable.
$\beta_0$ is the $y$ -intercept (the expected value of $Y$ when $X = 0$ ).
$\beta_1$ is the slope coefficient (the expected change in $Y$ for a one-unit change in $X$ ).
$\epsilon_i$ is the unobserved random error or disturbance term.

Assumptions of Simple Linear Regression

For the standard estimation techniques to be valid and possess desirable statistical properties, certain assumptions regarding the error term $\epsilon_i$ must hold:

Linearity: The expected value of the response variable is a linear function of the explanatory variables. $\mathbb{E}[Y | X] = \beta_0 + \beta_1 X$ , meaning $\mathbb{E}[\epsilon | X] = 0$ .
Independence: The errors are uncorrelated with one another (and independent when they are also jointly normal). $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$ .
Homoscedasticity (Constant Variance): The errors have a constant variance across all levels of the independent variable. $\text{Var}(\epsilon_i | X_i) = \sigma^2$ for all $i$ .
Normality (optional for estimation, required for exact finite-sample inference): The errors are normally distributed. $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ .

Ordinary Least Squares (OLS) Estimation

The most common method for estimating the unknown parameters $\beta_0$ and $\beta_1$ is Ordinary Least Squares (OLS). The OLS method chooses the estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of the squared residuals (SSR).

The residual $e_i$ for the $i$ -th observation is the difference between the observed $Y_i$ and the predicted value $\hat{Y}_i$ : $e_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$

The Sum of Squared Residuals (SSR) is: $S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$

To minimize $S$ , we take the partial derivatives with respect to $\hat\beta_0$ and $\hat\beta_1$ and set them to zero: $\frac{\partial S}{\partial \hat\beta_0} = -2 \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$ and $\frac{\partial S}{\partial \hat\beta_1} = -2 \sum_{i=1}^n X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$

Solving these normal equations yields the OLS estimators: $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$ $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$

Where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$ , respectively.
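The closed-form estimators above can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper name `ols_simple` are hypothetical and chosen only for illustration.

```python
# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def ols_simple(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimator
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

beta0_hat, beta1_hat = ols_simple(xs, ys)

# Residuals should satisfy both normal equations (up to rounding error).
residuals = [y - (beta0_hat + beta1_hat * x) for x, y in zip(xs, ys)]
sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(xs, residuals))
print(beta0_hat, beta1_hat, sum_e, sum_xe)
```

For this data the slope is 19.9 / 10 = 1.99 and the intercept is 6.02 - 1.99 × 3 = 0.05; both residual sums are zero up to floating-point error, as the normal equations require.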

If the sample covariance between independent variable X and dependent variable Y is exactly zero, what is the value of the OLS estimator for the slope ($\hat\beta_1$)?

Multiple Linear Regression

Multiple linear regression extends the simple linear model to include two or more independent variables. The model with $k$ predictors is written as: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \epsilon_i$

Because writing out summations becomes unwieldy, multiple regression is almost universally represented using matrix algebra.

Let $Y$ be an $n \times 1$ vector of observations of the dependent variable, $X$ be an $n \times (k+1)$ matrix (the design matrix) where the first column is typically all 1s (for the intercept), $\beta$ be a $(k+1) \times 1$ vector of parameters, and $\epsilon$ be an $n \times 1$ vector of errors.

$Y = X\beta + \epsilon$

$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$

The OLS estimator vector $\hat\beta$ minimizes $(Y - X\hat\beta)^T (Y - X\hat\beta)$ . Expanding this and taking the derivative with respect to the vector $\hat\beta$ yields the matrix formulation of the normal equations: $X^T X \hat\beta = X^T Y$

Assuming $X^T X$ is invertible (which requires no perfect multicollinearity among the predictors), the OLS estimator is: $\hat\beta = (X^T X)^{-1} X^T Y$
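As an illustration of the matrix form, the sketch below solves the normal equations $X^T X \hat\beta = X^T Y$ directly instead of forming the inverse. It uses only the Python standard library; the helpers `transpose`, `matmul`, and `solve` are written here for the example, and the data are hypothetical, generated without noise from $Y = 1 + 2X_1 - X_2$ so the true coefficients are known.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (perfect multicollinearity)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Design matrix: a column of ones, then two hypothetical predictors.
X = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 3.0],
     [1.0, 3.0, 1.0],
     [1.0, 4.0, 2.0]]
Y = [1 + 2 * row[1] - 1 * row[2] for row in X]  # noise-free response

Xt = transpose(X)
XtX = matmul(Xt, X)                                       # (k+1) x (k+1)
XtY = [sum(x * y for x, y in zip(row, Y)) for row in Xt]  # (k+1) vector
beta_hat = solve(XtX, XtY)
print(beta_hat)
```

Because the data contain no noise, the recovered coefficients should match (1, 2, -1) up to rounding. Solving the linear system rather than inverting $X^T X$ is a common numerical choice; it is not required by the derivation above.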

The Gauss-Markov Theorem

The Gauss-Markov theorem justifies the use of the OLS estimator. It states that under the classical linear regression model assumptions (linearity, strict exogeneity/independence, no perfect multicollinearity, and homoscedasticity), the OLS estimator $\hat\beta$ is the Best Linear Unbiased Estimator (BLUE).

Linear: $\hat\beta$ is a linear function of the observed random variables $Y$ . We can write $\hat\beta = AY$ where $A = (X^T X)^{-1} X^T$ .
Unbiased: The expected value of the estimator is the true parameter. $\mathbb{E}[\hat\beta] = \beta$ . $\mathbb{E}[\hat\beta] = \mathbb{E}[(X^T X)^{-1} X^T (X\beta + \epsilon)] = \beta + (X^T X)^{-1} X^T \mathbb{E}[\epsilon] = \beta + 0 = \beta$
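Best: It has the minimum variance among all linear unbiased estimators. In matrix terms, $\text{Var}(\tilde{\beta}) - \text{Var}(\hat\beta_{OLS})$ is positive semidefinite for any other linear unbiased estimator $\tilde{\beta}$ , so each individual coefficient satisfies $\text{Var}(\hat\beta_{j,OLS}) \leq \text{Var}(\tilde{\beta}_j)$ .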

The variance-covariance matrix of the OLS estimator is: $\text{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$ Where $\sigma^2$ is the variance of the error term, typically estimated by $s^2 = \frac{e^T e}{n - k - 1}$ , with $e$ being the vector of residuals.

Which assumption is NOT required for the OLS estimators to be unbiased (part of the Gauss-Markov theorem)?

Goodness of Fit and Inference

To assess how well the model fits the data, we decompose the total variation in the dependent variable into explained and unexplained components.

Total Sum of Squares (SST): Measures the total variation in $Y$ around its mean. $\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2$
Model/Explained Sum of Squares (SSM): Measures the variation in $Y$ explained by the regression model. $\text{SSM} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$
Residual/Error Sum of Squares (SSR): Measures the variation in $Y$ not explained by the model. $\text{SSR} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n e_i^2$

The relationship is $\text{SST} = \text{SSM} + \text{SSR}$ .

Coefficient of Determination ( $R^2$ )

The $R^2$ statistic represents the proportion of variance in the dependent variable explained by the independent variables in the model. $R^2 = \frac{\text{SSM}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}}$

While $0 \leq R^2 \leq 1$ , adding more predictors to a model will mechanically never decrease $R^2$ , even if the predictors are irrelevant. To account for this, the Adjusted $R^2$ penalizes models for adding variables that do not significantly improve the fit: $\bar{R}^2 = 1 - \left( \frac{\text{SSR} / (n - k - 1)}{\text{SST} / (n - 1)} \right) = 1 - (1 - R^2)\frac{n-1}{n-k-1}$

Hypothesis Testing

Under the assumption that $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ , the OLS estimators are normally distributed: $\hat\beta \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1})$

Test of Individual Significance (t-test)

To test the hypothesis that a single independent variable $X_j$ has no effect on $Y$ (i.e., $H_0: \beta_j = 0$ ), a t-statistic is used: $t = \frac{\hat\beta_j - 0}{\text{SE}(\hat\beta_j)}$ where $\text{SE}(\hat\beta_j)$ is the standard error of the estimate, found directly from the square root of the $j$ -th diagonal element of the estimated variance-covariance matrix $s^2(X^T X)^{-1}$ . Under the null hypothesis, this statistic follows a Student’s t-distribution with $n - k - 1$ degrees of freedom.

Test of Overall Significance (F-test)

To test the joint hypothesis that all slope coefficients (excluding the intercept) are simultaneously zero ( $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ ), an F-statistic is constructed from the sums of squares: $F = \frac{\text{SSM} / k}{\text{SSR} / (n - k - 1)}$ Under the null hypothesis, this follows an F-distribution with $(k, n - k - 1)$ degrees of freedom. A large F-statistic provides evidence against the null hypothesis, indicating that at least one predictor variable is significantly related to the response variable.

Analyzing Real Estate Valuation

A data scientist constructs a multiple linear regression model to predict the price of houses ($Y$, in thousands of dollars) based on square footage ($X_1$), age of the house ($X_2$, in years), and distance to the city center ($X_3$, in miles). The estimated model is $\hat{Y} = 150 + 0.2X_1 - 1.5X_2 - 5.0X_3$. The $R^2$ is 0.75, the Adjusted $R^2$ is 0.74, and the sample size is $n=100$. The standard error for $\hat\beta_2$ is $0.5$.

You want to formally test if the age of the house has a statistically significant effect on the price at a 5% significance level. Calculate the t-statistic for $\hat\beta_2$ and describe the conclusion. Assume the critical t-value for $df = 96$ at $\alpha=0.05$ (two-tailed) is approximately 1.98.
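The arithmetic for this case can be laid out as a short script. This is a sketch that uses only the figures stated in the case; the overall F-statistic is obtained by dividing the numerator and denominator of $F$ by SST, which gives $F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$ .

```python
# Figures taken from the case description
beta2_hat = -1.5     # estimated coefficient on age
se_beta2 = 0.5       # its standard error
t_crit = 1.98        # two-tailed critical value given for df = 96
n, k = 100, 3        # sample size and number of predictors
r2 = 0.75

# t-test of H0: beta_2 = 0
t_stat = (beta2_hat - 0.0) / se_beta2
reject_h0 = abs(t_stat) > t_crit

# Consistency checks on the reported fit statistics
df_resid = n - k - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(t_stat, reject_h0, df_resid, round(adj_r2, 4), round(f_stat, 2))
```

The t-statistic is -3.0, whose absolute value exceeds 1.98, so the null hypothesis of no age effect is rejected at the 5% level. The adjusted $R^2$ computed from the formula is about 0.742, consistent with the reported 0.74.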

Residual Diagnostics

Estimation is only part of the process; structural validation ensures the model assumptions hold. Analyzing the residuals ( $e_i$ ) is the primary tool for diagnostics.

Non-linearity: Plot the residuals against the predicted values ( $\hat{Y}_i$ ) or against individual predictors ( $X_i$ ). A systematic pattern, such as a U-shape, suggests the relationship is non-linear, perhaps requiring polynomial terms or transformations.
Heteroscedasticity: If the spread of the residuals increases or decreases with $\hat{Y}_i$ (often forming a “funnel” shape in a residual plot), the constant variance assumption is violated. This makes OLS standard errors incorrect, invalidating hypothesis tests. Robust standard errors or Weighted Least Squares (WLS) can address this.
Non-normality: A Normal Q-Q (quantile-quantile) plot compares the distribution of the residuals to a theoretical normal distribution. Significant deviations from the straight line, particularly at the tails, imply non-normal errors.
Outliers and Leverage: Observations with extreme $Y$ values given their $X$ values are outliers. Observations with extreme $X$ values have high leverage. Points with both high leverage and large residuals exert undue influence on the regression line. Cook’s Distance is a metric used to quantify the overall influence of an observation on the estimated coefficients.

$D_i = \frac{\sum_{j=1}^n (\hat{Y}_j - \hat{Y}_{j(i)})^2}{(k+1)s^2}$ where $\hat{Y}_{j(i)}$ is the predicted value of the $j$ -th observation when the model is refitted without the $i$ -th observation. A high Cook’s distance indicates a highly influential data point.
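The leave-one-out definition of Cook's distance can be implemented literally by refitting the model once per observation. The sketch below does this for simple regression ( $k = 1$ ) in plain Python; the data and the helper names `fit` and `cooks_distances` are hypothetical, and the last point is deliberately placed far from the others in $X$ and well off the trend.

```python
def fit(xs, ys):
    """Closed-form simple OLS; returns (intercept, slope)."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    s_xx = sum((x - xb) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return yb - b1 * xb, b1

def cooks_distances(xs, ys):
    n, k = len(xs), 1
    b0, b1 = fit(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    # s^2 = e'e / (n - k - 1), the usual error-variance estimate
    s2 = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - k - 1)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict all n points
        c0, c1 = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        fitted_i = [c0 + c1 * x for x in xs]
        num = sum((f - g) ** 2 for f, g in zip(fitted, fitted_i))
        distances.append(num / ((k + 1) * s2))
    return distances

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]   # last point: high leverage, off the trend
d = cooks_distances(xs, ys)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, [round(v, 3) for v in d])
```

Refitting $n$ times is the most transparent approach but not the most efficient; statistical software usually computes the same quantity from residuals and leverages in a single fit.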

Regression analysis underpins both predictive modeling and causal inference, and it connects classical statistics to modern machine learning.

Section Detail

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is a collection of statistical models and their associated estimation procedures used to analyze the differences among group means in a sample. ANOVA was developed by statistician and evolutionary biologist Ronald Fisher. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the $t$ -test beyond two means.

While the $t$ -test is limited to comparing two groups, applying multiple $t$ -tests across several groups inflates the family-wise Type I error rate (the probability of at least one false positive). ANOVA controls this error rate by evaluating the entire set of groups simultaneously, partitioning the observed variance in a particular variable into components attributable to different sources of variation.

The Logic of Variance Partitioning

The fundamental mechanism of ANOVA is the partitioning of total variance into two primary components:

Between-Group Variance: The variance of the group means around the grand mean. This reflects the effect of the independent variable(s) plus error.
Within-Group Variance: The variance of individual scores around their respective group means. This reflects pure error (unexplained variance).

If the between-group variance is significantly larger than the within-group variance, it indicates that the independent variable has a significant effect on the dependent variable.

Assumptions of ANOVA

The validity of ANOVA relies on three core assumptions:

Independence of Observations: The residuals must be mutually independent. This is fundamentally a design issue handled through random sampling and random assignment.
Normality: The residuals of the model are normally distributed. While ANOVA is robust to moderate violations of normality (especially with large, equal sample sizes due to the Central Limit Theorem), severe skewness or outliers can compromise the $F$ -test.
Homogeneity of Variances (Homoscedasticity): The variances of the populations from which the samples are drawn are equal. This is tested using Levene’s Test or Bartlett’s Test. Welch’s ANOVA can be used if this assumption is heavily violated.

One-Way ANOVA

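A One-Way ANOVA involves a single independent variable (factor) with two or more categorical levels; it is typically used with three or more, since the two-level case is equivalent to an independent-samples $t$ -test. The model for an observation $y_{ij}$ (the $i$ -th observation in the $j$ -th group) is given by: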

$y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

Where:

$\mu$ is the grand mean.
$\tau_j$ is the treatment effect for the $j$ -th group (where $\sum \tau_j = 0$ ).
$\varepsilon_{ij}$ is the random error associated with the $i$ -th observation in the $j$ -th group, assumed to be $\mathcal{N}(0, \sigma^2)$ .

Hypotheses

The null hypothesis ( $H_0$ ) states that all group population means are equal (or equivalently, all treatment effects are zero): $H_0: \mu_1 = \mu_2 = \dots = \mu_k \quad \text{or} \quad \tau_1 = \tau_2 = \dots = \tau_k = 0$

The alternative hypothesis ( $H_a$ ) states that at least one population mean is different: $H_a: \exists \ i, j \text{ such that } \mu_i \neq \mu_j$

Sums of Squares

The Total Sum of Squares ( $SST$ ) is partitioned into the Sum of Squares Between ( $SSB$ ) and the Sum of Squares Within ( $SSW$ , also known as Error Sum of Squares, $SSE$ ).

$SST = SSB + SSW$

Total Sum of Squares (SST) measures the total variation in the data: $SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{..})^2$ where $\bar{y}_{..}$ is the grand mean.

Sum of Squares Between (SSB) measures the variation of group means around the grand mean: $SSB = \sum_{j=1}^{k} n_j (\bar{y}_{.j} - \bar{y}_{..})^2$ where $\bar{y}_{.j}$ is the mean of the $j$ -th group and $n_j$ is the number of observations in the $j$ -th group.

Sum of Squares Within (SSW) measures the variation of individual observations around their respective group means: $SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{.j})^2$

Degrees of Freedom and Mean Squares

Degrees of freedom ( $df$ ) are required to convert sums of squares into variances (mean squares). Let $N$ be the total sample size and $k$ be the number of groups.

$df_{Total} = N - 1$
$df_{Between} = k - 1$
$df_{Within} = N - k$

The Mean Squares ( $MS$ ) are calculated by dividing the Sum of Squares by their respective degrees of freedom:

$MSB = \frac{SSB}{k - 1}$ $MSW = \frac{SSW}{N - k}$

The F-Statistic

The test statistic for ANOVA is the ratio of the Mean Square Between to the Mean Square Within. Under the null hypothesis, both $MSB$ and $MSW$ are independent estimates of the population variance $\sigma^2$ , so their ratio follows an $F$ -distribution with $k-1$ and $N-k$ degrees of freedom.

$F = \frac{MSB}{MSW}$

If the $F$ -statistic is significantly larger than 1 (specifically, greater than the critical value from the $F$ -distribution for a given alpha level), the null hypothesis is rejected.
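The full chain from sums of squares to the $F$ -statistic can be traced on a small example. The sketch below uses hypothetical data for three groups of three observations and checks the identity $SST = SSB + SSW$ along the way.

```python
# Hypothetical data: three groups, three observations each.
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [6.0, 7.0, 8.0],
    "C": [9.0, 10.0, 11.0],
}

all_obs = [y for g in groups.values() for y in g]
N, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / N
means = {name: sum(g) / len(g) for name, g in groups.items()}

# Sums of squares
ssb = sum(len(g) * (means[name] - grand_mean) ** 2 for name, g in groups.items())
ssw = sum((y - means[name]) ** 2 for name, g in groups.items() for y in g)
sst = sum((y - grand_mean) ** 2 for y in all_obs)

# Mean squares and F-statistic
msb = ssb / (k - 1)
msw = ssw / (N - k)
f_stat = msb / msw
print(ssb, ssw, sst, msb, msw, f_stat)
```

Here the group means are 5, 7, and 10, giving $SSB = 38$ , $SSW = 6$ , and $SST = 44$ , so $MSB = 19$ , $MSW = 1$ , and $F = 19$ on $(2, 6)$ degrees of freedom. The statistic would then be compared with the critical value of the $F$ -distribution at the chosen alpha level.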

In a One-Way ANOVA with 4 groups and 40 total participants, what are the degrees of freedom for the F-statistic (numerator and denominator)?

Evaluating Teaching Methods

A university aims to determine if three different teaching methods (Standard Lecture, Flipped Classroom, Problem-Based Learning) result in different final exam scores. 90 students are randomly assigned to the three methods (30 per method). The resulting Sum of Squares Between (SSB) is calculated as 450, and the Sum of Squares Within (SSW) is 2610.

Calculate the Mean Square Between (MSB) and Mean Square Within (MSW).

Two-Way ANOVA

A Two-Way ANOVA analyzes the effect of two independent categorical variables (factors) on a continuous dependent variable. It fundamentally differs from running two independent One-Way ANOVAs because it evaluates the interaction effect between the two variables.

The statistical model for a Two-Way ANOVA with factors $A$ and $B$ , fixed effects, and with replication ( $n$ observations per cell) is:

$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$

Where:

$y_{ijk}$ is the $k$ -th observation in the $i$ -th level of factor $A$ and $j$ -th level of factor $B$ .
$\mu$ is the overall population grand mean.
$\alpha_i$ is the main effect of factor A at level $i$ .
$\beta_j$ is the main effect of factor B at level $j$ .
$(\alpha\beta)_{ij}$ is the interaction effect between level $i$ of A and level $j$ of B.
$\varepsilon_{ijk}$ is the random error term, $\sim \mathcal{N}(0, \sigma^2)$ .

Interaction Effects

An interaction effect occurs when the effect of one independent variable on the dependent variable changes depending on the level of the other independent variable. Graphically, this is observed when the lines representing the means across levels of factors are not parallel (they may cross or diverge).

If the interaction effect is significant, interpreting the main effects (the individual effects of factor $A$ and factor $B$ ) becomes highly nuanced, as the main effects no longer fully describe the relationship.

Sums of Squares for Two-Way ANOVA

In a balanced design (equal sample sizes in all cells), the total variance is partitioned into four orthogonal components:

$SST = SSA + SSB + SSAB + SSE$

Where:

SSA: Sum of Squares for Factor A
SSB: Sum of Squares for Factor B
SSAB: Sum of Squares for the Interaction
SSE: Sum of Squares for Error (Within)

Degrees of freedom are similarly partitioned: Let $a$ be the number of levels of Factor A, $b$ be the number of levels of Factor B, and $n$ the number of replicates per cell. Total observations $N = a \times b \times n$ .

$df_A = a - 1$
$df_B = b - 1$
$df_{AB} = (a - 1)(b - 1)$
$df_E = ab(n - 1)$

Three distinct $F$ -tests are performed by dividing the corresponding Mean Square ( $MSA, MSB, MSAB$ ) by the Mean Square Error ( $MSE$ ):

$F_A = \frac{MSA}{MSE}, \quad F_B = \frac{MSB}{MSE}, \quad F_{AB} = \frac{MSAB}{MSE}$
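The degrees-of-freedom partition can be verified with a few lines of code. This sketch uses a hypothetical balanced design with $a = 4$ , $b = 3$ , and $n = 5$ ; the function name `two_way_df` is an assumption made for the example.

```python
def two_way_df(a, b, n):
    """Degrees of freedom for a balanced two-way ANOVA with replication."""
    return {
        "A": a - 1,
        "B": b - 1,
        "AB": (a - 1) * (b - 1),
        "E": a * b * (n - 1),
        "Total": a * b * n - 1,
    }

df = two_way_df(a=4, b=3, n=5)
# The four component degrees of freedom must add up to N - 1.
component_sum = df["A"] + df["B"] + df["AB"] + df["E"]
print(df, component_sum)
```

For this design the components are 3, 2, 6, and 48, which sum to $N - 1 = 59$ .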

In a Two-Way ANOVA, you are studying the effects of Diet (3 levels) and Exercise (2 levels) on weight loss. You have 10 participants per cell (6 cells total). What are the degrees of freedom for the interaction effect (Diet × Exercise)?

Post-Hoc Tests

A significant ANOVA only tells you that at least two means differ, not which means differ. To identify specific pairwise differences, post-hoc tests are required. Conducting multiple standard $t$ -tests inflates the family-wise error rate $\alpha_{FW}$ (the probability of making at least one Type I error across all tests).

$\alpha_{FW} = 1 - (1 - \alpha)^c$ where $c$ is the number of comparisons. For 5 groups, there are $c = \frac{5(4)}{2} = 10$ comparisons. If $\alpha=0.05$ per test, the family-wise error rate jumps to $1 - (0.95)^{10} \approx 0.40$ (assuming independence, which is an oversimplification but illustrates the inflation).

Common Post-Hoc Adjustments

Tukey’s Honestly Significant Difference (HSD): Compares all possible pairs of means. It is based on the studentized range distribution ( $q$ ) and provides tight control over the family-wise error rate when sample sizes are equal.
Bonferroni Correction: A simple and conservative method. It divides the desired family-wise alpha level by the number of comparisons: $\alpha_{corrected} = \frac{\alpha}{c}$ . It guarantees that the family-wise error rate does not exceed $\alpha$ , but it reduces statistical power (increasing Type II errors), especially when the number of comparisons is large.
Scheffé’s Method: Used for all possible linear contrasts, not just pairwise comparisons. It is the most conservative post-hoc test when performing purely pairwise comparisons, but is highly flexible.
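The inflation formula and the Bonferroni adjustment can be checked numerically. The sketch below reproduces the five-group example and, under the same simplifying independence assumption, shows the family-wise rate after the Bonferroni adjustment; the function name `familywise_error` is chosen for the example.

```python
from math import comb

def familywise_error(alpha, c):
    """P(at least one Type I error) across c independent tests."""
    return 1 - (1 - alpha) ** c

n_groups = 5
alpha = 0.05
c = comb(n_groups, 2)                      # number of pairwise comparisons

fwer_unadjusted = familywise_error(alpha, c)
alpha_bonferroni = alpha / c               # per-comparison level
fwer_bonferroni = familywise_error(alpha_bonferroni, c)

print(c, round(fwer_unadjusted, 4), alpha_bonferroni, round(fwer_bonferroni, 4))
```

With ten comparisons the unadjusted rate is about 0.40, while testing each comparison at 0.005 keeps the family-wise rate just under 0.05.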

Which of the following correction methods is considered the most conservative and provides the lowest statistical power for detecting genuine differences?

Effect Size

The $p$ -value from an $F$ -test indicates statistical significance but not practical significance. Effect size metrics quantify the magnitude of the differences between groups.

Eta-Squared ( $\eta^2$ )

Eta-squared represents the proportion of total variance in the dependent variable that is associated with membership in the different groups defined by the independent variable.

$\eta^2 = \frac{SSB}{SST}$

While intuitive, $\eta^2$ is an upwardly biased estimator of the population effect size (it tends to overestimate).

Partial Eta-Squared ( $\eta_p^2$ )

In multi-factor designs (like Two-Way ANOVA), $\eta^2$ can be misleading because the effects of one factor reduce the variance available to be explained by another. Partial eta-squared isolates the variance explained by a specific factor relative to the unexplained variance (error) and the variance of that specific factor.

$\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SSE}$

Omega-Squared ( $\omega^2$ )

Omega-squared is a more complex but less biased estimator of the population variance explained. It corrects for most of the bias present in $\eta^2$ by incorporating degrees of freedom and Mean Square terms.

$\omega^2 = \frac{SSB - df_{Between} \times MSW}{SST + MSW}$
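To see how much the correction matters, the sketch below computes $\eta^2$ and $\omega^2$ from the sums of squares in the teaching-methods case ( $SSB = 450$ , $SSW = 2610$ , three groups, 90 students). The function name `effect_sizes` is an assumption made for the example.

```python
def effect_sizes(ssb, ssw, k, n_total):
    """Return (eta_squared, omega_squared) for a one-way ANOVA."""
    sst = ssb + ssw
    msw = ssw / (n_total - k)
    eta_sq = ssb / sst
    omega_sq = (ssb - (k - 1) * msw) / (sst + msw)
    return eta_sq, omega_sq

eta_sq, omega_sq = effect_sizes(ssb=450.0, ssw=2610.0, k=3, n_total=90)
print(round(eta_sq, 4), round(omega_sq, 4))
```

Here $\eta^2 = 450 / 3060 \approx 0.147$ while $\omega^2 = 390 / 3090 \approx 0.126$ ; the smaller value reflects the downward adjustment for the bias in $\eta^2$ .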

Interpreting Effect Sizes in a Multi-Factor Design

A researcher conducts a Two-Way ANOVA assessing the impact of Drug Dosage (A) and Therapy (B) on symptom reduction. The output yields the following sums of squares: SSA = 400, SSB = 100, SSAB = 50, SSE = 450. Total SST = 1000.

Calculate the eta-squared (η²) for Drug Dosage (A).

Repeated Measures ANOVA

Repeated Measures ANOVA is the equivalent of the one-way ANOVA for related rather than independent groups. It is the extension of the dependent (paired) $t$ -test. Examples include measuring the same participants across multiple time points (e.g., blood pressure at baseline, week 1, and week 2) or exposing the same participants to all conditions in an experiment.

The key advantage of Repeated Measures ANOVA is that it removes variance attributable to individual differences from the Error Sum of Squares. This typically makes the analysis much more powerful (higher probability of detecting a true effect) than a standard independent-samples ANOVA.

$SST = SS_{\text{Between Subjects}} + SS_{\text{Within Subjects}}$ $SS_{\text{Within Subjects}} = SS_{\text{Treatment}} + SS_{\text{Error}}$

The Assumption of Sphericity

Repeated measures designs require the assumption of Sphericity. Sphericity requires that the variances of the differences between all pairs of related groups are equal. It is evaluated using Mauchly’s Test of Sphericity.

If the assumption of sphericity is violated (Mauchly’s Test $p < 0.05$ ), the Type I error rate inflates. To correct this, the degrees of freedom are adjusted downwards. Common corrections include:

Greenhouse-Geisser Correction: The most conservative correction. Used when sphericity is severely violated (epsilon $\epsilon < 0.75$ ).
Huynh-Feldt Correction: Less conservative, used when sphericity violation is mild (epsilon $\epsilon > 0.75$ ).

If $\epsilon = 1$ , the sphericity assumption holds exactly; the further $\epsilon$ falls below 1, the more severe the violation. The corrections multiply both the numerator and denominator degrees of freedom by $\epsilon$ , which raises the critical $F$ -value required for significance.
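The effect of an epsilon correction on the degrees of freedom can be sketched as follows, for a one-way repeated measures design with $k$ conditions and $n$ subjects. The function name `corrected_df` and the example values are hypothetical; the uncorrected degrees of freedom are $k - 1$ for the treatment and $(k - 1)(n - 1)$ for the error.

```python
def corrected_df(k, n, epsilon):
    """Epsilon-adjusted degrees of freedom for a repeated measures F-test."""
    df_treatment = epsilon * (k - 1)
    df_error = epsilon * (k - 1) * (n - 1)
    return df_treatment, df_error

uncorrected = corrected_df(k=3, n=20, epsilon=1.0)   # sphericity holds exactly
corrected = corrected_df(k=3, n=20, epsilon=0.6)     # hypothetical epsilon estimate
print(uncorrected, corrected)
```

With three conditions and 20 subjects, the degrees of freedom fall from (2, 38) to roughly (1.2, 22.8) when $\epsilon = 0.6$ . The $F$ -statistic itself is unchanged; only the reference distribution used for the $p$ -value changes.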

A researcher conducts a repeated measures ANOVA and finds that Mauchly's Test is highly significant (p < .001), yielding a Greenhouse-Geisser epsilon (ε) of 0.52. What action should be taken?

Section Detail

Time Series Analysis

A time series is a sequence of data points indexed in time order. Formally, a time series is a stochastic process $(X_t)$ for $t \in T$ , where $T$ is an index set, typically $\mathbb{Z}$ or $\mathbb{N}$ for discrete-time time series. Analysis of time series involves understanding the underlying structure and function that produced the data, often for the purpose of forecasting future values.

The foundational assumption in many time series models is stationarity. A time series $(X_t)$ is strictly stationary if the joint distribution of $(X_{t_1}, X_{t_2}, \dots, X_{t_k})$ is identical to that of $(X_{t_1+\tau}, X_{t_2+\tau}, \dots, X_{t_k+\tau})$ for every $k \geq 1$ , all time points $t_1, \dots, t_k \in T$ , and every shift $\tau$ .

In practice, strict stationarity is often too restrictive. Weak stationarity (or wide-sense stationarity) requires only that the first two moments are invariant with respect to time translation:

$\mathbb{E}[X_t] = \mu$ for all $t \in T$ .
$\text{Cov}(X_t, X_{t+\tau}) = \gamma(\tau)$ for all $t, \tau \in T$ . The function $\gamma(\tau)$ is the autocovariance function at lag $\tau$ . The autocorrelation function (ACF) is defined as $\rho(\tau) = \frac{\gamma(\tau)}{\gamma(0)}$ .

Foundational Time Series Processes

White Noise

A sequence of uncorrelated random variables $(w_t)$ with mean zero and finite, constant variance $\sigma_w^2$ is termed a white noise process, denoted $w_t \sim WN(0, \sigma_w^2)$ . The autocovariance function for white noise is given by $\gamma(\tau) = \sigma_w^2$ if $\tau = 0$ , and $0$ otherwise. When the process $w_t$ consists of independent and identically distributed (i.i.d.) random variables, it is termed strictly white noise. Gaussian white noise assumes $w_t \sim \mathcal{N}(0, \sigma_w^2)$ .

Random Walk

A random walk is defined by the process $X_t = X_{t-1} + w_t$ , where $w_t \sim WN(0, \sigma_w^2)$ . Expanding this equation yields $X_t = \sum_{j=1}^t w_j$ (assuming $X_0 = 0$ ). The expected value is $\mathbb{E}[X_t] = 0$ , but the variance is $\text{Var}(X_t) = t \sigma_w^2$ . Because the variance grows with $t$ , a random walk is non-stationary. The covariance between $X_t$ and $X_s$ (where $t > s$ ) is $s \sigma_w^2$ , which depends on $s$ itself and not only on the lag $t - s$ .

Which of the following processes is strictly stationary?

Linear Models: AR, MA, and ARMA

Linear time series models capture the linear dependencies between observations.

Autoregressive (AR) Models

An autoregressive model of order $p$ , denoted AR( $p$ ), models the current value $X_t$ as a linear combination of its $p$ previous values plus a white noise term: $X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + w_t$ Using the backshift operator $B$ , where $B^k X_t = X_{t-k}$ , the AR( $p$ ) model implies: $\Phi(B) X_t = c + w_t$ where $\Phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$ is the autoregressive polynomial. For an AR( $p$ ) process to be stationary, all roots of the characteristic equation $\Phi(z) = 0$ must lie outside the unit circle in the complex plane ( $|z| > 1$ ). For an AR(1) process $X_t = \phi_1 X_{t-1} + w_t$ , the condition simplifies to $|\phi_1| < 1$ , yielding ACF $\rho(\tau) = \phi_1^{|\tau|}$ .
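The root condition can be checked directly for low-order models. The sketch below finds the roots of $\Phi(z) = 1 - \phi_1 z - \phi_2 z^2$ with the quadratic formula and tests whether they lie outside the unit circle; the function names and the coefficient values are hypothetical.

```python
import cmath

def ar2_roots(phi1, phi2=0.0):
    """Roots of Phi(z) = 1 - phi1*z - phi2*z**2."""
    if phi2 == 0:
        # AR(1): single root 1/phi1 (no roots at all if phi1 is also zero)
        return [1 / phi1] if phi1 != 0 else []
    # Equivalent quadratic: phi2*z^2 + phi1*z - 1 = 0
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return [(-phi1 + disc) / (2 * phi2), (-phi1 - disc) / (2 * phi2)]

def is_stationary(phi1, phi2=0.0):
    """True if every root of Phi(z) lies strictly outside the unit circle."""
    return all(abs(z) > 1 for z in ar2_roots(phi1, phi2))

print(is_stationary(0.5, 0.3))   # roots near 1.17 and -2.84
print(is_stationary(0.5, 0.6))   # one root near 0.94, inside the unit circle
print(is_stationary(1.0))        # random walk: root exactly on the unit circle
```

For AR(1) the check reduces to $|\phi_1| < 1$ , matching the condition stated above; the random walk ( $\phi_1 = 1$ ) fails because its root lies on the unit circle.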

Moving Average (MA) Models

A moving average model of order $q$ , denoted MA( $q$ ), expresses $X_t$ as a linear combination of the current and $q$ previous white noise terms: $X_t = \mu + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q}$ Using the moving average polynomial $\Theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$ , this is written as $X_t = \mu + \Theta(B) w_t$ . Every finite-order MA process is stationary because it is a finite linear combination of stationary white noise processes. The autocovariance $\gamma(\tau) = 0$ for $|\tau| > q$ , dictating that the ACF cuts off after lag $q$ .

Invertibility of an MA process ensures that it can be uniquely expressed as an infinite-order AR process. An MA( $q$ ) model is invertible if all roots of $\Theta(z) = 0$ lie outside the unit circle.
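The cut-off of the MA( $q$ ) autocorrelation function can be verified from the standard closed form $\gamma(\tau) = \sigma_w^2 \sum_{j=0}^{q-\tau} \theta_j \theta_{j+\tau}$ with $\theta_0 = 1$ , a well-known result that is not derived in this section. The sketch below works in units of $\sigma_w^2$ ; the function name `ma_acf` and the coefficient value are hypothetical.

```python
def ma_acf(thetas, max_lag):
    """Theoretical ACF of an MA(q) process with coefficients theta_1..theta_q."""
    psi = [1.0] + list(thetas)          # theta_0 = 1
    q = len(thetas)

    def gamma(lag):
        # Autocovariance in units of sigma_w^2; zero beyond lag q
        if lag > q:
            return 0.0
        return sum(psi[j] * psi[j + lag] for j in range(q - lag + 1))

    g0 = gamma(0)
    return [gamma(h) / g0 for h in range(max_lag + 1)]

acf = ma_acf([0.5], max_lag=3)          # MA(1) with theta_1 = 0.5
print(acf)
```

For this MA(1) process, $\rho(1) = 0.5 / 1.25 = 0.4$ and every autocorrelation beyond lag 1 is zero.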

ARMA and ARIMA Models

Combining AR and MA concepts forms the Autoregressive Moving Average model, ARMA( $p, q$ ): $\Phi(B) X_t = c + \Theta(B) w_t$

Stationarity and invertibility of the ARMA process depend on the roots of $\Phi(z)$ and $\Theta(z)$ respectively. Time series exhibiting non-stationarity in the mean, such as trends, require differencing. First-order differencing $\nabla X_t = X_t - X_{t-1} = (1-B)X_t$ removes linear trends; second-order removes quadratic trends. Applying $d$ differences produces an Autoregressive Integrated Moving Average model, ARIMA( $p, d, q$ ): $\Phi(B) (1-B)^d X_t = c + \Theta(B) w_t$
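The effect of differencing on polynomial trends is easy to confirm on deterministic sequences. The sketch below applies the operator $(1 - B)^d$ to a hypothetical linear trend and a hypothetical quadratic trend; the function name `difference` is chosen for the example.

```python
def difference(series, d=1):
    """Apply (1 - B)^d to a sequence; each pass shortens it by one element."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

linear = [3.0 + 2.0 * t for t in range(8)]          # trend with slope 2
quadratic = [1.0 + 0.5 * t * t for t in range(8)]   # quadratic trend

d1_linear = difference(linear, d=1)
d2_quadratic = difference(quadratic, d=2)
print(d1_linear, d2_quadratic)
```

One difference turns the linear trend into the constant 2.0, and two differences turn the quadratic trend into the constant 1.0. In an ARIMA model this constant is absorbed by the term $c$ .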

Modeling Exchange Rate Fluctuations

You are building a time series model for daily foreign exchange rates between USD and EUR. The log of the daily price, log(P_t), exhibits wandering behavior resembling a random walk. When you plot the differences X_t = log(P_t) - log(P_{t-1}), the resulting series mean-reverts to zero. The ACF of X_t shows significant spikes at lags 1 and 2 and is not significantly different from zero afterwards. The Partial Autocorrelation Function (PACF) gradually decays toward zero.

Based on the properties of X_t, what ARIMA model structure best represents the log price process log(P_t)?

Partial Autocorrelation Function (PACF)

While the ACF measures the linear dependence between $X_t$ and $X_{t+\tau}$ inclusive of intermediate effects, the Partial Autocorrelation Function (PACF) isolates the direct correlation. The PACF at lag $\tau$ , denoted $\phi_{\tau\tau}$ , represents the correlation between $X_t$ and $X_{t+\tau}$ after removing the linear dependence of both variables on the intermediate values $X_{t+1}, \dots, X_{t+\tau-1}$ .

For an AR( $p$ ) process, the PACF cuts off strictly after lag $p$ ( $\phi_{\tau\tau} = 0$ for $\tau > p$ ). Conversely, for an MA( $q$ ) process, the PACF tails off gradually. This dualistic behavior provides the foundation for the Box-Jenkins model identification methodology.

Spectral Analysis

Time domain analysis emphasizes serial correlations over time lags. Spectral analysis (frequency domain analysis) decomposes the variance of a time series over a continuous spectrum of angular frequencies $\omega \in [-\pi, \pi]$ . For a stationary process with autocovariance function $\gamma(\tau)$ , the spectral density function $f(\omega)$ is the Fourier transform of the autocovariance sequence: $f(\omega) = \frac{1}{2\pi} \sum_{\tau=-\infty}^\infty \gamma(\tau) e^{-i \omega \tau}$ The total variance of the process is the integral of the spectral density over the frequency band: $\gamma(0) = \int_{-\pi}^\pi f(\omega) d\omega$ A peak at a specific frequency $\omega_0$ in the spectral density plot implies periodic behavior with cycle length $\frac{2\pi}{\omega_0}$ . For white noise, $\gamma(\tau)$ is exactly zero at all $\tau \neq 0$ , so the spectral density is perfectly flat: $f(\omega) = \frac{\sigma_w^2}{2\pi}$ .

Filtering Operations in the frequency domain allow straightforward manipulation of time series signals. An LTI (Linear Time-Invariant) filter defined by sequence $(a_j)$ applies the convolution $Y_t = \sum_j a_j X_{t-j}$ . The frequency response function of the filter is $A(\omega) = \sum_j a_j e^{-i \omega j}$ . The spectral density of the filtered output modifies according to: $f_Y(\omega) = |A(\omega)|^2 f_X(\omega)$
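The filtering relation can be illustrated with a three-point moving average applied to white noise. In the sketch below the filter coefficients, the unit noise variance, and the function names are hypothetical; the coefficients are indexed $j = 0, 1, 2$ .

```python
import cmath
import math

def frequency_response(coeffs, omega):
    """A(omega) = sum_j a_j * exp(-i*omega*j) for j = 0, ..., len(coeffs)-1."""
    return sum(a * cmath.exp(-1j * omega * j) for j, a in enumerate(coeffs))

def power_gain(coeffs, omega):
    """Squared modulus |A(omega)|^2, the factor applied to the input spectrum."""
    return abs(frequency_response(coeffs, omega)) ** 2

ma3 = [1 / 3, 1 / 3, 1 / 3]           # three-point moving average
sigma2 = 1.0                          # assumed white noise variance
f_x = sigma2 / (2 * math.pi)          # flat input spectrum

f_y_zero = power_gain(ma3, 0.0) * f_x               # frequency zero passes unchanged
f_y_null = power_gain(ma3, 2 * math.pi / 3) * f_x   # period-3 cycles are removed
print(f_y_zero, f_y_null)
```

The gain is 1 at $\omega = 0$ and 0 at $\omega = 2\pi/3$ , so this filter preserves the slowest variation and completely removes cycles of length 3, which is why moving averages act as low-pass smoothers.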

Multivariate Time Series and Vector Autoregression (VAR)

When assessing the joint dynamics of multiple interrelated time series $\mathbf{X}_t = (X_{1t}, X_{2t}, \dots, X_{kt})^\top$ , univariate ARIMA models are insufficient. The Vector Autoregressive model of order $p$ , VAR( $p$ ), generalizes the AR structure to dimension $k$ : $\mathbf{X}_t = \mathbf{c} + \mathbf{\Phi}_1 \mathbf{X}_{t-1} + \dots + \mathbf{\Phi}_p \mathbf{X}_{t-p} + \mathbf{w}_t$ where $\mathbf{\Phi}_i$ are $k \times k$ coefficient matrices and $\mathbf{w}_t$ is a $k$ -dimensional zero-mean multivariate white noise vector with covariance matrix $\mathbf{\Sigma}$ .

Stationarity of a VAR system requires that all roots of the determinantal equation $|\mathbf{I}_k - \mathbf{\Phi}_1 z - \dots - \mathbf{\Phi}_p z^p| = 0$ lie strictly outside the complex unit circle. VAR models provide a natural framework for Granger causality: $X_1$ Granger-causes $X_2$ if past values of $X_1$ improve the prediction of $X_2$ beyond what can be achieved from the past of $X_2$ alone.

State-Space Models and the Kalman Filter

A more general framework is provided by state-space models. A state-space model describes the observations through an underlying, unobserved sequence of state variables $\mathbf{\alpha}_t$ . The model consists of two stochastic equations:

Measurement Equation: Links observed data $\mathbf{y}_t$ to the unobserved state. $\mathbf{y}_t = \mathbf{Z}_t \mathbf{\alpha}_t + \mathbf{\epsilon}_t, \quad \mathbf{\epsilon}_t \sim \mathcal{N}(0, \mathbf{H}_t)$
State Equation (Transition Equation): Governs Markovian state evolution over sequence steps. $\mathbf{\alpha}_{t+1} = \mathbf{T}_t \mathbf{\alpha}_t + \mathbf{\eta}_t, \quad \mathbf{\eta}_t \sim \mathcal{N}(0, \mathbf{Q}_t)$

Here, $\mathbf{\epsilon}_t$ is the measurement noise and $\mathbf{\eta}_t$ is the state (transition) disturbance. The system matrices $\mathbf{Z}_t, \mathbf{T}_t, \mathbf{H}_t, \mathbf{Q}_t$ determine the dynamics and the noise structure of the model.

The Kalman filter is a recursive algorithm for computing the minimum mean-squared error (MMSE) estimator of the state vector $\mathbf{\alpha}_t$ given the observations up to time $t$ , $Y_t = \{y_1, \dots, y_t\}$ . Each iteration alternates between a prediction step, which propagates the state estimate through the state equation, and an update (correction) step, in which the Kalman gain adjusts the prediction in proportion to the innovation, the difference between the new observation and its predicted value.
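The recursion can be sketched for the simplest case, a scalar model with time-invariant system quantities. The code below is a minimal illustration, not a general implementation: the function name `kalman_filter`, the default parameter values, the large initial variance `P0` (standing in for a vague prior), and the observations are all assumptions made for the example.

```python
def kalman_filter(ys, Z=1.0, T=1.0, H=1.0, Q=0.0, a0=0.0, P0=1e6):
    """Scalar Kalman filter; returns a list of (filtered mean, filtered variance)."""
    a, P = a0, P0                  # predicted state mean and variance
    filtered = []
    for y in ys:
        # Update step: correct the prediction using the new observation
        v = y - Z * a              # innovation
        F = Z * P * Z + H          # innovation variance
        K = P * Z / F              # Kalman gain
        a_filt = a + K * v
        P_filt = (1 - K * Z) * P
        filtered.append((a_filt, P_filt))
        # Prediction step: propagate through the state equation
        a = T * a_filt
        P = T * P_filt * T + Q
    return filtered

ys = [4.8, 5.3, 5.1, 4.9, 5.2]     # hypothetical noisy readings of a constant level
filtered = kalman_filter(ys)
a_last, P_last = filtered[-1]
print(a_last, P_last)
```

With $Q = 0$ the state is a fixed unknown level, and with a very vague prior the filtered mean is essentially the running sample mean (about 5.06 here), while the filtered variance shrinks roughly as $H / t$ . Setting $Q > 0$ would let the level drift and give more weight to recent observations.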

In a linear state-space model estimated with the Kalman filter, which step of the recursion incorporates the information carried by the new observation y_t?

Structural Breakpoints and Non-Linearities

Standard parametric models often fail on long macroeconomic series because the data-generating mechanism itself shifts over time. A structural break is a definitive change in the parameters governing an otherwise stationary process. Testing for breaks amounts to partitioning the series at candidate break dates, typically associated with systemic shocks, and checking whether different ARMA parameters are needed on each segment.

Alternatively, ARCH/GARCH models directly describe series whose conditional variance changes over time. The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model specifies the conditional variance sequence $\sigma_t^2$ dynamically: $X_t = \sigma_t z_t, \quad z_t \sim WN(0, 1)$ $\sigma_t^2 = \omega + \sum_{i=1}^q \alpha_i X_{t-i}^2 + \sum_{j=1}^p \beta_j \sigma_{t-j}^2$ The GARCH formulation captures volatility clustering, a feature central to contemporary financial risk modeling.
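The variance recursion can be illustrated by simulating the simplest case, GARCH(1,1), with $p = q = 1$ . This is a sketch under stated assumptions: the function name `simulate_garch11` is hypothetical, the shocks $z_t$ are taken to be i.i.d. standard normal, and the recursion is started at the unconditional variance $\omega / (1 - \alpha - \beta)$ , which requires $\alpha + \beta < 1$ .

```python
import random


def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate X_t = sigma_t * z_t with
    sigma_t^2 = omega + alpha * X_{t-1}^2 + beta * sigma_{t-1}^2.

    Assumes Gaussian z_t and a covariance-stationary parameter choice.
    Returns the simulated series and its conditional variances.
    """
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    xs, sigma2s = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = sigma2 ** 0.5 * z
        xs.append(x)
        sigma2s.append(sigma2)
        # Next period's variance reacts to today's squared shock.
        sigma2 = omega + alpha * x * x + beta * sigma2
    return xs, sigma2s


xs, sigma2s = simulate_garch11(500, omega=0.1, alpha=0.1, beta=0.8)
print(max(sigma2s), min(sigma2s))  # the conditional variance moves over time
```

A large shock raises the next period's variance, and the $\beta$ term makes that elevated variance persist, which is what produces clusters of volatile observations.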

More advanced approaches include threshold autoregressive (TAR) models, which capture non-linear dynamics, and fractionally integrated (ARFIMA) models, designed for processes with long-range dependence, whose ACF decays hyperbolically rather than exponentially.


Non-Parametric Statistics

Statistical inference often relies on parametric assumptions, specifically that the population from which the sample is drawn follows a known probability distribution, typically the normal distribution, characterized by a set of parameters (e.g., mean $\mu$ and variance $\sigma^2$ ). Non-parametric statistics, in contrast, provide procedures for inferring properties of populations that do not rely on restrictive assumptions regarding the underlying parameterized probability distributions.

These methods are essential when sample sizes are small, data are ordinal or nominal, or severe departures from normality are evident. While non-parametric tests are more robust to distributional violations, they generally possess less statistical power compared to their parametric counterparts when the parametric assumptions are actually met.

The Sign Test

The sign test is one of the simplest non-parametric tests, used to assess whether the median of a continuous distribution equals a hypothesized value $M_0$ . It is a non-parametric alternative to the one-sample t-test.

Let $X_1, X_2, \dots, X_n$ be a random sample from a continuous distribution with median $M$ . We wish to test the null hypothesis $H_0: M = M_0$ .

The test statistic $S$ is defined as the number of sample observations strictly greater than $M_0$ . Under $H_0$ , each observation has a 0.5 probability of being greater than $M_0$ , assuming continuity. Thus, $S$ follows a binomial distribution: $S \sim \text{Binomial}(N, p = 0.5)$ where $N$ is the effective sample size, discarding any ties where $X_i = M_0$ .

For large $N$ (typically $N > 20$ ), a normal approximation can be used: $Z = \frac{S - \frac{N}{2}}{\sqrt{\frac{N}{4}}} \sim \mathcal{N}(0, 1)$ A continuity correction of $0.5$ is often applied to $S$ for greater accuracy.
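For small samples the binomial distribution can be evaluated exactly instead of using the normal approximation. The following sketch computes $S$ , the effective sample size $N$ , and a two-sided exact p-value; the function name `sign_test` and the choice of doubling the smaller tail are assumptions of this illustration.

```python
from math import comb


def sign_test(xs, m0):
    """Exact two-sided sign test of H0: median = m0.

    Returns (S, N, p_value), where ties with m0 are discarded.
    """
    kept = [x for x in xs if x != m0]          # discard ties
    n = len(kept)
    s = sum(1 for x in kept if x > m0)         # observations above m0
    # Binomial(N, 0.5) tail probabilities.
    lower = sum(comb(n, k) for k in range(0, s + 1)) / 2 ** n
    upper = sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n
    p_value = min(1.0, 2.0 * min(lower, upper))
    return s, n, p_value


s, n, p = sign_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], m0=2.5)
print(s, n, p)  # 8 of 10 observations exceed 2.5
```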

Why might the sign test discard observations equal to the hypothesized median $M_0$?

Wilcoxon Signed-Rank Test

The sign test ignores the magnitude of the differences between the observations and the hypothesized median. The Wilcoxon signed-rank test incorporates this magnitude, requiring the assumption that the underlying continuous distribution is symmetric about its median. It serves as a more powerful non-parametric alternative to the paired Student’s t-test or the one-sample t-test.

Given pairs of observations $(X_i, Y_i)$ for $i = 1, \dots, n$ , compute the differences $D_i = X_i - Y_i$ .

Discard pairs where $D_i = 0$ . Let $N$ be the reduced sample size.
Rank the absolute differences $|D_i|$ from smallest to largest. Ties are assigned the average of the ranks they would have received. Let $R_i$ be the rank of $|D_i|$ .
Calculate the test statistic $W$ , which is the sum of the signed ranks: $W = \sum_{i=1}^{N} \text{sgn}(D_i) R_i$ Alternatively, calculate the sum of ranks for positive differences ( $T^+$ ) and negative differences ( $T^-$ ). The test statistic is often defined as $T = \min(T^+, T^-)$ .

Under $H_0$ (symmetric distribution about 0), the expected value and variance of $W$ are: $\mathbb{E}[W] = 0$ $\text{Var}(W) = \frac{N(N+1)(2N+1)}{6}$ For large $N$ , $W$ is approximately normally distributed, permitting the use of a $Z$ -test.
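The procedure above can be sketched as follows. The helper `average_ranks` and the function `wilcoxon_signed_rank` are hypothetical names for this illustration, and the $Z$ value uses the variance formula stated above without any correction for tied ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(xs, ys):
    """Return (W, T_plus, T_minus, Z) for paired samples xs, ys."""
    diffs = [d for d in (x - y for x, y in zip(xs, ys)) if d != 0]
    n = len(diffs)                                  # reduced sample size N
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = t_plus - t_minus                            # sum of signed ranks
    var_w = n * (n + 1) * (2 * n + 1) / 6
    z = w / sqrt(var_w) if n > 0 else float("nan")
    return w, t_plus, t_minus, z


w, t_plus, t_minus, z = wilcoxon_signed_rank([10, 12, 9, 15, 11], [8, 13, 9, 11, 8])
print(w, t_plus, t_minus, z)
```

In this example the differences are $2, -1, 0, 4, 3$ ; the zero is discarded, leaving $N = 4$ , $T^+ = 9$ , $T^- = 1$ , and $W = 8$ . A sample this small would normally be assessed against exact tables rather than the normal approximation.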

Mann-Whitney U Test (Wilcoxon Rank-Sum Test)

When comparing two independent samples to determine if they originate from the same population, the Mann-Whitney U test (or Wilcoxon rank-sum test) offers a non-parametric alternative to the independent two-sample t-test. It assumes the two distributions are identical in shape but potentially shifted in location.

Let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent samples.

Combine all $m+n$ observations and rank them from $1$ to $m+n$ .
Compute the sum of the ranks for sample 1 ( $R_1$ ) and sample 2 ( $R_2$ ).
The $U$ statistics are calculated as: $U_1 = R_1 - \frac{m(m+1)}{2}$ $U_2 = R_2 - \frac{n(n+1)}{2}$ Note that $U_1 + U_2 = mn$ . The test statistic is $U = \min(U_1, U_2)$ .

Under the null hypothesis that $X$ and $Y$ have the same distribution, the expectation and variance of $U$ are: $\mathbb{E}[U] = \frac{mn}{2}$ $\text{Var}(U) = \frac{mn(m+n+1)}{12}$ Ties in the data require an adjustment to the variance formula: $\text{Var}(U) = \frac{mn}{12} \left( (m+n+1) - \sum_{i=1}^k \frac{t_i^3 - t_i}{(m+n)(m+n-1)} \right)$ where $k$ is the number of tied groups and $t_i$ is the number of observations in the $i$ -th tied group.
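A minimal sketch of the computation, including the tie-adjusted variance, is given below. The names `average_ranks` and `mann_whitney_u` are assumptions of this illustration, and the returned $Z$ value is the normal approximation without a continuity correction.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(xs, ys):
    """Return (U1, U2, U, Var(U), Z) for two independent samples."""
    m, n = len(xs), len(ys)
    combined = list(xs) + list(ys)
    ranks = average_ranks(combined)
    r1, r2 = sum(ranks[:m]), sum(ranks[m:])
    u1 = r1 - m * (m + 1) / 2
    u2 = r2 - n * (n + 1) / 2
    u = min(u1, u2)
    total = m + n
    # Tie adjustment: groups of size 1 contribute t^3 - t = 0.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values()) / (total * (total - 1))
    var_u = m * n / 12 * ((total + 1) - tie_term)
    z = (u - m * n / 2) / sqrt(var_u) if var_u > 0 else float("nan")
    return u1, u2, u, var_u, z


print(mann_whitney_u([1, 2, 2], [2, 3, 4]))
```

When there are no ties the adjustment term is zero and the variance reduces to $\frac{mn(m+n+1)}{12}$ .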

What condition reduces the power of the Mann-Whitney U test relative to an independent two-sample t-test?

Kruskal-Wallis One-Way Analysis of Variance

The Kruskal-Wallis H test extends the Mann-Whitney U test to more than two independent groups. It is the non-parametric equivalent of the one-way ANOVA, testing whether $k$ independent samples originate from the same distribution.

Given $k$ groups with sample sizes $n_1, n_2, \dots, n_k$ and total observations $N = \sum_{i=1}^k n_i$ :

Rank all $N$ observations jointly from $1$ to $N$ .
Compute the sum of ranks $R_i$ for each group $i$ .
The test statistic $H$ is: $H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1)$

If the null hypothesis is true (all samples come from the same population) and the sample sizes are sufficiently large (typically $n_i \geq 5$ ), $H$ approximately follows a chi-square distribution with $k-1$ degrees of freedom: $H \sim \chi^2_{k-1}$ If the null hypothesis is rejected, post-hoc procedures such as Dunn’s test are used for pairwise comparisons to identify which groups differ.
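The $H$ statistic can be computed directly from the joint ranks. The sketch below uses the hypothetical names `average_ranks` and `kruskal_wallis_h`; it implements the formula as stated, without a correction for ties, and leaves the chi-square p-value to a statistics library.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    total = len(pooled)                      # N
    ranks = average_ranks(pooled)            # joint ranking of all observations
    weighted = 0.0
    start = 0
    for g in groups:
        r_i = sum(ranks[start:start + len(g)])   # rank sum of group i
        weighted += r_i ** 2 / len(g)
        start += len(g)
    return 12.0 / (total * (total + 1)) * weighted - 3 * (total + 1)


h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h)  # compare against the chi-square distribution with k - 1 = 2 degrees of freedom
```

For the three perfectly separated groups in the example, the rank sums are 6, 15, and 24, giving $H = \frac{12}{90}(12 + 75 + 192) - 30 = 7.2$ .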

Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient ( $\rho$ or $r_s$ ) measures the strength and direction of association between two continuous or ordinal variables without assuming linearity. It evaluates the monotonic relationship between two variables, in contrast to Pearson’s correlation, which evaluates linear relationships.

For $n$ pairs of observations $(X_i, Y_i)$ , convert the raw scores to ranks $R(X_i)$ and $R(Y_i)$ . Spearman’s $\rho$ is Pearson’s correlation coefficient applied to the ranks. When all ranks are distinct, it simplifies to: $\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}$ where $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of corresponding observations.

If there are identical values (ties), the simplified formula utilizing $d_i^2$ becomes inaccurate, and the standard Pearson correlation formula must be applied directly to the ranked variables.

$\rho = \frac{\sum_i (R(X_i) - \bar{R}(X))(R(Y_i) - \bar{R}(Y))}{\sqrt{\sum_i (R(X_i) - \bar{R}(X))^2 \sum_i (R(Y_i) - \bar{R}(Y))^2}}$

Values of $\rho$ vary from $-1$ to $+1$ , indicating perfect negative or positive monotonic associations, respectively.
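Both forms of the coefficient can be sketched in a few lines. The names `average_ranks`, `spearman_rho`, and `spearman_rho_no_ties` are assumptions of this illustration; the general version applies Pearson's formula to average ranks and is undefined when either variable is constant.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the ranks; valid with or without ties."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def spearman_rho_no_ties(xs, ys):
    """Shortcut formula based on squared rank differences (no ties)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


xs, ys = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
print(spearman_rho(xs, ys), spearman_rho_no_ties(xs, ys))  # the two forms agree without ties
```

Because only ranks enter the calculation, any strictly increasing transformation of either variable, such as squaring positive values, leaves $\rho$ unchanged.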

Bootstrap and Resampling Methods

Modern computational power enables simulation-based non-parametric approaches, most notably bootstrapping. Introduced by Bradley Efron, bootstrapping relies on random sampling with replacement from the original dataset.

If we possess a sample $X = \{x_1, \dots, x_n\}$ drawn from an unknown distribution $F$ , we construct an empirical distribution function $\hat{F}$ . By drawing repeated samples of size $n$ , with replacement, from $X$ , we generate $B$ bootstrap samples $X^{*1}, X^{*2}, \dots, X^{*B}$ .

For a sample statistic $\hat{\theta} = s(X)$ estimating a parameter $\theta$ , we compute the statistic for each bootstrap sample: $\hat{\theta}^{*b} = s(X^{*b})$ . The distribution of $\hat{\theta}^{*b}$ approximates the sampling distribution of $\hat{\theta}$ , enabling the construction of confidence intervals and hypothesis tests without assuming a parametric form.

The bootstrap standard error is the standard deviation of the bootstrap replicates: $\widehat{\text{SE}}(\hat{\theta}) = \sqrt{\frac{1}{B-1} \sum_{b=1}^B \left( \hat{\theta}^{*b} - \bar{\hat{\theta}}^* \right)^2 }$ where $\bar{\hat{\theta}}^*$ is the mean of the bootstrap estimates. Resampling procedures reduce reliance on asymptotic normality assumptions, making them particularly useful when complex estimators or small sample sizes limit conventional asymptotic theory.
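The standard-error formula translates almost directly into code. In this sketch the function name `bootstrap_se`, the default of 2000 replicates, and the fixed seed are assumptions chosen for illustration; `statistic` can be any function of a sample.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of statistic(sample).

    Draws n_boot resamples of size len(sample) with replacement and returns
    the standard deviation of the replicates together with the replicates.
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    # statistics.stdev divides by B - 1, matching the formula above.
    return statistics.stdev(replicates), replicates


data = [float(i) for i in range(1, 51)]
se_mean, _ = bootstrap_se(data, statistics.mean)
se_median, _ = bootstrap_se(data, statistics.median)
print(se_mean, se_median)
```

For the sample mean the result should be close to the familiar $s / \sqrt{n}$ , which gives a quick sanity check; the same function then works unchanged for statistics such as the median, for which no simple closed-form standard error exists.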

Kernel Density Estimation

Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a continuous random variable. Parametric estimation fits a predetermined family (e.g., normal, gamma) indexed by a few parameters, whereas KDE estimates the density directly from the data.

Let $(x_1, x_2, \dots, x_n)$ be independent and identically distributed samples drawn from some distribution with an unknown density $f$ . The kernel density estimator is: $\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$ where $K$ is the kernel (a non-negative function integrating to one) and $h > 0$ is a smoothing parameter known as the bandwidth. The bandwidth heavily influences the estimator. A small $h$ causes undersmoothing, yielding high variance (spurious fluctuations), whereas a large $h$ causes oversmoothing, yielding high bias (obscuring structural features of the distribution). Standard choices for $K$ include the Gaussian, Epanechnikov, and uniform kernels.
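The estimator can be written out directly from its definition. This sketch assumes a Gaussian kernel and a user-supplied bandwidth; the names `gaussian_kernel` and `kde` are hypothetical, and no automatic bandwidth selection is attempted.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density, a common choice for K."""
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)


def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n h)) * sum K((x - x_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


data = [-1.0, 0.0, 2.0]
for h in (0.2, 0.5, 2.0):
    # Small h gives sharp peaks at the data points; large h flattens them.
    print(h, [round(kde(x, data, h), 3) for x in (-1.0, 0.5, 2.0)])
```

Because each kernel integrates to one and the sum is divided by $n$ , the estimate is itself a valid density for any bandwidth; only its smoothness changes with $h$ .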

KDE vs. Histogram

Histograms and KDEs both estimate data density non-parametrically. Consider a dataset of highly clustered continuous physical measurements. A histogram imposes boundaries at arbitrary bin edges, so a single cluster can be split across adjacent bins. A KDE places a smooth kernel at each observation and requires no fixed bins.
