Statistical Inference

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Whereas probability theory deduces the behavior of a sample given known population parameters, statistical inference deduces the population parameters based on an observed sample.

Formally, we observe a sample $\mathbf{X} = (X_1, X_2, \dots, X_n)$, which we assume is generated from a probability model belonging to a known family of distributions $\mathcal{P} = \lbrace P_\theta : \theta \in \Theta \rbrace$, where $\theta$ is an unknown parameter vector and $\Theta$ is the parameter space. The objective is to estimate $\theta$ or make decisions about it.

Point Estimation

A point estimator $\hat{\theta}$ is any statistic (a function of the data $\mathbf{X}$ that does not depend on any unknown parameters) used to infer the value of an unknown parameter $\theta$ in a statistical model. We denote the estimator as $\hat{\theta}(\mathbf{X})$ and the estimate (the realized value for a specific sample $\mathbf{x}$) as $\hat{\theta}(\mathbf{x})$.

Desirable Properties of Point Estimators

How do we decide if an estimator $\hat{\theta}$ is “good”? We evaluate its statistical properties across all possible samples of size $n$.

1. Unbiasedness: An estimator $\hat{\theta}$ is unbiased for $\theta$ if its expected value over all possible samples equals the true parameter value: $\mathbb{E}_\theta[\hat{\theta}] = \theta \quad \forall \theta \in \Theta$. The bias of an estimator is defined as $\text{Bias}(\hat{\theta}) = \mathbb{E}_\theta[\hat{\theta}] - \theta$. While unbiasedness is intuitively appealing, it is not essential: accepting a small bias can be worthwhile if it substantially reduces the overall estimation error.

2. Mean Squared Error (MSE): A common measure of the quality of an estimator is its Mean Squared Error: $\text{MSE}(\hat{\theta}) = \mathbb{E}_\theta[(\hat{\theta} - \theta)^2]$. Adding and subtracting $\mathbb{E}_\theta[\hat{\theta}]$ inside the square and using the definitions of variance and bias, the MSE can be decomposed into: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + (\text{Bias}(\hat{\theta}))^2$. If an estimator is unbiased, its MSE is exactly its variance.

3. Consistency: An estimator $\hat{\theta}_n$ (the subscript $n$ emphasizes dependence on sample size) is consistent if it converges in probability to the true parameter value as the sample size $n \to \infty$: $\forall \epsilon > 0, \lim_{n \to \infty} P_\theta(|\hat{\theta}_n - \theta| > \epsilon) = 0$. Consistency means that, as data accumulate, the estimator concentrates ever more tightly around the true parameter: the probability of missing it by more than any fixed margin tends to zero.
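The trade-off between bias and variance can be checked by simulation. The sketch below is only an illustration under assumed settings (Normal data, $n = 10$, $\sigma^2 = 4$, and the function name `simulate_variance_estimators` are all choices made here, not part of the text above). It compares two estimators of a variance, one dividing the sum of squares by $n$ (biased) and one dividing by $n - 1$ (unbiased), and reports the Monte Carlo bias, variance, and MSE of each.

```python
import random
import statistics

def simulate_variance_estimators(n=10, sigma2=4.0, reps=20000, seed=0):
    """Monte Carlo bias, variance, and MSE of two estimators of sigma^2."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    estimates = {"divide_by_n": [], "divide_by_n_minus_1": []}
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations
        estimates["divide_by_n"].append(ss / n)
        estimates["divide_by_n_minus_1"].append(ss / (n - 1))

    summary = {}
    for name, est in estimates.items():
        bias = statistics.fmean(est) - sigma2
        var = statistics.pvariance(est)
        mse = statistics.fmean([(e - sigma2) ** 2 for e in est])
        summary[name] = {"bias": bias, "var": var, "mse": mse}
    return summary

summary = simulate_variance_estimators()
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In each row the reported MSE should equal the variance plus the squared bias, up to rounding. Under these assumed settings the biased estimator should show the smaller MSE, which is the situation item 1 alludes to.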

Method of Moments

The Method of Moments (MoM) is one of the oldest methods of deriving point estimators. It is based on equating the sample moments to the population moments, thereby obtaining a system of equations to solve for the unknown parameters.

The $k$-th population moment is a function of the parameter vector $\theta$: $\mu_k(\theta) = \mathbb{E}_\theta[X^k]$. The $k$-th sample moment is calculated from the data: $m_k = \frac{1}{n} \sum_{i=1}^n X_i^k$.

If we have $p$ unknown parameters, $\theta = (\theta_1, \theta_2, \dots, \theta_p)$, we set up a system of $p$ equations: $\mu_j(\theta_1, \dots, \theta_p) = m_j \quad \text{for } j = 1, 2, \dots, p$. Solving this system yields the Method of Moments estimator $\hat{\theta}_{MoM}$.

Consider a sample $X_1, \dots, X_n$ from a continuous Uniform$(0, \theta)$ distribution. What are the expected value $\mathbb{E}[X]$ and the corresponding Method of Moments estimator for $\theta$?
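For the Uniform$(0, \theta)$ distribution, $\mathbb{E}[X] = \theta/2$, so equating the first population moment to the sample mean gives $\hat{\theta}_{MoM} = 2\bar{X}$. A minimal sketch of this estimator follows; the function name `mom_uniform_theta` and the value $\theta = 5$ are illustrative assumptions.

```python
import random

def mom_uniform_theta(xs):
    """Method of Moments estimate for Uniform(0, theta): solve theta / 2 = sample mean."""
    return 2.0 * sum(xs) / len(xs)

rng = random.Random(42)
theta_true = 5.0  # illustrative value, not taken from the text
sample = [rng.uniform(0.0, theta_true) for _ in range(1000)]
theta_hat = mom_uniform_theta(sample)
print(f"MoM estimate: {theta_hat:.3f} (true theta = {theta_true})")

# The MoM estimate can fall below the largest observation,
# which is impossible for the true theta.
awkward_sample = [1.0, 1.0, 1.0, 9.0]
print(mom_uniform_theta(awkward_sample), max(awkward_sample))
```

The second call shows a known weakness of this estimator: nothing forces $2\bar{X}$ to be at least as large as every observed value.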

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a formal, unified approach to parameter estimation. It frames estimation as finding the parameter value that makes the observed data “most probable” or “most likely” to have occurred.

Let $f(x \mid \theta)$ be the probability density function (PDF) or probability mass function (PMF) of our distribution. Given an observed sample $\mathbf{x} = (x_1, \dots, x_n)$ of independent and identically distributed (i.i.d.) random variables, the likelihood function is the joint density evaluated at the observed data, viewed as a function of the parameter $\theta$: $L(\theta \mid \mathbf{x}) = \prod_{i=1}^n f(x_i \mid \theta)$.

The Maximum Likelihood Estimator $\hat{\theta}_{MLE}$ is the value $\theta \in \Theta$ that maximizes $L(\theta \mid \mathbf{x})$. Because the natural logarithm is strictly increasing, maximizing the log-likelihood yields the same maximizer, and the log-likelihood is usually easier to work with both analytically and numerically: $\ell(\theta \mid \mathbf{x}) = \ln L(\theta \mid \mathbf{x}) = \sum_{i=1}^n \ln f(x_i \mid \theta)$.

Assuming standard regularity conditions (e.g., differentiability with respect to $\theta$ and the support of the distribution not depending on $\theta$), the MLE can be found by solving the score equation $\frac{\partial}{\partial \theta} \ell(\theta \mid \mathbf{x}) = 0$ and verifying that the second derivative is negative at the solution (local concavity), so that the solution is a maximum rather than a minimum or saddle point.
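As an illustration of solving a score equation, take an i.i.d. Exponential sample with rate $\lambda$ (an example chosen here, not one from the text). The log-likelihood is $n \ln \lambda - \lambda \sum x_i$, the score is $n/\lambda - \sum x_i$, and the second derivative $-n/\lambda^2$ is negative, so the root $\hat{\lambda} = n / \sum x_i$ is a maximum. The sketch below finds the root numerically by bisection and compares it with the closed form; the helper names are hypothetical.

```python
import random

def exp_score(lam, xs):
    """Score of an i.i.d. Exponential(rate=lam) sample: derivative of the log-likelihood."""
    return len(xs) / lam - sum(xs)

def solve_score_bisection(xs, lo=1e-6, hi=1e6, tol=1e-10):
    """Find the root of the score by bisection; the score is strictly decreasing in lam."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if exp_score(mid, xs) > 0:
            lo = mid  # score still positive: the root lies to the right
        else:
            hi = mid  # score non-positive: the root lies to the left (or at mid)
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(500)]  # assumed true rate of 2.0
lam_numeric = solve_score_bisection(data)
lam_closed = len(data) / sum(data)
print(f"numeric root: {lam_numeric:.6f}, closed form: {lam_closed:.6f}")
```

Bisection is used here only because it is simple and robust for a monotone score; Newton-type iterations are a common alternative when derivatives are cheap.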

Properties of the MLE

Under mild regularity conditions, the MLE has several attractive properties, most of them asymptotic:

Consistency: $\hat{\theta}_{MLE} \xrightarrow{p} \theta$.
Equivariance: If $g(\theta)$ is a function of $\theta$, then the MLE of $g(\theta)$ is $g(\hat{\theta}_{MLE})$.
Asymptotic Normality and Efficiency: The distribution of the MLE approaches a Normal distribution as $n \to \infty$: $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$, where $I(\theta)$ is the Fisher Information of a single observation. Its asymptotic variance is the smallest attainable among regular estimators; that is, the MLE achieves the Cramér-Rao lower bound asymptotically.
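The asymptotic normality statement can be checked by simulation. The sketch below assumes an Exponential model with rate $\lambda = 2$ (an illustrative choice), for which the MLE is $\hat{\lambda} = n / \sum X_i$ and $I(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat{\lambda} - \lambda)$ should have variance close to $\lambda^2 = 4$ for large $n$.

```python
import random
import statistics

def scaled_mle_errors(lam=2.0, n=200, reps=2000, seed=0):
    """Simulate sqrt(n) * (lam_hat - lam) for the Exponential-rate MLE."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        total = sum(rng.expovariate(lam) for _ in range(n))
        lam_hat = n / total  # MLE of the rate
        errors.append((n ** 0.5) * (lam_hat - lam))
    return errors

errors = scaled_mle_errors()
print(f"mean of scaled errors:     {statistics.fmean(errors):.3f}")
print(f"variance of scaled errors: {statistics.variance(errors):.3f}")
print("asymptotic variance 1/I(lam) = lam^2 = 4.0")
```

The simulated mean is not exactly zero because this MLE carries a small finite-sample bias, which shrinks as $n$ grows.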

MLE vs MoM on the Uniform Distribution

We have an i.i.d. sample $X_1, X_2, \dots, X_n$ from $U(0, \theta)$. We previously saw that the MoM estimator is $\hat{\theta}_{MoM} = 2\bar{X}$. Now, let's derive the MLE.

Determine the MLE $\hat{\theta}_{MLE}$ and compare it to the MoM estimator.
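Here the support depends on $\theta$, so the score equation does not apply. The likelihood is $L(\theta \mid \mathbf{x}) = \theta^{-n}$ for $\theta \ge \max_i x_i$ and zero otherwise. It is decreasing in $\theta$ on the admissible range, so it is maximized at the smallest admissible value, $\hat{\theta}_{MLE} = \max_i X_i$. The following simulation sketch compares the two estimators by MSE; the settings $\theta = 1$ and $n = 20$ are assumptions made for illustration.

```python
import random

def compare_uniform_estimators(theta=1.0, n=20, reps=20000, seed=0):
    """Monte Carlo MSE of the MoM (2 * mean) and MLE (max) estimators for Uniform(0, theta)."""
    rng = random.Random(seed)
    sq_err_mom = 0.0
    sq_err_mle = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        mom = 2.0 * sum(xs) / n
        mle = max(xs)
        sq_err_mom += (mom - theta) ** 2
        sq_err_mle += (mle - theta) ** 2
    return sq_err_mom / reps, sq_err_mle / reps

mse_mom, mse_mle = compare_uniform_estimators()
print(f"MSE of MoM estimator: {mse_mom:.5f}")
print(f"MSE of MLE:           {mse_mle:.5f}")
```

The MLE is biased downward, since $\mathbb{E}[\max_i X_i] = n\theta/(n+1)$, yet its MSE, $2\theta^2/((n+1)(n+2))$, is smaller than the MoM estimator's $\theta^2/(3n)$ for all but the smallest sample sizes.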

Sufficiency

A statistic $T(\mathbf{X})$ is sufficient for $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\theta$. Intuitively, $T(\mathbf{X})$ contains all the information in the sample about $\theta$; once $T(\mathbf{X})$ is known, no other function of the data provides further information about $\theta$.

Proving sufficiency directly via conditional probabilities can be tedious. Instead, we use the Fisher-Neyman Factorization Theorem: A statistic $T(\mathbf{X})$ is sufficient for $\theta$ if and only if the joint PDF (or PMF) of the sample can be factored into two components: $f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}) \mid \theta) \cdot h(\mathbf{x})$, where $h(\mathbf{x})$ is a non-negative function that depends only on the data, and $g(T(\mathbf{x}) \mid \theta)$ is a non-negative function that depends on the parameter $\theta$ and on the data $\mathbf{x}$ only through the statistic $T(\mathbf{x})$.
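As a small worked case (an example chosen here, not taken from the text), consider i.i.d. Bernoulli$(p)$ observations. The joint PMF is $p^{\sum x_i}(1-p)^{n - \sum x_i}$, which has the factorized form with $T(\mathbf{x}) = \sum x_i$, $g(t \mid p) = p^t (1-p)^{n-t}$, and $h(\mathbf{x}) = 1$. The sketch below checks numerically that two different samples with the same sum have the same likelihood for every $p$ tried.

```python
import math

def bernoulli_joint_pmf(xs, p):
    """Joint PMF of an i.i.d. Bernoulli(p) sample, multiplied out term by term."""
    prob = 1.0
    for x in xs:
        prob *= p if x == 1 else (1.0 - p)
    return prob

def g(t, n, p):
    """Factor that depends on the data only through t = sum(xs); here h(x) = 1."""
    return p ** t * (1.0 - p) ** (n - t)

sample_a = [1, 1, 0, 0, 1]
sample_b = [0, 1, 1, 1, 0]  # different ordering, same sum
ps = (0.1, 0.3, 0.5, 0.9)

same_likelihood = all(
    math.isclose(bernoulli_joint_pmf(sample_a, p), bernoulli_joint_pmf(sample_b, p))
    for p in ps
)
matches_factor = all(
    math.isclose(bernoulli_joint_pmf(sample_a, p), g(sum(sample_a), len(sample_a), p))
    for p in ps
)
print(same_likelihood, matches_factor)
```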

The Rao-Blackwell Theorem

Sufficiency plays a vital role in optimal estimation. The Rao-Blackwell theorem formalizes this: if you have an unbiased estimator $\hat{\theta}$ and a sufficient statistic $T$, the conditional expectation $\mathbb{E}[\hat{\theta} \mid T]$ defines a new estimator that is also unbiased and has a variance less than or equal to the variance of the original estimator $\hat{\theta}$. Consequently, the search for optimal unbiased estimators can be restricted to functions of a sufficient statistic.
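A simulation can make the variance reduction concrete. In the sketch below (a Bernoulli example assumed for illustration), the crude unbiased estimator of $p$ is the first observation $X_1$, the sufficient statistic is $T = \sum X_i$, and conditioning gives $\mathbb{E}[X_1 \mid T] = T/n$, the sample mean.

```python
import random
import statistics

def rao_blackwell_demo(p=0.3, n=25, reps=20000, seed=0):
    """Compare a crude unbiased estimator of p with its Rao-Blackwellized version."""
    rng = random.Random(seed)
    crude, improved = [], []
    for _ in range(reps):
        xs = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
        crude.append(xs[0])            # unbiased, but ignores all but one observation
        improved.append(sum(xs) / n)   # E[X_1 | T] with T = sum(xs)
    return {
        "crude_mean": statistics.fmean(crude),
        "crude_var": statistics.pvariance(crude),
        "improved_mean": statistics.fmean(improved),
        "improved_var": statistics.pvariance(improved),
    }

result = rao_blackwell_demo()
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

Both estimators should average close to $p = 0.3$, while the variance should drop from about $p(1-p) = 0.21$ to about $p(1-p)/n = 0.0084$.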

The Cramér-Rao Lower Bound (CRLB)

When developing estimators, we would like to know the smallest variance an unbiased estimator can possibly achieve. Is there a limit beyond which no unbiased estimator can improve?

Yes. Under regularity conditions (primarily that the parameter space is an open interval and the support does not depend on $\theta$), the Cramér-Rao Lower Bound places a theoretical lower limit on the variance of any unbiased estimator $W(\mathbf{X})$ of a function $\tau(\theta)$ of the parameter, based on an i.i.d. sample of size $n$: $\text{Var}_\theta(W(\mathbf{X})) \ge \frac{[\tau'(\theta)]^2}{n I(\theta)}$, where $I(\theta)$ is the Fisher Information of a single observation, defined as: $I(\theta) = \mathbb{E}_\theta \left[ \left( \frac{\partial}{\partial \theta} \ln f(X \mid \theta) \right)^2 \right] = -\mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta^2} \ln f(X \mid \theta) \right]$.

If the variance of an unbiased estimator equals the Cramér-Rao lower bound for every $\theta$, the estimator is called efficient, and it is then necessarily the Uniformly Minimum Variance Unbiased Estimator (UMVUE). As noted earlier, Maximum Likelihood Estimators achieve this lower bound asymptotically, which helps explain their prevalence in modern statistics.
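For a concrete instance, take the Bernoulli$(p)$ model (an illustrative choice). The score is $1/p$ when $x = 1$ and $-1/(1-p)$ when $x = 0$, so $I(p) = 1/(p(1-p))$ and the bound for unbiased estimators of $p$ is $p(1-p)/n$. That is exactly the variance of the sample mean, which is therefore efficient. The sketch below computes the Fisher Information directly from its definition as an expected squared score.

```python
import math

def bernoulli_fisher_information(p):
    """I(p) = E[(d/dp ln f(X | p))^2], computed exactly by summing over x in {0, 1}."""
    score_at_1 = 1.0 / p            # d/dp ln p
    score_at_0 = -1.0 / (1.0 - p)   # d/dp ln(1 - p)
    return p * score_at_1 ** 2 + (1.0 - p) * score_at_0 ** 2

def crlb_for_p(p, n):
    """Cramér-Rao bound for unbiased estimators of p (tau(p) = p, so tau'(p) = 1)."""
    return 1.0 / (n * bernoulli_fisher_information(p))

p, n = 0.3, 50  # illustrative values
bound = crlb_for_p(p, n)
var_sample_mean = p * (1.0 - p) / n
print(f"CRLB: {bound:.6f}, Var(sample mean): {var_sample_mean:.6f}")
print("sample mean attains the bound:", math.isclose(bound, var_sample_mean))
```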

Confidence Interval Construction

While point estimators output a single best guess for a parameter ($\hat{\theta}$), interval estimators yield a range of plausible values, constructed so that the random interval covers the true parameter $\theta$ with a specified probability $1-\alpha$, referred to as the confidence level.

Formally, a $1-\alpha$ confidence interval for $\theta$ is defined by two random variables $L(\mathbf{X})$ and $U(\mathbf{X})$ such that: $P_\theta\left( L(\mathbf{X}) \le \theta \le U(\mathbf{X}) \right) \ge 1 - \alpha \quad \forall \theta \in \Theta$

The Pivot Method

The most common technique to systematically derive confidence intervals relies on finding a pivotal quantity (or “pivot”). A random variable $Q(\mathbf{X}; \theta)$ is a pivot if:

It is a function of the sample $\mathbf{X}$ and the unknown parameter $\theta$ .
The probability distribution of $Q(\mathbf{X}; \theta)$ does not depend on $\theta$ or on any other unknown parameters.

If a pivot exists, constructing an interval estimator proceeds straightforwardly by finding constants $q_{\alpha/2}$ and $q_{1-\alpha/2}$ from the known distribution of $Q$ such that: $P\left( q_{\alpha/2} \le Q(\mathbf{X}; \theta) \le q_{1-\alpha/2} \right) = 1 - \alpha$. We then algebraically invert the inequalities inside the probability statement to isolate $\theta$ in the center: $P\left( L(\mathbf{X}) \le \theta \le U(\mathbf{X}) \right) = 1 - \alpha$.

Deriving a Normal Confidence Interval via a Pivot

We have a sample $X_1, \dots, X_n$ from a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ where the variance $\sigma^2$ is known, and we must determine a $1-\alpha$ confidence interval for the mean $\mu$.

Define a suitable pivotal quantity and construct the confidence interval.
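A suitable pivot is $Q = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$, which follows a standard Normal distribution whatever the value of $\mu$. Inverting $-z \le Q \le z$, with $z$ the $1 - \alpha/2$ quantile of the standard Normal, gives the interval $\bar{X} \pm z \, \sigma / \sqrt{n}$. A minimal sketch follows; the function name `normal_mean_ci` and the data values are made up for illustration.

```python
import math
from statistics import NormalDist, fmean

def normal_mean_ci(xs, sigma, alpha=0.05):
    """(1 - alpha) confidence interval for a Normal mean when sigma is known."""
    n = len(xs)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard Normal quantile
    half_width = z * sigma / math.sqrt(n)
    xbar = fmean(xs)
    return xbar - half_width, xbar + half_width

observations = [4.8, 5.1, 5.3, 4.9, 5.4]  # hypothetical measurements
lower, upper = normal_mean_ci(observations, sigma=0.5)
print(f"95% CI for mu: [{lower:.3f}, {upper:.3f}]")
```

Because $\sigma$ is known, the interval's width is fixed in advance; only its center $\bar{X}$ varies from sample to sample.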

Asymptotic Confidence Intervals

When finite-sample pivot methods are intractable, statisticians leverage the asymptotic distribution of the Maximum Likelihood Estimator to construct approximate confidence regions. Since $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ and the unknown $I(\theta)$ can be consistently estimated by $I(\hat{\theta}_{MLE})$, the Fisher Information evaluated at the MLE, we use the asymptotic standard error: $SE(\hat{\theta}_{MLE}) \approx \frac{1}{\sqrt{n \cdot I(\hat{\theta}_{MLE})}}$. This yields the standard large-sample Wald confidence interval of the form: $\hat{\theta}_{MLE} \pm z_{\alpha/2} \cdot SE(\hat{\theta}_{MLE})$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard Normal distribution.
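As a sketch of the Wald construction, consider again an Exponential model with rate $\lambda$ (an assumed example). The MLE is $\hat{\lambda} = n / \sum X_i$ and the per-observation Fisher Information is $1/\lambda^2$, so plugging in the MLE gives the standard error $\hat{\lambda} / \sqrt{n}$.

```python
import math
import random
from statistics import NormalDist

def wald_ci_exponential_rate(xs, alpha=0.05):
    """Approximate (1 - alpha) Wald interval for the rate of an Exponential sample."""
    n = len(xs)
    lam_hat = n / sum(xs)                  # MLE of the rate
    fisher_info = 1.0 / lam_hat ** 2       # I(lambda) = 1 / lambda^2, evaluated at the MLE
    se = 1.0 / math.sqrt(n * fisher_info)  # equals lam_hat / sqrt(n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return lam_hat, (lam_hat - z * se, lam_hat + z * se)

rng = random.Random(7)
data = [rng.expovariate(2.0) for _ in range(500)]  # assumed true rate of 2.0
lam_hat, (lower, upper) = wald_ci_exponential_rate(data)
print(f"MLE: {lam_hat:.3f}, 95% Wald CI: [{lower:.3f}, {upper:.3f}]")
```

The interval is symmetric around the MLE by construction, which is one reason it can perform poorly in small samples or near the boundary of the parameter space.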

Why can it be problematic to state 'There is a 95% probability that the true parameter lies between 4.2 and 5.8' after substituting data to calculate a 95% confidence interval [4.2, 5.8] from a given dataset?
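In the frequentist framework used here, $\theta$ is a fixed number and the interval endpoints are the random quantities. The 95% describes how often the procedure produces an interval that covers $\theta$ across repeated samples. Once the data are substituted, the realized interval $[4.2, 5.8]$ either contains $\theta$ or it does not. The simulation sketch below illustrates this reading for the known-variance Normal interval, under assumed values $\mu = 5$, $\sigma = 2$, and $n = 30$.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(mu=5.0, sigma=2.0, n=30, alpha=0.05, reps=5000, seed=0):
    """Fraction of repeated samples whose interval covers the fixed true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Each realized interval either covers mu or misses it.
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"empirical coverage over repeated samples: {rate:.3f}")
```

The empirical coverage should land close to 0.95, even though every individual interval is simply a hit or a miss.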

Summary of Estimator Selection

Choosing an estimator in practice means weighing several considerations:

Can we find an exact pivot for an interval, or must we rely on large sample sizes and Wald intervals?
If the MLE is biased in finite samples, does that bias vanish quickly enough as the sample size grows?
In complicated models where the MLE has no closed form, the Method of Moments estimator can serve as a starting value for numerical maximization of the likelihood function.
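The last point can be sketched with a Gamma model, whose shape parameter has no closed-form MLE. Everything specific in this sketch is an assumption made for illustration: the Gamma example itself, the function names, the bracket of a factor 4 around the MoM guess, and the use of ternary search as the numerical maximizer. For a fixed shape $k$ the likelihood is maximized by the scale $\bar{x}/k$, which leaves a one-dimensional concave function of $k$ to maximize.

```python
import math
import random

def gamma_profile_loglik(k, mean_x, mean_log_x):
    """Per-observation Gamma log-likelihood in the shape k, with scale set to mean_x / k."""
    scale = mean_x / k
    return (k - 1.0) * mean_log_x - k - k * math.log(scale) - math.lgamma(k)

def gamma_shape_mle(xs, iters=100):
    """Return (MoM shape, MLE shape, MLE scale) for a Gamma sample."""
    n = len(xs)
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    mean_log_x = sum(math.log(x) for x in xs) / n

    k_mom = mean_x ** 2 / var_x          # MoM: mean = k * s and variance = k * s^2
    lo, hi = k_mom / 4.0, k_mom * 4.0    # bracket around the MoM guess (assumed to contain the MLE)
    for _ in range(iters):               # ternary search on a concave function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if gamma_profile_loglik(m1, mean_x, mean_log_x) < gamma_profile_loglik(m2, mean_x, mean_log_x):
            lo = m1
        else:
            hi = m2
    k_hat = 0.5 * (lo + hi)
    return k_mom, k_hat, mean_x / k_hat

rng = random.Random(3)
data = [rng.gammavariate(3.0, 2.0) for _ in range(2000)]  # assumed shape 3, scale 2
k_mom, k_hat, scale_hat = gamma_shape_mle(data)
print(f"MoM shape: {k_mom:.3f}, MLE shape: {k_hat:.3f}, MLE scale: {scale_hat:.3f}")
```

In practice a general-purpose optimizer would usually replace the hand-written search; the point is only that the MoM estimate supplies a sensible region in which to start.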

Statistical inference provides the foundation for drawing meaningful, mathematically rigorous conclusions from noisy data under uncertainty.