Bayesian Statistics

Frequentist statistics interprets probability strictly as the long-run expected frequency of repeatable events. Bayesian statistics interprets probability fundamentally differently: as a degree of belief or a quantification of uncertainty. The Bayesian paradigm provides a rigorous mathematical framework for evaluating and updating our state of knowledge as new data becomes available.

The Foundation: Bayes’ Theorem

The core operating principle of Bayesian inference is Bayes’ Theorem, a mathematical identity derived from the definition of conditional probability:

$P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}$

Where:

$P(H|D)$ (Posterior): The probability of the hypothesis $H$ after observing data $D$. This represents the updated state of belief.
$P(D|H)$ (Likelihood): The probability of observing the data $D$ assuming the hypothesis $H$ is true. This quantifies how well $H$ accounts for the observed data.
$P(H)$ (Prior): The initial degree of belief in the hypothesis $H$ before observing the data $D$.
$P(D)$ (Evidence or Marginal Likelihood): The total probability of observing the data across all possible hypotheses. It acts as a normalizing constant to ensure the posterior is a valid probability distribution: $P(D) = \int P(D|H)P(H)\,dH$ for a continuous hypothesis space, or $P(D) = \sum_H P(D|H)P(H)$ for a discrete one.

Because the denominator $P(D)$ does not depend on $H$, Bayes’ theorem is often written as a proportionality:

$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$
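
The proportional form can be sketched for a discrete set of hypotheses: multiply likelihood by prior, then divide by the sum so the result is a valid distribution. The function name `bayes_update` and the two-hypothesis coin example are illustrative assumptions, not part of the original text.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over a discrete set of hypotheses.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(D | H) for the observed data D
    """
    # Unnormalized posterior: likelihood times prior for each hypothesis.
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    # Evidence P(D): the sum over hypotheses (discrete analogue of the integral).
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical example: is a coin fair, or biased toward heads with P(heads) = 0.8?
prior = {"fair": 0.5, "biased": 0.5}
# Observed data D: a single flip that landed heads.
likelihood = {"fair": 0.5, "biased": 0.8}

posterior = bayes_update(prior, likelihood)
print(posterior)
```

A single head shifts belief toward the biased hypothesis, but only modestly, because one flip is weak evidence.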

Frequentist vs. Bayesian Comparison

The differences between the two schools of thought run deep, impacting how inference is conducted and interpreted.

Parameters: In frequentist statistics, parameters (like the true mean $\mu$ of a population) are fixed but unknown constants. In Bayesian statistics, parameters are treated as random variables described by probability distributions.
Data: Frequentists view the observed data as one possible realization from an infinite sequence of hypothetical repetitions. Bayesians treat the observed data as fixed and use it to calculate the probability of the parameter taking on various values.
Confidence Intervals vs. Credible Intervals: A frequentist 95% confidence interval means that if the experiment were repeated infinitely, 95% of the constructed intervals would contain the fixed parameter. A Bayesian 95% credible interval directly means there is a 95% probability that the parameter lies within that interval, given the observed data and prior belief.
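
As a rough illustration of the credible-interval reading, the sketch below approximates an equal-tailed 95% interval by sorting draws from a posterior and reading off the 2.5th and 97.5th percentiles. The Beta(8, 4) posterior is a hypothetical stand-in (it would arise, for example, from a flat prior after 7 successes in 10 trials), and the sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical posterior for a success probability: Beta(8, 4).
draws = sorted(rng.betavariate(8, 4) for _ in range(100_000))

# Equal-tailed 95% credible interval: cut 2.5% of the draws from each tail.
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Under this posterior, the statement "the parameter lies between `lower` and `upper` with 95% probability" is a direct probability claim about the parameter, which a confidence interval does not license.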

The Role and Selection of Priors

The choice of the prior distribution $P(H)$ is a critical and sometimes criticized aspect of Bayesian analysis. Priors encode expert knowledge and initial assumptions.

Informative vs. Uninformative Priors

An informative prior asserts specific, strong beliefs about the parameter space. For example, if measuring human height, a prior tightly clustered around $1.7$ meters is highly informative. An uninformative (or diffuse) prior spreads probability mass across the parameter space, attempting to let the data “speak for itself.” A uniform distribution is a common example, though true non-informativeness is mathematically subtle.

Conjugate Priors

A prior is conjugate to a specific likelihood function if the resulting posterior distribution belongs to the same probability family as the prior. Conjugacy provides immense mathematical convenience because the posterior can be derived algebraically without complex numerical integration.

Examples of natural conjugate pairs include:

Beta Prior & Binomial Likelihood $\rightarrow$ Beta Posterior. (Used for probabilities and proportions).
Normal Prior & Normal Likelihood (known variance) $\rightarrow$ Normal Posterior. (Used for continuous mean estimation).
Gamma Prior & Poisson Likelihood $\rightarrow$ Gamma Posterior. (Used for rate parameter estimation).

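Consider the Beta-Binomial model. If the prior for the probability of success $\theta$ is $\text{Beta}(\alpha, \beta)$ and the newly observed data $D$ consist of $y$ successes and $n-y$ failures in $n$ trials, the posterior is simply:

$\theta \mid y \sim \text{Beta}(\alpha + y, \beta + n - y)$

In effect, $\alpha$ and $\beta$ act as pseudo-counts of prior successes and failures, to which the observed counts are added.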
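
The update rule amounts to adding counts to the prior parameters, which a short sketch makes concrete. The helper names `beta_binomial_update` and `beta_mean` and the numbers used are illustrative assumptions.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Return the Beta posterior parameters after observing binomial data."""
    failures = trials - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical numbers: a uniform Beta(1, 1) prior and 7 successes in 10 trials.
post_alpha, post_beta = beta_binomial_update(1, 1, successes=7, trials=10)
print(post_alpha, post_beta)             # posterior is Beta(8, 4)
print(beta_mean(post_alpha, post_beta))  # posterior mean, 8 / 12
```

No integration is needed: the posterior is available in closed form, which is the practical payoff of conjugacy.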

Jeffreys Prior

When seeking an uninformative prior, a flat uniform distribution can be problematic because it is not invariant under parameter transformations (e.g., a uniform prior on the standard deviation $\sigma$ is not uniform on the variance $\sigma^2$). The Jeffreys Prior addresses this by deriving the prior directly from the Fisher Information $I(\theta)$ of the likelihood function:

$P(\theta) \propto \sqrt{\det(I(\theta))}$

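This makes the prior invariant under reparameterization: applying the same rule to any smooth one-to-one transformation of $\theta$ yields exactly the prior obtained by transforming $P(\theta)$ with the usual change-of-variables formula.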
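
As a worked example, consider a single Bernoulli trial with success probability $\theta$ (a standard case chosen here for illustration). The log-likelihood of an outcome $x \in \{0, 1\}$ is:

$\log P(x|\theta) = x \log\theta + (1 - x)\log(1 - \theta)$

Taking the negative expected second derivative, with $E[x] = \theta$, gives the Fisher Information:

$I(\theta) = E\left[\frac{x}{\theta^2} + \frac{1 - x}{(1 - \theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)}$

The Jeffreys Prior is therefore:

$P(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1 - \theta)^{-1/2}$

This is the kernel of a $\text{Beta}(1/2, 1/2)$ distribution, so in this case the Jeffreys Prior is also conjugate to the binomial likelihood.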

Computational Bayesian Inference: MCMC and Gibbs Sampling

Historically, the difficulty of computing the normalizing constant $P(D)$ analytically largely restricted Bayesian methods to conjugate or low-dimensional models. The advent of modern computing and Markov Chain Monte Carlo (MCMC) algorithms changed this, making inference practical for a far wider class of models.

Markov Chain Monte Carlo

MCMC algorithms do not attempt to calculate the posterior distribution analytically. Instead, they generate a large number of correlated samples whose distribution converges to the posterior. By summarizing these samples (e.g., taking their mean, variance, or percentiles), we can estimate the properties of the posterior distribution.

The algorithm constructs a Markov Chain—a sequence of states where the next state depends only on the current state—designed such that its stationary distribution is exactly the target posterior distribution.
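
One simple way to build such a chain is random-walk Metropolis, sketched below. The text does not name a specific algorithm at this point, so the choice of Metropolis is an assumption, as are the target (a coin bias with a flat prior after 7 heads and 3 tails), the step size, the burn-in length, and all function names. Only the unnormalized posterior is evaluated, so $P(D)$ is never computed.

```python
import math
import random


def log_unnormalized_posterior(theta, successes=7, failures=3):
    """Log of likelihood times a flat prior for a coin bias theta.

    The normalizing constant P(D) is deliberately left out.
    """
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples, step=0.1, seed=0):
    """Random-walk Metropolis: each new state depends only on the current one."""
    rng = random.Random(seed)
    current = start
    current_logp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_logp - current_logp))
        if rng.random() < accept_prob:
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


samples = metropolis(log_unnormalized_posterior, start=0.5, n_samples=20000)
kept = samples[2000:]  # discard early draws as burn-in
print(sum(kept) / len(kept))  # should land near the exact posterior mean 8/12
```

Because this target is a Beta(8, 4) distribution, the sample mean can be checked against the closed-form answer, which is a useful sanity test before applying the same sampler to a model without one.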

Gibbs Sampling

A specialized and highly effective MCMC algorithm for multi-dimensional parameter spaces is Gibbs Sampling. Instead of trying to update all parameters $\theta_1, \theta_2, \ldots, \theta_k$ simultaneously, Gibbs sampling updates one parameter at a time by sampling from its conditional distribution, keeping all other parameters fixed at their current values.

Let $\theta = (\theta_1, \theta_2, \theta_3)$. One Gibbs iteration, moving from draw $i$ to draw $i+1$, involves:

Sample $\theta_1^{(i+1)}$ from $P(\theta_1 | \theta_2^{(i)}, \theta_3^{(i)}, D)$
Sample $\theta_2^{(i+1)}$ from $P(\theta_2 | \theta_1^{(i+1)}, \theta_3^{(i)}, D)$
Sample $\theta_3^{(i+1)}$ from $P(\theta_3 | \theta_1^{(i+1)}, \theta_2^{(i+1)}, D)$

This iterative process vastly simplifies the sampling problem because the one-dimensional conditional distributions are often well-known and easy to sample from, even when the joint multidimensional posterior is too complex to sample from directly.
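
The alternating updates can be sketched on a two-parameter toy target: a standard bivariate normal with correlation $\rho$, standing in for a posterior. For this target each full conditional is itself normal, $\theta_1 | \theta_2 \sim N(\rho\,\theta_2, 1 - \rho^2)$, and symmetrically for $\theta_2$. The target, the value $\rho = 0.8$, the burn-in length, and the function names are assumptions made for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)  # std. dev. of each full conditional
    theta1, theta2 = 0.0, 0.0            # arbitrary starting state
    draws = []
    for _ in range(n_samples):
        # Update theta1 from P(theta1 | theta2), holding theta2 fixed.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        # Update theta2 from P(theta2 | theta1), using the fresh theta1.
        theta2 = rng.gauss(rho * theta1, cond_sd)
        draws.append((theta1, theta2))
    return draws


def sample_correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / n
    var1 = sum((a - mean1) ** 2 for a, _ in pairs) / n
    var2 = sum((b - mean2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var1 * var2)


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
kept = draws[1000:]  # discard early draws as burn-in
print(round(sample_correlation(kept), 2))  # should be close to rho
```

Each update needs only a one-dimensional normal draw, yet the collected pairs recover the joint correlation structure.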

The Medical Test Paradox

You are a doctor administering a test for a rare genetic marker present in 0.1% of the population ($P(\text{Marker}) = 0.001$). The test's sensitivity (true positive rate) is 99% ($P(\text{Positive}|\text{Marker}) = 0.99$). The test's specificity (true negative rate) is 98%, meaning the false positive rate is 2% ($P(\text{Positive}|\text{No Marker}) = 0.02$). A patient receives a positive test result. The patient immediately asks: 'What is the probability I actually have the marker?'

Calculate the Posterior probability that the patient has the genetic marker given the positive result.
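
Applying Bayes' Theorem with the figures stated above, and expanding the evidence over the two cases (marker present, marker absent), gives:

$P(\text{Marker}|\text{Positive}) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.02 \times 0.999} = \frac{0.00099}{0.02097} \approx 0.047$

The same arithmetic as a minimal sketch (the function name `posterior_given_positive` is an illustrative choice):

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' Theorem."""
    # Joint probability of having the marker and testing positive.
    true_positive = sensitivity * prevalence
    # Joint probability of lacking the marker and still testing positive.
    false_positive = false_positive_rate * (1.0 - prevalence)
    # Evidence P(Positive): total probability of a positive result.
    evidence = true_positive + false_positive
    return true_positive / evidence


p = posterior_given_positive(prevalence=0.001, sensitivity=0.99,
                             false_positive_rate=0.02)
print(f"{p:.4f}")  # 0.0472
```

Despite the accurate test, the posterior is under 5%: the marker is so rare that false positives from the unaffected 99.9% of the population far outnumber true positives.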

Implementation: Bayesian Continuous Updating

The Beta prior's conjugacy with the binomial likelihood makes it a natural model for sequentially updating beliefs about a coin’s hidden bias $\theta$: the posterior from one experiment becomes the prior for the next.
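
The original listing for this section is not available here, so the following is a minimal sketch written under stated assumptions rather than a reconstruction of it: a simulated coin with a hypothetical true bias of 0.7, a uniform Beta(1, 1) starting prior, and five batches of 20 flips. The function name `update` and all numeric settings are illustrative.

```python
import random


def update(alpha, beta, flips):
    """Conjugate Beta update for a batch of flips (1 = heads, 0 = tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


rng = random.Random(42)   # fixed seed so the run is repeatable
true_bias = 0.7           # hypothetical hidden parameter, used only to simulate data
alpha, beta = 1.0, 1.0    # uniform Beta(1, 1) prior
all_flips = []

for experiment in range(1, 6):
    flips = [1 if rng.random() < true_bias else 0 for _ in range(20)]
    all_flips.extend(flips)
    # The posterior from this batch becomes the prior for the next batch.
    alpha, beta = update(alpha, beta, flips)
    mean = alpha / (alpha + beta)
    print(f"after experiment {experiment}: "
          f"Beta({alpha:.0f}, {beta:.0f}), mean = {mean:.3f}")

# Updating batch by batch matches a single update on all the data at once.
print(update(1.0, 1.0, all_flips) == (alpha, beta))
```

The final check illustrates that the order and grouping of the data do not matter: sequential updating and one-shot updating yield the same posterior.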

Exercises

In the context of Bayesian statistics, what is the defining characteristic of a Conjugate Prior?

How does Gibbs Sampling simplify the process of evaluating a complex, high-dimensional posterior distribution?

Which interpretation correctly identifies a key difference between frequentist Confidence Intervals and Bayesian Credible Intervals?