
© 2026 LIBREUNI PROJECT

Hypothesis Testing

Hypothesis testing is a formal mathematical framework for making inferential decisions about population parameters based on sample data. It provides a structured methodology to evaluate whether observed data yields sufficient evidence to reject a predefined baseline assumption.

The Null and Alternative Hypotheses

The foundation of any statistical test consists of two mutually exclusive statements about a population parameter: the null hypothesis ($H_0$) and the alternative hypothesis ($H_a$ or $H_1$).

The null hypothesis ($H_0$) typically represents a state of no effect, no difference, or the historical baseline. It is the hypothesis that is assumed true until statistical evidence indicates otherwise.

The alternative hypothesis ($H_a$) represents the claim or theory the researcher seeks to support; it is adopted only if the sample data provide sufficient evidence to reject $H_0$.

For a population mean $\mu$ evaluated against a hypothesized value $\mu_0$, tests are formulated in one of three ways:

  1. Two-tailed test: $H_0: \mu = \mu_0 \quad \text{vs.} \quad H_a: \mu \neq \mu_0$
  2. Right-tailed test (Upper-tailed): $H_0: \mu \le \mu_0 \quad \text{vs.} \quad H_a: \mu > \mu_0$
  3. Left-tailed test (Lower-tailed): $H_0: \mu \ge \mu_0 \quad \text{vs.} \quad H_a: \mu < \mu_0$

The objective of the testing procedure is not to “prove” $H_0$, but to determine whether there is enough evidence to reject it in favor of $H_a$.

Decision Errors in Inference

Because hypothesis testing relies on sample data rather than an exhaustive population census, inferential decisions are subject to probabilistic errors.

Type I Error ($\alpha$)

A Type I Error occurs when the null hypothesis is rejected even though it is true in the population. This is a false positive. The probability of committing a Type I error is denoted by $\alpha$ and is, by definition, the significance level of the test.

$$\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$$

Type II Error ($\beta$)

A Type II Error occurs when the null hypothesis is not rejected even though the alternative hypothesis is true. This is a false negative. The probability of a Type II error is denoted by $\beta$.

$$\beta = P(\text{Fail to reject } H_0 \mid H_a \text{ is true})$$
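Both error rates can be estimated by simulation. The sketch below is a minimal Monte Carlo illustration, assuming a normal population with known $\sigma = 1$, samples of size $n = 25$, a two-tailed z-test of $H_0: \mu = 0$ at $\alpha = 0.05$, and a hypothetical true mean of $0.3$ when $H_a$ holds; the function name `rejection_rate` is illustrative only. When $H_0$ is true, the fraction of rejections estimates $\alpha$; when $H_a$ is true, the fraction of non-rejections estimates $\beta$.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05, trials=10_000, seed=1):
    """Fraction of simulated samples from N(true_mu, sigma^2) in which a
    two-tailed z-test of H0: mu = mu0 rejects at level alpha.
    All default parameter values are hypothetical choices for illustration."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

type_i_rate = rejection_rate(true_mu=0.0)       # H0 true: estimates alpha
type_ii_rate = 1 - rejection_rate(true_mu=0.3)  # Ha true (mu = 0.3): estimates beta
print(f"estimated alpha = {type_i_rate:.3f}, estimated beta = {type_ii_rate:.3f}")
```

The estimated Type I rate should land close to 0.05. The Type II rate for this particular alternative is large (roughly 0.68), which shows that controlling $\alpha$ says nothing by itself about $\beta$.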

In a criminal trial setting where $H_0$ is 'the defendant is innocent', what is the consequence of a Type I error?

Statistical Power

The power of a statistical test is the probability of correctly rejecting a false null hypothesis. It is the complement of the Type II error rate.

$$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_a \text{ is true})$$

Power depends on several factors: the significance level $\alpha$, the sample size $n$, the true effect size (the magnitude of the difference between the true parameter and $\mu_0$), and the population variance $\sigma^2$. With the other factors held fixed, a larger sample size increases the power of a test.
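For the two-tailed z-test introduced below, power has a closed form. If the true mean is $\mu_0 + \delta$, then $Z \sim N(\delta\sqrt{n}/\sigma,\, 1)$, so $\text{Power} = \Phi(\delta\sqrt{n}/\sigma - z_{\alpha/2}) + \Phi(-\delta\sqrt{n}/\sigma - z_{\alpha/2})$, where $\Phi$ is the standard normal CDF. The sketch below evaluates this expression with Python's `statistics.NormalDist`; the function name `z_test_power` and the effect size $\delta = 0.3\sigma$ are hypothetical choices for illustration.

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma  # mean of Z under the alternative
    # P(Z > z_crit) + P(Z < -z_crit) with Z ~ N(shift, 1)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (10, 25, 50, 100):
    print(f"n = {n:>3}: power = {z_test_power(delta=0.3, sigma=1.0, n=n):.3f}")
```

Setting `delta=0` returns exactly $\alpha$, since with no effect the only rejections are Type I errors.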

Test Statistics and the Z-Test

A test statistic is a standardized value calculated from sample data during a hypothesis test. It measures the degree of agreement between the sample data and the null hypothesis.

Consider testing the mean of a normally distributed population with a known variance $\sigma^2$. Let $X_1, X_2, \dots, X_n$ be an independent and identically distributed (i.i.d.) random sample from $N(\mu, \sigma^2)$. The sample mean $\bar{X}$ follows a normal distribution:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
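The sampling distribution of $\bar{X}$ can be checked empirically. This sketch draws many samples from a normal population with hypothetical parameters $\mu = 10$, $\sigma = 2$, $n = 16$ and compares the mean and variance of the simulated sample means with $\mu$ and $\sigma^2/n = 0.25$; `simulate_sample_means` is an illustrative name.

```python
import random
from statistics import fmean, pvariance

def simulate_sample_means(mu, sigma, n, reps=5_000, seed=0):
    """Return `reps` sample means, each from an i.i.d. N(mu, sigma^2) sample of size n."""
    rng = random.Random(seed)
    return [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(reps)]

mu, sigma, n = 10.0, 2.0, 16
means = simulate_sample_means(mu, sigma, n)
print("mean of sample means:", round(fmean(means), 3), "theory:", mu)
print("variance of sample means:", round(pvariance(means), 4), "theory:", sigma ** 2 / n)
```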

Under the null hypothesis H0:μ=μ0H_0: \mu = \mu_0, the test statistic ZZ is constructed by standardizing Xˉ\bar{X}:

Z=Xˉμ0σnZ = \frac{\bar{X} - \mu_0}{\frac{\sigma}{\sqrt{n}}}

If H0H_0 is true, the test statistic ZZ follows a standard normal distribution, ZN(0,1)Z \sim N(0, 1). This distribution governs the probability of observing the test statistic.
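The standardization above is a one-line computation on raw data. A minimal sketch, assuming the population $\sigma$ is known; the function name `z_statistic` and the four measurements are hypothetical.

```python
from math import sqrt
from statistics import fmean

def z_statistic(sample, mu0, sigma):
    """Z = (X-bar - mu0) / (sigma / sqrt(n)), with sigma the known population SD."""
    n = len(sample)
    return (fmean(sample) - mu0) / (sigma / sqrt(n))

# Hypothetical data: four measurements, known sigma = 2, testing H0: mu = 10.
print(z_statistic([11.0, 12.0, 10.0, 13.0], mu0=10.0, sigma=2.0))  # prints 1.5
```

Here $\bar{X} = 11.5$ and $\sigma/\sqrt{n} = 1$, so $Z = 1.5$.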

The Rejection Region (Critical Value Approach)

The rejection region is the set of values of the test statistic that leads to the rejection of $H_0$. Its boundaries are determined by the critical values, which depend on the pre-specified significance level $\alpha$ and the directionality of the test.

For a two-tailed test at significance level $\alpha$, the critical values are $\pm z_{\alpha/2}$. The decision rule is: reject $H_0$ if $|Z| > z_{\alpha/2}$.

For instance, when $\alpha = 0.05$, $z_{0.025} \approx 1.96$. Therefore, if the calculated $Z$ falls outside the interval $[-1.96, 1.96]$, $H_0$ is rejected.
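Critical values come from the inverse CDF of $N(0, 1)$. The sketch below uses `statistics.NormalDist.inv_cdf` and, as an extension of the two-tailed rule, also returns the single boundary of the right-tailed ($Z > z_\alpha$) and left-tailed ($Z < -z_\alpha$) tests from the three formulations listed earlier. The function name and the `tail` argument values are illustrative assumptions.

```python
from statistics import NormalDist

def z_critical_values(alpha, tail="two"):
    """Rejection-region boundary (or boundaries) for a z-test at level alpha."""
    std_normal = NormalDist()
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
        return (-z, z)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)   # reject if Z > z_alpha
    if tail == "left":
        return std_normal.inv_cdf(alpha)       # reject if Z < -z_alpha
    raise ValueError("tail must be 'two', 'right', or 'left'")

print(z_critical_values(0.05))           # approximately (-1.96, 1.96)
print(z_critical_values(0.05, "right"))  # approximately 1.645
```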

Manufacturing Quality Control

A factory produces steel cables with a specified mean breaking strength of $10,000$ N and a known standard deviation of $400$ N. A quality control engineer suspects the machinery needs calibration and takes a random sample of $n = 50$ cables. The sample mean breaking strength is $9,880$ N. The engineer runs a two-tailed hypothesis test with $\alpha = 0.05$.

Based on the sample data, what is the value of the test statistic $Z$, and does the engineer reject the null hypothesis?
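The quality-control question can be worked directly from the summary figures given in the problem: $\mu_0 = 10{,}000$ N, $\sigma = 400$ N, $n = 50$, $\bar{x} = 9{,}880$ N, $\alpha = 0.05$. A short sketch of the arithmetic:

```python
from math import sqrt
from statistics import NormalDist

mu0 = 10_000   # specified mean breaking strength (N), the value under H0
sigma = 400    # known population standard deviation (N)
n = 50         # sample size
x_bar = 9_880  # observed sample mean (N)
alpha = 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))       # standardize the sample mean
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
reject = abs(z) > z_crit
print(f"Z = {z:.4f}, critical value = {z_crit:.4f}, reject H0: {reject}")
# → Z = -2.1213, critical value = 1.9600, reject H0: True
```

Since $|Z| \approx 2.12 > 1.96$, the engineer rejects $H_0$: the sample suggests the mean breaking strength differs from the specification.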

The P-Value Approach

Modern statistical software generally reports the p-value, an alternative to the critical value approach that provides more granular information regarding the strength of the evidence against $H_0$.

The p-value is defined as the probability, calculated under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed.

For the standard normal test statistic $Z_{obs}$:

  • Two-tailed test: $p = 2 \cdot P(Z \ge |Z_{obs}|)$
  • Right-tailed test: $p = P(Z \ge Z_{obs})$
  • Left-tailed test: $p = P(Z \le Z_{obs})$

Decision Rule:

  • If $p \leq \alpha$, reject $H_0$.
  • If $p > \alpha$, fail to reject $H_0$.
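The three p-value formulas and the decision rule translate directly into code. This sketch uses the symmetry of $N(0, 1)$, $P(Z \ge x) = \Phi(-x)$, which avoids subtracting from 1; the function names `z_test_p_value` and `decide` are hypothetical.

```python
from statistics import NormalDist

def z_test_p_value(z_obs, tail="two"):
    """p-value of an observed z statistic, computed under H0 where Z ~ N(0, 1)."""
    phi = NormalDist().cdf
    if tail == "two":
        return 2 * phi(-abs(z_obs))  # 2 * P(Z >= |z_obs|)
    if tail == "right":
        return phi(-z_obs)           # P(Z >= z_obs)
    if tail == "left":
        return phi(z_obs)            # P(Z <= z_obs)
    raise ValueError("tail must be 'two', 'right', or 'left'")

def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 when p <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

for z_obs in (1.0, 1.96, 2.5):
    p = z_test_p_value(z_obs)
    print(f"z = {z_obs}: p = {p:.4f} -> {decide(p)}")
```

At $|Z_{obs}| = z_{\alpha/2}$ the two-tailed p-value equals $\alpha$ exactly, so the p-value rule and the critical-value rule always reach the same decision.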

A smaller p-value constitutes stronger evidence against the null hypothesis. It is crucial to note that the p-value is not the probability that the null hypothesis is true, $P(H_0 \mid \text{data})$. It is the probability of data at least as extreme as those observed, computed assuming $H_0$ is true, loosely written $P(\text{data} \mid H_0)$.

A researcher conducts a hypothesis test and obtains a p-value of 0.034. Does this mean there is a 3.4% chance that the null hypothesis is true?

The Student’s t-Test

In practical applications, the population variance $\sigma^2$ is almost always unknown. Replacing the population standard deviation $\sigma$ with the sample standard deviation $s$ changes the distribution of the test statistic.

When $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ are i.i.d. but $\sigma$ is unknown, the test statistic follows, under $H_0: \mu = \mu_0$, a Student’s t-distribution with $n - 1$ degrees of freedom ($df$):

$$t = \frac{\bar{X} - \mu_0}{\frac{s}{\sqrt{n}}} \sim t_{n-1}$$
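Computing the t statistic differs from the z statistic only in using the sample standard deviation $s$ (with the $n - 1$ denominator, as `statistics.stdev` does). A minimal sketch; the function name `t_statistic` and the six measurements are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

def t_statistic(sample, mu0):
    """One-sample t statistic and its degrees of freedom (n - 1)."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation, divides by n - 1
    return (fmean(sample) - mu0) / (s / sqrt(n)), n - 1

# Hypothetical measurements; testing H0: mu = 5.0 with sigma unknown.
t, df = t_statistic([5.1, 5.4, 4.9, 5.6, 5.3, 5.2], mu0=5.0)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

For this sample $t \approx 2.52$. With $df = 5$, the two-tailed 5% critical value is about 2.571 rather than 1.96, so $H_0$ is not rejected, even though the same number compared against the normal cutoff would have looked significant.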

The t-distribution is symmetric and bell-shaped like the standard normal distribution but has heavier tails. These heavier tails place more probability in the extremes, reflecting the additional uncertainty incurred by estimating the variance from a finite sample. As $n \to \infty$, the t-distribution converges to the standard normal distribution $N(0,1)$.
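The heavier tails and the convergence to $N(0,1)$ can be seen numerically. Python's standard library has no t-distribution, so this sketch integrates the Student's t density with Simpson's rule to get $P(|T| \ge 1.96)$ for several degrees of freedom and compares it with the normal value of about 0.05. The helper names are hypothetical, and in practice a library routine (for example SciPy's `scipy.stats.t`) would be the usual choice.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma keeps the normalizing constant finite even for large df.
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_tail(b, df, steps=2_000):
    """P(|T| >= b) for b >= 0, computed as 1 - 2 * (integral of t_pdf over [0, b])."""
    h = b / steps  # Simpson's rule needs an even number of steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

normal_tail = 2 * NormalDist().cdf(-1.96)  # P(|Z| >= 1.96) for Z ~ N(0, 1)
for df in (2, 5, 30, 1000):
    print(f"df = {df:>4}: P(|T| >= 1.96) = {t_two_sided_tail(1.96, df):.4f}")
print(f"N(0, 1):    P(|Z| >= 1.96) = {normal_tail:.4f}")
```

The tail probability shrinks toward the normal value as the degrees of freedom grow, which is why small-sample tests need larger critical values.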

Multiple Hypothesis Testing

When conducting multiple hypothesis tests simultaneously on a single dataset, the probability of committing at least one Type I error compounds. If a researcher conducts $m$ independent tests, each at significance level $\alpha$ and each with a true null hypothesis, the family-wise error rate (FWER), the probability of making one or more false discoveries, is given by:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

For example, performing 20 tests at $\alpha = 0.05$ yields an FWER of $1 - 0.95^{20} \approx 0.64$. Without correction, at least one false positive is more likely than not.
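The FWER formula is easy to tabulate for different numbers of tests. A minimal sketch, assuming independent tests with all null hypotheses true; `fwer_independent` is an illustrative name.

```python
def fwer_independent(alpha, m):
    """P(at least one Type I error) across m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: FWER = {fwer_independent(0.05, m):.4f}")
```

With a single test the FWER is just $\alpha$; by $m = 20$ it has already passed 0.64.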

The Bonferroni Correction

The Bonferroni correction is the simplest, and one of the most conservative, methods for controlling the FWER. To maintain a given family-wise level $\alpha_{FWER}$, each individual test is evaluated at an adjusted significance level:

$$\alpha_{individual} = \frac{\alpha_{FWER}}{m}$$

If 20 tests are conducted and the desired family-wise false positive rate is 5%, each individual p-value must be compared against $\alpha_{individual} = 0.05 / 20 = 0.0025$.
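Applying the correction amounts to comparing every p-value with $\alpha_{FWER}/m$. A minimal sketch; the function name `bonferroni_reject` and the 20 p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha_fwer=0.05):
    """Flag each test as rejected (True) if its p-value is <= alpha_fwer / m."""
    m = len(p_values)
    threshold = alpha_fwer / m
    return [p <= threshold for p in p_values]

# Hypothetical p-values from m = 20 tests.
p_values = [0.001, 0.004, 0.03] + [0.5] * 17
flags = bonferroni_reject(p_values)
print(sum(flags), "of", len(p_values), "tests rejected")  # prints "1 of 20 tests rejected"
```

Only $p = 0.001$ clears the 0.0025 threshold; the tests with $p = 0.004$ and $p = 0.03$ would have been rejected at an uncorrected 0.05.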

While mathematically rigorous and guaranteed to bound the FWER under any form of dependence among the tests, the Bonferroni correction reduces the statistical power of each test, and Type II error rates climb steeply when the number of tests ($m$) is very large, as is common in genomics and machine learning.
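The cost in power can be quantified with the two-tailed z-test power formula $\text{Power} = \Phi(\delta\sqrt{n}/\sigma - z_{\alpha/2}) + \Phi(-\delta\sqrt{n}/\sigma - z_{\alpha/2})$, evaluated at the Bonferroni level $\alpha/m$. This sketch assumes a hypothetical effect of $\delta = 0.5\sigma$, $n = 30$ observations per test, and a family-wise level of 0.05; the function name is illustrative.

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha):
    """Power of a two-tailed one-sample z-test when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma  # mean of Z under the alternative
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical setting: effect of 0.5 sigma, n = 30 per test, family-wise alpha = 0.05.
for m in (1, 20, 1_000, 1_000_000):
    power = z_test_power(delta=0.5, sigma=1.0, n=30, alpha=0.05 / m)
    print(f"m = {m:>9,}: per-test alpha = {0.05 / m:.1e}, power = {power:.3f}")
```

The per-test effect stays the same, yet power drops as $m$ grows, because the critical value $z_{\alpha/(2m)}$ moves further into the tail.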
