Hypothesis testing is a formal mathematical framework for making inferential decisions about population parameters based on sample data. It provides a structured methodology to evaluate whether observed data yields sufficient evidence to reject a predefined baseline assumption.
The Null and Alternative Hypotheses
The foundation of any statistical test consists of two mutually exclusive statements about a population parameter: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$ or $H_a$).
The null hypothesis ($H_0$) typically represents a state of no effect, no difference, or the historical baseline. It is the hypothesis that is assumed true until statistical evidence indicates otherwise.
The alternative hypothesis ($H_1$) represents the claim or theory the researcher seeks to support. It is accepted only when the sample data provide sufficient evidence to reject $H_0$.
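For a population mean $\mu$ evaluated against a hypothesized value $\mu_0$, tests are formulated in one of three ways:
- Two-tailed test: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$
- Right-tailed test (Upper-tailed): $H_0: \mu \leq \mu_0$ versus $H_1: \mu > \mu_0$
- Left-tailed test (Lower-tailed): $H_0: \mu \geq \mu_0$ versus $H_1: \mu < \mu_0$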
The objective of the testing procedure is not to “prove” $H_0$, but rather to determine whether there is enough evidence to reject it in favor of $H_1$.
Decision Errors in Inference
Because hypothesis testing relies on sample data rather than an exhaustive population census, inferential decisions are subject to probabilistic errors.
Type I Error ($\alpha$)
A Type I Error occurs when the null hypothesis is rejected when it is, in fact, true in the population. This is equivalent to a false positive. The probability of committing a Type I error is denoted by $\alpha$, which is also strictly defined as the significance level of the test.
Type II Error ($\beta$)
A Type II Error occurs when the null hypothesis is not rejected when the alternative hypothesis is true. This is a false negative. The probability of a Type II error is denoted by $\beta$.
In a criminal trial setting where $H_0$ is 'the defendant is innocent', what is the consequence of a Type I error?
Statistical Power
The power of a statistical test is the probability of correctly rejecting a false null hypothesis. It is the complement of the Type II error rate: $\text{Power} = 1 - \beta$.
Power depends on several factors: the significance level $\alpha$, the sample size $n$, the true effect size (the magnitude of the difference between the true parameter and $\mu_0$), and the population variance $\sigma^2$. Increasing sample size generally increases the power of a test.
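The dependence of power on sample size can be illustrated numerically. The following is a minimal Python sketch, assuming a two-tailed z-test of a normal mean with known standard deviation (the setting of the Z-test section that follows). The function name `z_test_power` and the inputs (an effect of 0.5 with a standard deviation of 1) are illustrative assumptions, not values from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed z-test when the true mean is mu_0 + effect.

    Assumes a normal population with known standard deviation sigma.
    """
    std_normal = NormalDist()  # standard normal, N(0, 1)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic Z is N(shift, 1).
    shift = effect * sqrt(n) / sigma
    upper_tail = 1 - std_normal.cdf(z_crit - shift)  # P(Z > z_crit)
    lower_tail = std_normal.cdf(-z_crit - shift)     # P(Z < -z_crit)
    return upper_tail + lower_tail


# Illustrative inputs (assumed): effect = 0.5, sigma = 1.0
for n in (10, 25, 50, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With a zero effect the function returns $\alpha$ itself, since every rejection is then a Type I error. For a fixed nonzero effect the returned power rises toward 1 as $n$ grows.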
Test Statistics and the Z-Test
A test statistic is a standardized value calculated from sample data during a hypothesis test. It measures the degree of agreement between the sample data and the null hypothesis.
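Consider testing the mean $\mu$ of a normally distributed population with a known variance $\sigma^2$. Let $X_1, X_2, \dots, X_n$ be an independent and identically distributed (i.i.d.) random sample from $N(\mu, \sigma^2)$. The sample mean $\bar{X}$ follows a normal distribution:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Under the null hypothesis $H_0: \mu = \mu_0$, the test statistic is constructed by standardizing $\bar{X}$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
If $H_0$ is true, the test statistic follows a standard normal distribution, $Z \sim N(0, 1)$. This distribution governs the probability of observing the test statistic.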
The Rejection Region (Critical Value Approach)
The rejection region is the set of values for the test statistic that leads to the rejection of $H_0$. Its boundaries are determined by the critical values, which depend on the pre-specified significance level $\alpha$ and the directionality of the test.
For a two-tailed test at significance level $\alpha$, the critical values are $\pm z_{\alpha/2}$. The decision rule is: Reject $H_0$ if $|Z| > z_{\alpha/2}$.
For instance, when $\alpha = 0.05$, $z_{0.025} \approx 1.96$. Therefore, if the calculated $Z$ falls outside the interval $[-1.96, 1.96]$, $H_0$ is rejected.
A factory produces steel cables with a specified mean breaking strength of $10,000$ N and a known standard deviation of $400$ N. A quality control engineer suspects the machinery needs calibration and takes a random sample of $n = 50$ cables. The sample mean breaking strength is $9,880$ N. The engineer runs a two-tailed hypothesis test with $\alpha = 0.05$.
Based on the sample data, what is the value of the test statistic $Z$, and does the engineer reject the null hypothesis?
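One way to check the cable question numerically is sketched below. It is a minimal Python example of the critical value approach, assuming the standardized statistic $Z = (\bar{X} - \mu_0) / (\sigma / \sqrt{n})$ and a two-tailed rejection rule. The function name `two_tailed_z_test` is a hypothetical helper; the numeric inputs are taken from the problem statement.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Return (z, critical value, reject?) for a two-tailed z-test."""
    # Standardize the sample mean under H0: mu = mu_0.
    z = (sample_mean - mu_0) / (sigma / sqrt(n))
    # Upper critical value z_{alpha/2} of the standard normal.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


# Values from the steel cable problem: mu_0 = 10000 N, sigma = 400 N, n = 50.
z, z_crit, reject = two_tailed_z_test(9880, 10000, 400, 50)
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Under these assumptions the statistic is about $-2.12$, which lies below $-1.96$, so the null hypothesis is rejected at the 5% level.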
The P-Value Approach
Modern statistical software generally reports the p-value, an alternative to the critical value approach that provides more granular information regarding the strength of the evidence against $H_0$.
The p-value is defined as the probability, calculated under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed.
For the standard normal test statistic $Z$ with observed value $z$:
- Two-tailed test: $p = 2\,P(Z \geq |z|)$
- Right-tailed test: $p = P(Z \geq z)$
- Left-tailed test: $p = P(Z \leq z)$
Decision Rule:
- If $p \leq \alpha$, reject $H_0$.
- If $p > \alpha$, fail to reject $H_0$.
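The tail probabilities behind this rule can be computed directly from the standard normal distribution. The following is a minimal Python sketch, assuming the statistic is standard normal under the null hypothesis. The function name `z_p_value`, its `tail` argument, and the observed value $z = -2.12$ are illustrative assumptions.

```python
from statistics import NormalDist


def z_p_value(z, tail="two"):
    """p-value of an observed z statistic under a standard normal null."""
    cdf = NormalDist().cdf
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # both tails beyond |z|
    if tail == "right":
        return 1 - cdf(z)             # P(Z >= z)
    if tail == "left":
        return cdf(z)                 # P(Z <= z)
    raise ValueError("tail must be 'two', 'right', or 'left'")


alpha = 0.05
z_observed = -2.12  # illustrative observed statistic (assumed)
for tail in ("two", "right", "left"):
    p = z_p_value(z_observed, tail)
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"{tail}-tailed: p = {p:.4f} -> {decision}")
```

The same observed statistic can lead to different decisions depending on the direction of the test, which is why the alternative hypothesis has to be fixed before the data are examined.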
A smaller p-value constitutes stronger evidence against the null hypothesis. It is crucial to note that the p-value is not the probability that the null hypothesis is true ($P(H_0 \mid \text{data})$). It is the probability of data at least as extreme as those observed, given that the null hypothesis is true ($P(\text{data} \mid H_0)$).
A researcher conducts a hypothesis test and obtains a p-value of 0.034. Does this mean there is a 3.4% chance that the null hypothesis is true?
The Student’s t-Test
In practical applications, the population variance $\sigma^2$ is almost always unknown. Replacing the population standard deviation $\sigma$ with the sample standard deviation $S$ changes the distribution of the test statistic.
When $X_i \sim N(\mu, \sigma^2)$ but $\sigma^2$ is unknown, the test statistic follows a Student’s t-distribution with $n - 1$ degrees of freedom ($df$):
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}$$
The t-distribution is symmetric and bell-shaped like the standard normal distribution but possesses heavier tails. These heavier tails place more probability in the extremes, accounting for the additional uncertainty incurred by estimating the variance from a finite sample. As $n \to \infty$, the t-distribution converges to the standard normal distribution $N(0, 1)$.
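The t statistic differs from the z statistic only in using the sample standard deviation in place of the known population value. The sketch below is a minimal Python example using only the standard library. The function name `one_sample_t_statistic` and the six measurements are hypothetical, invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(sample, mu_0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (x_bar - mu_0) / (s / sqrt(n))
    return t, n - 1


# Hypothetical measurements, not taken from the text.
sample = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0]
t, df = one_sample_t_statistic(sample, mu_0=10.0)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Turning the statistic into a p-value requires the t-distribution's cumulative distribution function, which the Python standard library does not provide. A statistics package such as SciPy (`scipy.stats.t`) is one option, assuming it is installed.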
Multiple Hypothesis Testing
When conducting multiple hypothesis tests simultaneously on a single dataset, the probability of committing at least one Type I error compounds. If a researcher conducts $m$ independent tests, each at significance level $\alpha$, and every null hypothesis is true, the family-wise error rate (FWER), defined as the probability of making one or more false discoveries, is given by:
$$\text{FWER} = 1 - (1 - \alpha)^m$$
For example, performing 20 tests at $\alpha = 0.05$ yields an FWER of $1 - (0.95)^{20} \approx 0.64$. Without correction, at least one false positive is more likely than not.
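The growth of the family-wise error rate with the number of tests can be tabulated in a few lines. The following is a minimal Python sketch, assuming independent tests for which every null hypothesis is true, so that the rate equals one minus the probability that no test rejects. The function name `family_wise_error_rate` is a hypothetical helper.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 20, 100):
    print(f"m = {m:3d}: FWER = {family_wise_error_rate(0.05, m):.4f}")
```

For a single test the rate is simply the per-test significance level, and it approaches 1 as the number of tests grows.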
The Bonferroni Correction
The simplest, and a notably conservative, method to control the FWER is the Bonferroni correction. To maintain a given family-wise $\alpha$ across $m$ tests, each individual test is evaluated at a newly adjusted significance level:
$$\alpha_{\text{adj}} = \frac{\alpha}{m}$$
If 20 tests are conducted and the desired global false positive rate is 5%, each individual p-value must be compared against $0.05 / 20 = 0.0025$.
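Applied to a list of p-values, the correction amounts to dividing the family-wise significance level by the number of tests and comparing every p-value against that threshold. The sketch below is a minimal Python example. The function name `bonferroni_reject` and the four p-values are hypothetical, chosen only to show the mechanics.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return (per-test threshold, list of reject decisions) under Bonferroni."""
    m = len(p_values)
    threshold = alpha / m  # adjusted per-test significance level
    return threshold, [p <= threshold for p in p_values]


# Hypothetical p-values from four tests, not taken from the text.
p_values = [0.001, 0.012, 0.034, 0.20]
threshold, decisions = bonferroni_reject(p_values)
for p, rejected in zip(p_values, decisions):
    print(f"p = {p:.3f} vs {threshold:.4f}: {'reject' if rejected else 'fail to reject'}")
```

In this illustration the p-value of 0.034 would be significant in a single test at the 5% level but is not after correction, which shows the loss of power that accompanies the stricter threshold.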
While mathematically rigorous and guaranteed to bound the FWER under all forms of dependence among tests, the Bonferroni correction reduces statistical power, substantially increasing Type II error rates when the number of tests ($m$) is very large, as is common in genomics and machine learning.