Regression analysis is a statistical method for estimating the relationships among variables. It focuses primarily on the relationship between a dependent variable (often called the response or outcome variable) and one or more independent variables (often called predictors, covariates, or explanatory variables). The objective is to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Simple Linear Regression
The most fundamental form of regression analysis is simple linear regression, which models the relationship between a single independent variable $X$ and a dependent variable $Y$. The true relationship is postulated to be a linear function of $X$ plus a stochastic error term.
The population model is defined as:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
where:
- $Y_i$ is the $i$-th observation of the dependent variable.
- $X_i$ is the $i$-th observation of the independent variable.
- $\beta_0$ is the $y$-intercept (the expected value of $Y$ when $X = 0$).
- $\beta_1$ is the slope coefficient (the expected change in $Y$ for a one-unit change in $X$).
- $\epsilon_i$ is the unobserved random error or disturbance term.
Assumptions of Simple Linear Regression
For the standard estimation techniques to be valid and possess desirable statistical properties, certain assumptions regarding the error term must hold:
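- Linearity: The expected value of the response variable is a linear function of the explanatory variables. $E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$, meaning $E[\epsilon_i \mid X_i] = 0$.
- Independence: The errors are independent of each other. $\operatorname{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
- Homoscedasticity (Constant Variance): The errors have a constant variance across all levels of the independent variable. $\operatorname{Var}(\epsilon_i \mid X_i) = \sigma^2$ for all $i$.
- Normality (Optional for estimation, required for inference): The errors are normally distributed. $\epsilon_i \sim N(0, \sigma^2)$.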
Ordinary Least Squares (OLS) Estimation
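The most common method for estimating the unknown parameters $\beta_0$ and $\beta_1$ is Ordinary Least Squares (OLS). The OLS method chooses the estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of the squared residuals (SSR).
The residual for the $i$-th observation is the difference between the observed $Y_i$ and the predicted value $\hat{Y}_i$:
$$e_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$$
The Sum of Squared Residuals (SSR) is:
$$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$
To minimize $SSR$, we take the partial derivatives with respect to $\hat\beta_0$ and $\hat\beta_1$ and set them to zero:
$$\frac{\partial SSR}{\partial \hat\beta_0} = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$$
$$\frac{\partial SSR}{\partial \hat\beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$$
Solving these normal equations yields the OLS estimators:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\widehat{\operatorname{Cov}}(X, Y)}{\widehat{\operatorname{Var}}(X)}$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively.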
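The closed-form estimators can be checked numerically. The following is a minimal Python sketch, assuming the slope is the sum of cross-deviations of $X$ and $Y$ divided by the sum of squared deviations of $X$, and the intercept is $\bar{Y} - \hat\beta_1 \bar{X}$. The function name `simple_ols` and the data are illustrative, not part of the original text.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) for the model y = b0 + b1*x + error."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is not identified")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0_hat, beta1_hat = simple_ols(x, y)
print(beta0_hat, beta1_hat)  # 1.0 2.0
```

Because the data contain no noise, the estimates recover the intercept and slope exactly; with noisy data they would only approximate the population values.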
If the sample covariance between independent variable X and dependent variable Y is exactly zero, what is the value of the OLS estimator for the slope ($\hat\beta_1$)?
Multiple Linear Regression
Multiple linear regression extends the simple linear model to include two or more independent variables. The model with $k$ predictors is written as:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \epsilon_i$$
Because writing out summations becomes unwieldy, multiple regression is almost universally represented using matrix algebra.
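Let $\mathbf{y}$ be an $n \times 1$ vector of observations of the dependent variable, $\mathbf{X}$ be an $n \times (k+1)$ matrix (the design matrix) where the first column is typically all 1s (for the intercept), $\boldsymbol{\beta}$ be a $(k+1) \times 1$ vector of parameters, and $\boldsymbol{\epsilon}$ be an $n \times 1$ vector of errors, so that $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.
The OLS estimator vector $\hat{\boldsymbol{\beta}}$ minimizes $SSR = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$. Expanding this and taking the derivative with respect to the vector $\hat{\boldsymbol{\beta}}$ yields the matrix formulation of the normal equations:
$$\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y}$$
Assuming $\mathbf{X}^\top \mathbf{X}$ is invertible (which requires no perfect multicollinearity among the predictors), the OLS estimator is:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$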
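To make the matrix formulation concrete, here is a small standard-library Python sketch that forms $\mathbf{X}^\top \mathbf{X}$ and $\mathbf{X}^\top \mathbf{y}$ and solves the normal equations by Gaussian elimination rather than by explicit inversion. The helper names (`transpose`, `matmul`, `solve`, `ols`), the singularity tolerance, and the data are assumptions made for illustration.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # crude tolerance, an assumption
            raise ValueError("X'X is singular (perfect multicollinearity?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y for beta."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
rows = [(1, 0), (0, 1), (1, 1), (2, 1), (3, 5)]
X = [[1.0, x1, x2] for x1, x2 in rows]  # first column of 1s is the intercept
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = ols(X, y)
print(beta_hat)  # approximately [1.0, 2.0, -3.0], up to rounding
```

Forming $\mathbf{X}^\top \mathbf{X}$ explicitly is fine for a small illustration, but it can amplify rounding error when predictors are nearly collinear; numerical libraries typically offer least-squares routines such as `numpy.linalg.lstsq` that avoid this step.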
The Gauss-Markov Theorem
The Gauss-Markov theorem justifies the use of the OLS estimator. It states that under the classical linear regression model assumptions (linearity, strict exogeneity/independence, no perfect multicollinearity, and homoscedasticity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
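- Linear: $\hat{\boldsymbol{\beta}}$ is a linear function of the observed random variables $\mathbf{y}$. We can write $\hat{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}$ where $\mathbf{C} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$.
- Unbiased: The expected value of the estimator is the true parameter. $E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$.
- Best: It has the minimum variance among all linear unbiased estimators. $\operatorname{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \leq \operatorname{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X})$ for any other linear unbiased estimator $\tilde{\boldsymbol{\beta}}$, in the sense that the difference of the two matrices is positive semidefinite.
The variance-covariance matrix of the OLS estimator is:
$$\operatorname{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$
where $\sigma^2$ is the variance of the error term, typically estimated by $s^2 = \frac{\mathbf{e}^\top \mathbf{e}}{n - k - 1}$, with $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ being the vector of residuals.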
Which assumption is NOT required for the OLS estimators to be unbiased (part of the Gauss-Markov theorem)?
Goodness of Fit and Inference
To assess how well the model fits the data, we decompose the total variation in the dependent variable into explained and unexplained components.
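- Total Sum of Squares (SST): Measures the total variation in $Y$ around its mean. $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
- Model/Explained Sum of Squares (SSM): Measures the variation in $Y$ explained by the regression model. $SSM = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
- Residual/Error Sum of Squares (SSR): Measures the variation in $Y$ not explained by the model. $SSR = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
The relationship is $SST = SSM + SSR$.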
Coefficient of Determination ($R^2$)
The $R^2$ statistic represents the proportion of variance in the dependent variable explained by the independent variables in the model:
$$R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}$$
While $0 \leq R^2 \leq 1$, adding more predictors to a model will mechanically never decrease $R^2$, even if the predictors are irrelevant. To account for this, the Adjusted $R^2$ penalizes models for adding variables that do not significantly improve the fit:
$$\bar{R}^2 = 1 - \frac{SSR / (n - k - 1)}{SST / (n - 1)} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}$$
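Both statistics are easy to compute from observed and fitted values. The sketch below assumes $R^2 = 1 - SSR/SST$ and the usual adjustment $1 - (1 - R^2)(n - 1)/(n - k - 1)$ for a model with $k$ predictors plus an intercept. The function names and all numbers are hypothetical, and the fitted values are made up rather than produced by an actual fit.

```python
from statistics import mean

def r_squared(y, y_hat):
    """R^2 = 1 - SSR/SST for observed values y and fitted values y_hat."""
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    if sst == 0:
        raise ValueError("y has no variation, so R^2 is undefined")
    return 1.0 - ssr / sst

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (plus an intercept)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed and fitted values.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.2, 1.9, 3.1, 3.8, 5.0]
r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), k=1)

# Hypothetical comparison: seven extra predictors raise R^2 only slightly,
# so the adjusted value falls.
small_model = adjusted_r_squared(0.750, n=100, k=3)
large_model = adjusted_r_squared(0.751, n=100, k=10)
print(r2, adj, small_model, large_model)
```

In the comparison at the end, the larger model has the higher raw $R^2$ but the lower adjusted value, which is the penalty for adding predictors that barely improve the fit.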
Hypothesis Testing
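Under the assumption that $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$, the OLS estimators are normally distributed:
$$\hat{\boldsymbol{\beta}} \mid \mathbf{X} \sim N\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\right)$$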
Test of Individual Significance (t-test)
To test the hypothesis that a single independent variable $X_j$ has no effect on $Y$ (i.e., $H_0: \beta_j = 0$), a t-statistic is used:
$$t = \frac{\hat\beta_j}{SE(\hat\beta_j)}$$
where $SE(\hat\beta_j)$ is the standard error of the estimate, given by the square root of the $j$-th diagonal element of the estimated variance-covariance matrix $s^2 (\mathbf{X}^\top \mathbf{X})^{-1}$. Under the null hypothesis, this statistic follows a Student’s t-distribution with $n - k - 1$ degrees of freedom.
Test of Overall Significance (F-test)
To test the joint hypothesis that all slope coefficients (excluding the intercept) are simultaneously zero ($H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$), an F-statistic is constructed from the sums of squares:
$$F = \frac{SSM / k}{SSR / (n - k - 1)}$$
Under the null hypothesis, this follows an F-distribution with $(k, n - k - 1)$ degrees of freedom. A large F-statistic provides evidence against the null hypothesis, indicating that at least one predictor variable is significantly related to the response variable.
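As a rough illustration, the F-statistic can be computed directly from the explained and residual sums of squares, each divided by its degrees of freedom ($k$ for the model, $n - k - 1$ for the residuals). The function name `f_statistic` and the input values below are hypothetical.

```python
def f_statistic(ssm, ssr, n, k):
    """Return (F, (df_model, df_resid)) for the overall significance test.

    ssm: explained sum of squares, ssr: residual sum of squares,
    n: number of observations, k: number of predictors (excluding intercept).
    """
    df_model = k
    df_resid = n - k - 1
    if df_model <= 0 or df_resid <= 0:
        raise ValueError("need k >= 1 and n > k + 1")
    if ssr <= 0:
        raise ValueError("residual sum of squares must be positive")
    f = (ssm / df_model) / (ssr / df_resid)
    return f, (df_model, df_resid)

# Hypothetical sums of squares: 75% of the total variation is explained.
f_value, dfs = f_statistic(ssm=75.0, ssr=25.0, n=100, k=3)
print(f_value, dfs)
```

Turning the statistic into a p-value needs the F-distribution's upper-tail probability, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.f.sf`) or a printed F table would be used for that step.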
A data scientist constructs a multiple linear regression model to predict the price of houses ($Y$, in thousands of dollars) based on square footage ($X_1$), age of the house ($X_2$, in years), and distance to the city center ($X_3$, in miles). The estimated model is $\hat{Y} = 150 + 0.2X_1 - 1.5X_2 - 5.0X_3$. The $R^2$ is 0.75, the Adjusted $R^2$ is 0.74, and the sample size is $n=100$. The standard error for $\hat\beta_2$ is $0.5$.
You want to formally test if the age of the house has a statistically significant effect on the price at a 5% significance level. Calculate the t-statistic for $\hat\beta_2$ and describe the conclusion. Assume the critical t-value for $df = 96$ at $\alpha=0.05$ (two-tailed) is approximately 1.98.
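One way to work through this test is sketched below in Python, using only the numbers given in the exercise ($\hat\beta_2 = -1.5$, standard error $0.5$, critical value $1.98$, $n = 100$, three predictors). The variable names are illustrative.

```python
# Values taken from the exercise statement.
beta2_hat = -1.5   # estimated coefficient on age (thousands of dollars per year)
se_beta2 = 0.5     # standard error of that estimate
t_critical = 1.98  # two-tailed critical value at alpha = 0.05
n, k = 100, 3      # sample size and number of predictors

df = n - k - 1                   # residual degrees of freedom
t_stat = beta2_hat / se_beta2    # t-statistic under H0: beta_2 = 0
reject_null = abs(t_stat) > t_critical

print(df, t_stat, reject_null)  # 96 -3.0 True
```

Since $|t| = 3.0$ exceeds $1.98$, the null hypothesis $H_0: \beta_2 = 0$ is rejected at the 5% level: age has a statistically significant effect, with each additional year associated with a predicted price about \$1,500 lower, holding square footage and distance fixed.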
Residual Diagnostics
Estimation is only part of the process; the model's assumptions must also be checked against the data. Analyzing the residuals ($e_i = Y_i - \hat{Y}_i$) is the primary tool for diagnostics.
- Non-linearity: Plot the residuals against the predicted values ($\hat{Y}_i$) or against individual predictors ($X_{ij}$). A systematic pattern, such as a U-shape, suggests the relationship is non-linear, perhaps requiring polynomial terms or transformations.
- Heteroscedasticity: If the spread of the residuals increases or decreases with $\hat{Y}_i$ (often forming a “funnel” shape in a residual plot), the constant variance assumption is violated. The OLS coefficient estimates remain unbiased, but the usual standard errors are incorrect, which invalidates hypothesis tests. Robust standard errors or Weighted Least Squares (WLS) can address this.
- Non-normality: A Normal Q-Q (quantile-quantile) plot compares the distribution of the residuals to a theoretical normal distribution. Significant deviations from the straight line, particularly at the tails, imply non-normal errors.
- Outliers and Leverage: Observations with extreme $Y$ values given their $X$ values are outliers. Observations with extreme $X$ values have high leverage. Points with both high leverage and large residuals exert undue influence on the regression line. Cook’s Distance is a metric used to quantify the overall influence of an observation on the estimated coefficients:
$$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{(k + 1) s^2}$$
where $\hat{Y}_{j(i)}$ is the predicted value of the $j$-th observation when the model is refitted without the $i$-th observation, $k + 1$ is the number of estimated coefficients, and $s^2$ is the estimated error variance. A high Cook’s distance indicates a highly influential data point.
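The leave-one-out definition of Cook's distance can be sketched directly for simple linear regression: refit the line without each observation in turn, sum the squared changes in all fitted values, and divide by the number of estimated coefficients times the error-variance estimate from the full fit. The function names and the data are hypothetical, and this brute-force refitting is chosen for clarity, not efficiency.

```python
from statistics import mean

def fit_line(x, y):
    """Closed-form simple OLS fit; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def cooks_distances(x, y):
    """Cook's distance for each observation, by refitting without it.

    x and y are lists of equal length.
    """
    n, p = len(x), 2  # p = estimated coefficients (intercept and slope)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    # Error-variance estimate from the full fit.
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict all n points.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a0 + a1 * xj)) ** 2 for fi, xj in zip(fitted, x))
        distances.append(shift / (p * s2))
    return distances

# Hypothetical data: the last point has an extreme x value and sits well
# below the trend of the other five points.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, d)
```

In this example the last observation combines high leverage with a large deviation from the trend of the others, so it has by far the largest distance.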
Regression analysis is a foundation for both predictive modeling and causal inference, linking classical statistics to modern machine learning applications.