
© 2026 LIBREUNI PROJECT

Regression Analysis

Regression analysis is a statistical method for estimating the relationships among variables. It focuses primarily on the relationship between a dependent variable (often called the response or outcome variable) and one or more independent variables (often called predictors, covariates, or explanatory variables). The objective is to understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Simple Linear Regression

The most fundamental form of regression analysis is simple linear regression, which models the relationship between a single independent variable $X$ and a dependent variable $Y$. The true relationship is postulated to be a linear function of $X$ plus a stochastic error term.

The population model is defined as:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where:

  • $Y_i$ is the $i$-th observation of the dependent variable.
  • $X_i$ is the $i$-th observation of the independent variable.
  • $\beta_0$ is the $y$-intercept (the expected value of $Y$ when $X = 0$).
  • $\beta_1$ is the slope coefficient (the expected change in $Y$ for a one-unit change in $X$).
  • $\epsilon_i$ is the unobserved random error or disturbance term.

Assumptions of Simple Linear Regression

For the standard estimation techniques to be valid and possess desirable statistical properties, certain assumptions regarding the error term $\epsilon_i$ must hold:

  1. Linearity: The expected value of the response variable is a linear function of the explanatory variable: $\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X$, which is equivalent to $\mathbb{E}[\epsilon \mid X] = 0$.
  2. Independence: The errors are independent of each other, which implies they are uncorrelated: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
  3. Homoscedasticity (Constant Variance): The errors have a constant variance across all levels of the independent variable: $\text{Var}(\epsilon_i \mid X_i) = \sigma^2$ for all $i$.
  4. Normality (optional for estimation, required for exact finite-sample inference): The errors are normally distributed: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Ordinary Least Squares (OLS) Estimation

The most common method for estimating the unknown parameters $\beta_0$ and $\beta_1$ is Ordinary Least Squares (OLS). The OLS method chooses the estimates $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of the squared residuals (SSR).

The residual $e_i$ for the $i$-th observation is the difference between the observed $Y_i$ and the predicted value $\hat{Y}_i$:

$$e_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$$

The Sum of Squared Residuals (SSR) is:

$$S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$

To minimize $S$, we take the partial derivatives with respect to $\hat\beta_0$ and $\hat\beta_1$ and set them to zero:

$$\frac{\partial S}{\partial \hat\beta_0} = -2 \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$$

$$\frac{\partial S}{\partial \hat\beta_1} = -2 \sum_{i=1}^n X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$$

Solving these normal equations yields the OLS estimators:

$$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$$

$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively, and $\text{Cov}(X, Y)$ and $\text{Var}(X)$ denote the sample covariance and sample variance.
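The closed-form estimators above translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `simple_ols` and the data are made up for illustration and are not part of the original module.

```python
from statistics import fmean

def simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) for the model y = beta0 + beta1 * x + error."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = fmean(x), fmean(y)
    # Numerator: sum of cross-deviations; denominator: sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is not identified")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Made-up illustrative data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
beta0_hat, beta1_hat = simple_ols(x, y)
print(beta0_hat, beta1_hat)
```

For this data the cross-deviations sum to 6 and the squared deviations of `x` sum to 10, so the slope is 0.6 and the intercept is 4 - 0.6 × 3 = 2.2 (up to floating-point rounding).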

If the sample covariance between independent variable X and dependent variable Y is exactly zero, what is the value of the OLS estimator for the slope ($\hat\beta_1$)?

Multiple Linear Regression

Multiple linear regression extends the simple linear model to include two or more independent variables. The model with $k$ predictors is written as:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \epsilon_i$$

Because writing out summations becomes unwieldy, multiple regression is almost universally represented using matrix algebra.

Let $Y$ be an $n \times 1$ vector of observations of the dependent variable, $X$ be an $n \times (k+1)$ matrix (the design matrix) where the first column is typically all 1s (for the intercept), $\beta$ be a $(k+1) \times 1$ vector of parameters, and $\epsilon$ be an $n \times 1$ vector of errors.

$$Y = X\beta + \epsilon$$

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

The OLS estimator vector $\hat\beta$ minimizes $(Y - X\hat\beta)^T (Y - X\hat\beta)$. Expanding this and taking the derivative with respect to the vector $\hat\beta$ yields the matrix formulation of the normal equations:

$$X^T X \hat\beta = X^T Y$$
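The expansion and differentiation step can be written out explicitly. The sketch below uses only the symbols already defined, together with two standard facts: $Y^T X \hat\beta$ is a scalar and so equals its transpose, and $X^T X$ is symmetric.

```latex
\begin{aligned}
S(\hat\beta) &= (Y - X\hat\beta)^T (Y - X\hat\beta) \\
             &= Y^T Y - 2\,\hat\beta^T X^T Y + \hat\beta^T X^T X \hat\beta \\
\frac{\partial S}{\partial \hat\beta} &= -2\,X^T Y + 2\,X^T X \hat\beta = 0
\quad\Longrightarrow\quad X^T X \hat\beta = X^T Y
\end{aligned}
```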

Assuming $X^T X$ is invertible (which requires no perfect multicollinearity among the predictors), the OLS estimator is:

$$\hat\beta = (X^T X)^{-1} X^T Y$$
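As an illustration, the matrix estimator can be computed with NumPy by solving the normal equations directly. This is a minimal sketch: the function name `ols_fit` and the data are hypothetical, and the response is generated without noise so that the recovered coefficients are easy to check.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations X'X b = X'y for b.

    X is the n x (k+1) design matrix (first column of ones), y has length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.linalg.solve raises LinAlgError if X'X is exactly singular
    # (perfect multicollinearity).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data with k = 2 predictors, generated exactly from
# y = 1 + 2*x1 - 3*x2 (no noise), so OLS should recover (1, 2, -3).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1, x2])
y = 1.0 + 2.0 * x1 - 3.0 * x2

beta_hat = ols_fit(X, y)
print(beta_hat)
```

Solving the linear system avoids forming the explicit inverse. In practice, `np.linalg.lstsq(X, y, rcond=None)` is often preferred because it does not form $X^T X$ at all, which tends to be numerically more stable when predictors are nearly collinear.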

The Gauss-Markov Theorem

The Gauss-Markov theorem justifies the use of the OLS estimator. It states that under the classical linear regression model assumptions (linearity, strict exogeneity/independence, no perfect multicollinearity, and homoscedasticity), the OLS estimator $\hat\beta$ is the Best Linear Unbiased Estimator (BLUE).

  1. Linear: $\hat\beta$ is a linear function of the observed random variables $Y$. We can write $\hat\beta = AY$ where $A = (X^T X)^{-1} X^T$.
  2. Unbiased: The expected value of the estimator is the true parameter, $\mathbb{E}[\hat\beta] = \beta$. Treating $X$ as fixed: $\mathbb{E}[\hat\beta] = \mathbb{E}[(X^T X)^{-1} X^T (X\beta + \epsilon)] = \beta + (X^T X)^{-1} X^T \mathbb{E}[\epsilon] = \beta + 0 = \beta$
  3. Best: It has the minimum variance among all linear unbiased estimators: $\text{Var}(\hat\beta_{OLS}) \leq \text{Var}(\tilde{\beta})$ for any other linear unbiased estimator $\tilde{\beta}$, in the sense that $\text{Var}(\tilde{\beta}) - \text{Var}(\hat\beta_{OLS})$ is positive semidefinite.

The variance-covariance matrix of the OLS estimator is:

$$\text{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$$

where $\sigma^2$ is the variance of the error term, typically estimated by $s^2 = \frac{e^T e}{n - k - 1}$, with $e$ being the vector of residuals.
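Both the unbiasedness property and the variance formula can be checked numerically with a small simulation. The sketch below assumes a fixed design matrix, normally distributed errors, and made-up values for the true parameters, the sample size, and the number of replications; none of these come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is reproducible
n, sigma = 50, 1.0                      # hypothetical sample size and error s.d.
beta = np.array([1.0, 2.0])             # hypothetical true intercept and slope

# Fixed design matrix: a column of ones plus one predictor.
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                       # beta_hat = A @ Y

# Draw many error vectors and re-estimate beta each time.
reps = 5000
eps = rng.normal(0.0, sigma, size=(reps, n))
Y = X @ beta + eps                      # shape (reps, n); one sample per row
beta_hats = Y @ A.T                     # shape (reps, 2); one estimate per row

mean_estimate = beta_hats.mean(axis=0)
empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv

print("mean of estimates:", mean_estimate)
print("empirical covariance:\n", empirical_cov)
print("theoretical covariance:\n", theoretical_cov)
```

Across replications, the average estimate should sit close to the true `beta`, and the empirical covariance of the estimates should approximate $\sigma^2 (X^T X)^{-1}$, with the agreement improving as `reps` grows.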

Which assumption is NOT required for the OLS estimators to be unbiased (part of the Gauss-Markov theorem)?

Goodness of Fit and Inference

To assess how well the model fits the data, we decompose the total variation in the dependent variable into explained and unexplained components.

  • Total Sum of Squares (SST): Measures the total variation in $Y$ around its mean. $\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2$
  • Model/Explained Sum of Squares (SSM): Measures the variation in $Y$ explained by the regression model. $\text{SSM} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$
  • Residual/Error Sum of Squares (SSR): Measures the variation in $Y$ not explained by the model. $\text{SSR} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n e_i^2$

For a model that includes an intercept, the relationship is $\text{SST} = \text{SSM} + \text{SSR}$.

Coefficient of Determination ($R^2$)

The $R^2$ statistic represents the proportion of variance in the dependent variable explained by the independent variables in the model.

$$R^2 = \frac{\text{SSM}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}}$$

For a model that includes an intercept, $0 \leq R^2 \leq 1$. Adding more predictors to a model will mechanically never decrease $R^2$, even if the predictors are irrelevant. To account for this, the Adjusted $R^2$ applies a degrees-of-freedom penalty for each added variable, so it rises only when a new variable improves the fit enough to offset that penalty:

$$\bar{R}^2 = 1 - \left( \frac{\text{SSR} / (n - k - 1)}{\text{SST} / (n - 1)} \right) = 1 - (1 - R^2)\frac{n-1}{n-k-1}$$
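The two statistics can be computed directly from the observed and fitted values. A minimal sketch in plain Python follows; the function name `r_squared` and the data are made up for illustration.

```python
def r_squared(y, y_hat, k):
    """Return (R^2, adjusted R^2) for observed y, fitted y_hat, and k predictors."""
    n = len(y)
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)                   # total variation
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))      # unexplained variation
    r2 = 1.0 - ssr / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

# Made-up data; y_hat are the OLS fitted values 2.2 + 0.6 * x for x = 1..5.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.2 + 0.6 * x for x in (1, 2, 3, 4, 5)]
r2, adj_r2 = r_squared(y, y_hat, k=1)
print(r2, adj_r2)
```

Here SST is 6 and SSR is 2.4, so $R^2 = 0.6$, while the adjusted value is $1 - 0.4 \cdot 4/3 \approx 0.467$. The gap between the two is large because the sample is tiny relative to the number of estimated parameters.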

Hypothesis Testing

Under the assumption that $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the OLS estimators are normally distributed:

$$\hat\beta \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1})$$

Test of Individual Significance (t-test)

To test the hypothesis that a single independent variable $X_j$ has no effect on $Y$ (i.e., $H_0: \beta_j = 0$), a t-statistic is used:

$$t = \frac{\hat\beta_j - 0}{\text{SE}(\hat\beta_j)}$$

where $\text{SE}(\hat\beta_j)$ is the standard error of the estimate, given by the square root of the diagonal element of the estimated variance-covariance matrix $s^2(X^T X)^{-1}$ that corresponds to $\hat\beta_j$. Under the null hypothesis, this statistic follows a Student's t-distribution with $n - k - 1$ degrees of freedom.
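The chain from residuals to $s^2$, standard errors, and t-statistics can be sketched in a few lines of NumPy. The function name `ols_t_statistics` and the data are hypothetical; comparing each statistic to a critical value would additionally require a t-distribution routine, which is not shown here.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Return (beta_hat, standard_errors, t_statistics) for H0: beta_j = 0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape                      # p = k + 1 columns, including the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    s2 = residuals @ residuals / (n - p)        # s^2 = e'e / (n - k - 1)
    cov_hat = s2 * XtX_inv                      # estimated Var(beta_hat)
    se = np.sqrt(np.diag(cov_hat))
    return beta_hat, se, beta_hat / se

# Made-up data with a single predictor (k = 1), for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

beta_hat, se, t_stats = ols_t_statistics(X, y)
print("estimates:", beta_hat)
print("standard errors:", se)
print("t statistics:", t_stats)
```

For this data, $s^2 = 2.4 / 3 = 0.8$, the slope's standard error is $\sqrt{0.8 / 10} \approx 0.283$, and its t-statistic is about 2.12 on 3 degrees of freedom.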

Test of Overall Significance (F-test)

To test the joint hypothesis that all slope coefficients (excluding the intercept) are simultaneously zero ($H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$), an F-statistic is constructed from the sums of squares:

$$F = \frac{\text{SSM} / k}{\text{SSR} / (n - k - 1)}$$

Under the null hypothesis, this follows an F-distribution with $(k, n - k - 1)$ degrees of freedom. A large F-statistic provides evidence against the null hypothesis, indicating that at least one predictor variable is significantly related to the response variable.
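When only $R^2$ is reported, the same statistic can be recovered without the individual sums of squares. Substituting $\text{SSM} = R^2 \, \text{SST}$ and $\text{SSR} = (1 - R^2) \, \text{SST}$, which follow from the definition of $R^2$, gives:

```latex
\begin{aligned}
F &= \frac{\text{SSM} / k}{\text{SSR} / (n - k - 1)}
   = \frac{R^2 \, \text{SST} / k}{(1 - R^2) \, \text{SST} / (n - k - 1)} \\
  &= \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}
\end{aligned}
```

For illustrative values $R^2 = 0.75$, $k = 3$, and $n = 100$, this gives $F = (0.75 / 3) / (0.25 / 96) = 96$.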

Analyzing Real Estate Valuation

A data scientist constructs a multiple linear regression model to predict the price of houses ($Y$, in thousands of dollars) based on square footage ($X_1$), age of the house ($X_2$, in years), and distance to the city center ($X_3$, in miles). The estimated model is $\hat{Y} = 150 + 0.2X_1 - 1.5X_2 - 5.0X_3$. The $R^2$ is 0.75, the Adjusted $R^2$ is 0.74, and the sample size is $n=100$. The standard error for $\hat\beta_2$ is $0.5$.

You want to formally test if the age of the house has a statistically significant effect on the price at a 5% significance level. Calculate the t-statistic for $\hat\beta_2$ and describe the conclusion. Assume the critical t-value for $df = 96$ at $\alpha=0.05$ (two-tailed) is approximately 1.98.
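One way to work through the arithmetic is sketched below, using only the figures quoted in the case study. The variable names are made up for illustration, and the adjusted $R^2$ line is an added consistency check rather than part of the question.

```python
# Figures quoted in the case study above.
beta2_hat = -1.5      # estimated coefficient on age (thousand dollars per year)
se_beta2 = 0.5        # standard error of that estimate
t_critical = 1.98     # two-tailed 5% critical value for df = 96, as given
n, k = 100, 3         # observations and predictors
r2 = 0.75

t_stat = (beta2_hat - 0.0) / se_beta2
df = n - k - 1
reject_null = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f} with df = {df}; reject H0: {reject_null}")

# Cross-check of the reported adjusted R^2 from R^2, n, and k.
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
print(f"adjusted R^2 = {adj_r2:.4f}")
```

Since $|t| = 3.0$ exceeds 1.98, the null hypothesis $H_0: \beta_2 = 0$ is rejected at the 5% level: each additional year of age is associated with an estimated price that is about $1,500 lower, holding square footage and distance fixed. The recomputed adjusted $R^2$ of roughly 0.742 agrees with the reported 0.74.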

Residual Diagnostics

Estimation is only part of the process; structural validation ensures the model assumptions hold. Analyzing the residuals ($e_i$) is the primary tool for diagnostics.

  • Non-linearity: Plot the residuals against the predicted values ($\hat{Y}_i$) or against individual predictors ($X_j$). A U-shape or other systematic pattern suggests the relationship is non-linear, perhaps requiring polynomial terms or transformations.
  • Heteroscedasticity: If the spread of the residuals increases or decreases with $\hat{Y}_i$ (often forming a "funnel" shape in a residual plot), the constant variance assumption is violated. The coefficient estimates remain unbiased, but the usual OLS standard errors are incorrect, which invalidates hypothesis tests. Robust standard errors or Weighted Least Squares (WLS) can address this.
  • Non-normality: A Normal Q-Q (quantile-quantile) plot compares the distribution of the residuals to a theoretical normal distribution. Significant deviations from the straight line, particularly at the tails, imply non-normal errors.
  • Outliers and Leverage: Observations with extreme $Y$ values given their $X$ values are outliers. Observations with extreme $X$ values have high leverage. Points with both high leverage and large residuals exert undue influence on the regression line. Cook's Distance is a metric used to quantify the overall influence of an observation on the estimated coefficients.

$$D_i = \frac{\sum_{j=1}^n (\hat{Y}_j - \hat{Y}_{j(i)})^2}{(k+1)s^2}$$

where $\hat{Y}_{j(i)}$ is the predicted value of the $j$-th observation when the model is refitted without the $i$-th observation. A high Cook's distance indicates a highly influential data point.
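The definition can be implemented literally by refitting the model once per deleted observation. The NumPy sketch below does exactly that; the function name `cooks_distance` and the data are made up, with the last point deliberately given an extreme $X$ value that lies off the trend of the others.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation, computed by refitting without it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape                                  # p = k + 1
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    residuals = y - fitted
    s2 = residuals @ residuals / (n - p)            # s^2 from the full fit

    distances = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                    # drop observation i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        fitted_i = X @ beta_i                       # predictions for all n points
        distances[i] = np.sum((fitted - fitted_i) ** 2) / (p * s2)
    return distances

# Made-up data: the last point has an extreme x value and lies off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
X = np.column_stack([np.ones_like(x), x])

d = cooks_distance(X, y)
print(np.round(d, 3))
print("most influential observation:", int(np.argmax(d)))
```

Refitting $n$ times is easy to read but costly for large samples. An equivalent closed form based on the residuals and the diagonal of the hat matrix $X (X^T X)^{-1} X^T$ avoids the loop and gives the same values.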

Regression analysis is a foundation for predictive modeling and causal inference, bridging classical statistics and modern machine learning.
