
OLS vs. MLE in Linear Regression

This tutorial compares OLS (Ordinary Least Squares) and Maximum Likelihood Estimation (MLE) in linear regression. We use simple linear regression as the example here. Most of the conclusions extend directly to general linear regression.

OLS in Linear Regression

Coefficients

The principle of ordinary least squares is to minimize the sum of squared deviations from the fitted line. Thus, we want to find the values of \beta_0 and \beta_1, say \hat{\beta}_0 and \hat{\beta}_1, that minimize the sum.

S= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2

We can estimate them by taking the partial derivatives of S with respect to \beta_0 and \beta_1 and setting them to zero.

2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)(-1) =0

2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)(-x_i) =0

We can then get \hat{\beta}_0 and \hat{\beta}_1 as follows.

\begin{aligned} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &=\bar{y}-\hat{\beta}_1 \bar{x}\end{aligned}
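
As a quick numerical illustration, here is a minimal Python sketch that computes \hat{\beta}_0 and \hat{\beta}_1 from the closed-form formulas above and cross-checks them against NumPy's least-squares fit. The data points are made up purely for illustration.

import numpy as np

# Hypothetical example data (any x, y pairs would work)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * y) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check with NumPy's degree-1 least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)   # should match (intercept, slope)
print(intercept, slope)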

Variance

SSE = \sum_{i=1}^n \hat{e}_i^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2

Here \hat{e}_i = y_i - \hat{\beta}_0-\hat{\beta}_1 x_i is the i-th residual, and SSE is the error sum of squares.

Per Bain and Engelhardt’s “Introduction to Probability and Mathematical Statistics” (p. 502), the principle of ordinary least squares “does not provide a direct estimate of \sigma^2, but the magnitude of the variance is reflected in the quantity SSE.” Thus, we can further get the unbiased estimate of \sigma^2 as follows.

\tilde{\sigma}^2 =\frac{SSE}{n-2} =\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2}

We can then get the variances of the estimates \hat{\beta}_1 and \hat{\beta}_0 as follows.

\begin{aligned} Var (\hat{\beta}_1) &= \frac{\sigma^2}{ \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Var(\hat{\beta}_0) &= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } \end{aligned}
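
Continuing the hypothetical example above (reusing x, y, beta0_hat, and beta1_hat), the following sketch computes SSE, the unbiased estimate \tilde{\sigma}^2 = SSE/(n-2), and the resulting standard errors of the coefficient estimates.

n = len(x)

# Residuals and error sum of squares
residuals = y - beta0_hat - beta1_hat * x
sse = np.sum(residuals ** 2)

# Unbiased estimate of sigma^2 (divide by n - 2, not n)
sigma2_tilde = sse / (n - 2)

# Plug sigma2_tilde into the variance formulas above
s_xx = np.sum((x - x.mean()) ** 2)
var_beta1 = sigma2_tilde / s_xx
var_beta0 = sigma2_tilde * np.sum(x ** 2) / (n * s_xx)
print(np.sqrt(var_beta1), np.sqrt(var_beta0))   # estimated standard errors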


MLE in Linear Regression

There is another tutorial about MLE in linear regression in matrix form. The following is the basic model for simple linear regression.

Y = \beta_0 + \beta_1 X +\epsilon

Assume the errors \epsilon_i are i.i.d. and follow a normal distribution N(0, \sigma^2) .

Pr( \{y_i \}_{i=1}^n | \{x_i \}_{i=1}^n, \beta_0, \beta_1, \sigma^2)=\prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-(\beta_0+x_i \beta_1)}{\sigma})^2}

We can then take the log of the likelihood, calculate the partial derivatives, and set them to zero to get the estimates as follows. (Regarding the specific steps, please refer to this tutorial.)

\begin{aligned} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &=\bar{y}-\hat{\beta}_1 \bar{x} \\ \hat{\sigma}^2 &=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n} \end{aligned}
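
To see numerically that maximizing the likelihood gives the same coefficient estimates (and SSE/n for \sigma^2), here is a sketch that minimizes the negative log-likelihood with scipy.optimize.minimize on the same hypothetical x and y used earlier. The starting values in x0 are arbitrary.

from scipy.optimize import minimize

def neg_log_likelihood(params, x, y):
    """Negative log-likelihood of simple linear regression with normal errors."""
    b0, b1, log_sigma2 = params          # optimize log(sigma^2) to keep it positive
    sigma2 = np.exp(log_sigma2)
    resid = y - b0 - b1 * x
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid ** 2) / sigma2

result = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.0], args=(x, y))
b0_mle, b1_mle, log_sigma2_mle = result.x

print(b0_mle, b1_mle)            # should match beta0_hat, beta1_hat from OLS
print(np.exp(log_sigma2_mle))    # should be close to SSE / n (the biased estimate)
print(sse / n)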

According to Bain’s book (p. 510), if y_i=\beta_0+\beta_1 x_i +\epsilon_i with independent errors \epsilon_i \sim N(0, \sigma^2), the MLEs of \beta_0 and \beta_1 have a bivariate normal distribution with E(\hat{\beta}_0) =\beta_0, E(\hat{\beta}_1) =\beta_1, and

\begin{aligned} Var (\hat{\beta}_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Var(\hat{\beta}_0) &= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Cov(\hat{\beta}_0, \hat{\beta}_1) &= -\frac{\bar{x} \sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\end{aligned}
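
A one-line derivation of the covariance term may help. Since \hat{\beta}_0=\bar{y}-\hat{\beta}_1 \bar{x} and Cov(\bar{y}, \hat{\beta}_1)=0 (because \sum_{i=1}^n (x_i-\bar{x})=0), we have

Cov(\hat{\beta}_0, \hat{\beta}_1) = Cov(\bar{y}-\hat{\beta}_1 \bar{x}, \hat{\beta}_1) = -\bar{x} Var(\hat{\beta}_1) = -\frac{\bar{x} \sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}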

As we can see, the variances of \hat{\beta}_0 and \hat{\beta}_1 are the same as those in OLS. The question that remains is how to estimate \sigma^2. Note that, in the context of MLE, the UMVUE of \sigma^2 is the same as the unbiased estimate in OLS.

\tilde{\sigma}^2=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2}

Since we know that \hat{\beta}_0 and \hat{\beta}_1 follow a bivariate normal distribution, we can get the following.

  1. Z_1 = \frac{ \hat{\beta}_1- \beta_1}{\sqrt{\frac{\sigma^2 }{ \sum_{i=1}^n (x_i-\bar{x})^2 }}} =\frac{ \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}(\hat{\beta}_1- \beta_1)}{\sigma} \sim N(0, 1)
  2. Z_0 = \frac{ \hat{\beta}_0- \beta_0}{ \sqrt{\frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } }} =\frac{ \hat{\beta}_0- \beta_0}{\sigma \sqrt{\frac{ \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } }} \sim N(0, 1)
  3. V=\frac{(n-2)\tilde{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)
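
The following self-contained Monte Carlo sketch simulates many datasets from the model with hypothetical true parameters (\beta_0=1, \beta_1=2, \sigma=1.5) and checks that the sample moments of Z_1 and V are close to the theoretical ones; Z_0 behaves analogously.

import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma_true = 1.0, 2.0, 1.5   # hypothetical true values
x_sim = np.linspace(0.0, 10.0, 20)
n_sim = len(x_sim)
s_xx_sim = np.sum((x_sim - x_sim.mean()) ** 2)

z1_all, v_all = [], []
for _ in range(5000):
    y_sim = beta0_true + beta1_true * x_sim + rng.normal(0.0, sigma_true, n_sim)
    b1 = np.sum((x_sim - x_sim.mean()) * y_sim) / s_xx_sim
    b0 = y_sim.mean() - b1 * x_sim.mean()
    sigma2_t = np.sum((y_sim - b0 - b1 * x_sim) ** 2) / (n_sim - 2)
    z1_all.append(np.sqrt(s_xx_sim) * (b1 - beta1_true) / sigma_true)
    v_all.append((n_sim - 2) * sigma2_t / sigma_true ** 2)

print(np.mean(z1_all), np.var(z1_all))   # approximately 0 and 1, as Z_1 ~ N(0, 1)
print(np.mean(v_all))                    # approximately n_sim - 2, as V ~ chi-square(n-2)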

Note that Z_1 and Z_0 are independent of V. Further, we know that a t statistic is the ratio of (1) a standard normal variable and (2) the square root of a chi-square variable divided by its degrees of freedom. The degrees of freedom of the t statistic are determined by the chi-square in the denominator.

T=\frac{Z}{\sqrt{\frac{\chi^2(n)}{n}}} \sim t(n)

Thus, we can get the following t statistics. Note that the population parameter \sigma^2 cancels out in the final form. This means that, even if \sigma^2 is unknown, we can still compute the following.

  1. T_1=\frac{Z_1}{\sqrt{V/(n-2)}} = \frac{Z_1}{\sqrt{\tilde{\sigma}^2/ \sigma^2}} =\frac{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} (\hat{\beta}_1- \beta_1)} {\tilde{\sigma}} \sim t(n-2)
  2. T_0=\frac{Z_0}{\sqrt{V/(n-2)}} = \frac{Z_0}{\sqrt{\tilde{\sigma}^2/ \sigma^2}} =\frac{\hat{\beta}_0- \beta_0} {\tilde{\sigma} \sqrt{\frac{ \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } } } \sim t(n-2)
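
As a sanity check, the sketch below computes T_1 and T_0 for the common null values \beta_{10}=0 and \beta_{00}=0, reusing x, y, n, s_xx, sigma2_tilde, and the coefficient estimates from the earlier sketches, and compares them with the t values reported by statsmodels (assuming the statsmodels package is available).

import statsmodels.api as sm

# t statistics for H0: beta1 = 0 and H0: beta0 = 0, using the formulas above
sigma_tilde = np.sqrt(sigma2_tilde)
t1 = np.sqrt(s_xx) * (beta1_hat - 0.0) / sigma_tilde
t0 = (beta0_hat - 0.0) / (sigma_tilde * np.sqrt(np.sum(x ** 2) / (n * s_xx)))

# Compare with the t values from a standard OLS fit
model = sm.OLS(y, sm.add_constant(x)).fit()
print(t0, t1)
print(model.tvalues)   # [intercept, slope]; should match t0 and t1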

There are three possible situations for hypothesis testing here.

  1. When \sigma^2 is known, we can just use the standard normal statistics (i.e., Z_1 and Z_0). In particular, let z_1 and z_0 be the computed values of Z_1 and Z_0. If we set \alpha = 0.05, then z_{1-\alpha/2} =1.96.

    Slope:
    H_0: \beta_1 = \beta_{10} vs. H_1: \beta_1 \neq \beta_{10}. We reject H_0 if |z_1| \geq z_{1-\alpha/2}=1.96.

    Intercept:
    H_0: \beta_0 = \beta_{00} vs. H_1: \beta_0 \neq \beta_{00}. We reject H_0 if |z_0| \geq z_{1-\alpha/2}=1.96.
  2. However, typically, \sigma^2 is unknown, and thus we need to use a t-test (i.e., T_1 and T_0) to test the regression coefficients. In particular, t_1 and t_0 are the computed values of T_1 and T_0.

    Slope:
    H_0: \beta_1 = \beta_{10} vs. H_1: \beta_1 \neq \beta_{10}. We reject H_0 if |t_1| \geq t_{1-\alpha/2} (n-2).

    Intercept:
    H_0: \beta_0 = \beta_{00} vs. H_1: \beta_0 \neq \beta_{00}. We reject H_0 if |t_0| \geq t_{1-\alpha/2} (n-2).
  3. When \sigma^2 is unknown, we can use the statistic V to test hypotheses about \sigma^2 . Let v be the value of V computed with the null value \sigma^2_0 in the denominator, i.e., v = \frac{(n-2)\tilde{\sigma}^2}{\sigma^2_0}.

    H_0: \sigma^2 = \sigma^2_0 vs. H_1: \sigma^2 \neq \sigma^2_0. We reject H_0 if v \geq \chi^2_{1-\alpha/2} (n-2) or v \leq \chi^2_{\alpha/2} (n-2). (A numerical sketch of situations 2 and 3 follows this list.)
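
The critical values for situations 2 and 3 above come directly from scipy.stats. The sketch below continues the running example (reusing n, t0, t1, and sigma2_tilde) with a hypothetical null value \sigma^2_0 = 1.0.

from scipy import stats

alpha = 0.05

# Situation 2: t-test for the coefficients when sigma^2 is unknown
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
print(abs(t1) >= t_crit, abs(t0) >= t_crit)   # reject H0 for slope / intercept?

# Situation 3: chi-square test for H0: sigma^2 = sigma2_0
sigma2_0 = 1.0                                # hypothetical null value
v = (n - 2) * sigma2_tilde / sigma2_0
chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 2)
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 2)
print(v >= chi2_hi or v <= chi2_lo)           # reject H0 about sigma^2?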

Comparing OLS and MLE

There are a few observations when comparing OLS and MLE in linear regression.

  1. The estimates of the regression coefficients \beta_0 and \beta_1 are the same across OLS and MLE. This conclusion is not affected by (1) whether \sigma^2 is known or (2) how \sigma^2 is estimated.
  2. The estimate of \sigma^2 in OLS is (1) based on the general definition of variance (see this Wikipedia link) and (2) unbiased.

    In contrast, the estimate of \sigma^2 in MLE is (1) derived from the likelihood function and (2) biased. If you use the unbiased estimator \tilde{\sigma}^2, OLS and MLE have the same \tilde{\sigma}^2.
  3. We know that OLS does not assume a distribution for \epsilon when estimating the regression coefficients. However, when testing the significance of the estimated regression coefficients, we typically assume a normal distribution for \epsilon, the same as MLE. Note that, as shown above, T_0 and T_1 use the unbiased \tilde{\sigma}^2 in their final formulas. Thus, OLS and MLE eventually use the same t statistics for the regression coefficients.

    Look at this from another angle. We know that \tilde{\sigma}^2_{OLS}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2} and \hat{\sigma}^2_{MLE}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n} . That is, the difference is only in the denominator. However, when we construct V, \frac{(n-2) \tilde{\sigma}^2_{OLS}}{\sigma^2}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{\sigma^2} =\frac{n \hat{\sigma}^2_{MLE}}{\sigma^2} . Thus, OLS and MLE actually get the same V. Beyond that, OLS and MLE get the same Z_0 and Z_1. Thus, OLS and MLE generate the same t statistic, as the short check below also confirms.
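
A short numerical check of point 3, reusing sse and n from the earlier sketches:

sigma2_mle = sse / n          # MLE (biased) estimate
sigma2_ols = sse / (n - 2)    # unbiased estimate, i.e., sigma2_tilde

# Both products recover SSE, so both lead to exactly the same V = SSE / sigma^2
print(np.isclose(sigma2_ols * (n - 2), sigma2_mle * n))   # True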

Reference

There is a discussion on StackExchange on this topic (“Are t-statistic formulas and values for linear regression coefficients the same across OLS and MLE?”). Further, you can read Bain and Engelhardt’s “Introduction to Probability and Mathematical Statistics” (pp. 501–515).

Disclaimer

When crafting this tutorial, I made every effort to ensure the accuracy of the information provided. Nevertheless, I cannot guarantee its absolute correctness. Therefore, I strongly advise conducting your own research before reaching any conclusions based on the content presented here.
