OLS vs. MLE in Linear Regression

This tutorial compares OLS (Ordinary Least Squares) and Maximum Likelihood Estimation (MLE) in linear regression. We use simple linear regression as the example here. Most of the conclusions extend directly to general linear regression.

OLS in Linear Regression

Coefficients

The principle of ordinary least squares is to minimize the sum of squared deviations from the fitted line. Thus, we want to find the values of \( \beta_0 \) and \( \beta_1 \), say \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \), that minimize the sum.

\( S= \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \)

We can estimate them by taking the partial derivatives with respect to \( \beta_0 \) and \( \beta_1 \) and setting them to zero.

\( 2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)(-1) =0\)

\( 2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)(-x_i) =0\)

Solving these two equations gives \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) as follows.

\( \begin{aligned} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &=\bar{y}-\hat{\beta}_1 \bar{x}\end{aligned}\)
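
As a quick numerical check, here is a minimal sketch (using NumPy on simulated data; the true values 2 and 3 and all variable names are my own choices) that computes \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) from these closed-form expressions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known line: y = 2 + 3x + noise
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.5, size=n)

# Closed-form OLS estimates from the formulas above
beta1_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against numpy's least-squares polynomial fit
slope, intercept = np.polyfit(x, y, 1)
print(beta0_hat, beta1_hat)   # close to (2, 3) and to (intercept, slope)
```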

Variance

\( SSE = \sum_{i=1}^n \hat{e}_i^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2 \)

Here \( \hat{e}_i = y_i - \hat{\beta}_0-\hat{\beta}_1 x_i \) is the \(i\)-th residual and SSE is the error sum of squares.

Per Bain and Engelhardt’s “Introduction to Probability and Mathematical Statistics” (p. 502), the principle of ordinary least squares “does not provide a direct estimate of \( \sigma^2\), but the magnitude of the variance is reflected in the quantity SSE.” Thus, we can obtain an unbiased estimate of \(\sigma^2\) as follows.

\( \tilde{\sigma}^2 =\frac{SSE}{n-2} =\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2}\)

We can then get the variances of the estimates \( \hat{\beta}_1 \) and \( \hat{\beta}_0 \) as follows.

\( \begin{aligned} Var (\hat{\beta}_1) &= \frac{\sigma^2}{ \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Var(\hat{\beta}_0) &= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } \end{aligned} \)
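
Continuing the simulated data from the sketch above, the unbiased variance estimate and the resulting standard errors of the coefficients might be computed like this (substituting \( \tilde{\sigma}^2 \) for the unknown \( \sigma^2 \)).

```python
# Residuals and the unbiased variance estimate (continuing the sketch above)
resid = y - beta0_hat - beta1_hat * x
sse = np.sum(resid ** 2)
sigma2_tilde = sse / (n - 2)

# Plug sigma2_tilde in for the unknown sigma^2 to estimate the variances
sxx = np.sum((x - x.mean()) ** 2)
var_beta1 = sigma2_tilde / sxx
var_beta0 = sigma2_tilde * np.sum(x ** 2) / (n * sxx)
print(np.sqrt(var_beta1), np.sqrt(var_beta0))   # standard errors of slope and intercept
```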


MLE in Linear Regression

There is another tutorial about MLE in linear regression in matrix form. The following is the basic model for simple linear regression.

\( Y = \beta_0 + \beta_1 X +\epsilon \)

Assume the errors \( \epsilon_i \) are i.i.d. and follow the normal distribution \( N(0, \sigma^2) \).

\( Pr( \{y_i \}_{i=1}^n | \{x_i \}_{i=1}^n, \beta_0, \beta_1, \sigma^2)=\prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-(\beta_0+x_i \beta_1)}{\sigma})^2} \)

We can then take the log-likelihood, set its partial derivatives to zero, and obtain the estimates as follows. (Regarding the specific steps, please refer to this tutorial.)

\( \begin{aligned} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i-\bar{x})y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \hat{\beta}_0 &=\bar{y}-\hat{\beta}_1 \bar{x} \\ \hat{\sigma}^2 &=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n} \end{aligned} \)
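
To make this concrete, the sketch below (reusing the simulated x and y from the OLS section) maximizes the normal log-likelihood numerically with SciPy; under these assumptions it should recover the same \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) as the closed form, and a \( \hat{\sigma}^2 \) with n in the denominator.

```python
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_lik(params, x, y):
    """Negative normal log-likelihood for simple linear regression."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)   # parameterize on the log scale to keep sigma > 0
    return -np.sum(norm.logpdf(y, loc=b0 + b1 * x, scale=sigma))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], args=(x, y))
b0_mle, b1_mle = res.x[0], res.x[1]
sigma2_mle = np.exp(res.x[2]) ** 2

# b0_mle and b1_mle match the closed-form OLS estimates above;
# sigma2_mle matches sse / n (biased), not sse / (n - 2)
print(b0_mle, b1_mle, sigma2_mle, sse / n)
```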

According to Bain and Engelhardt’s book (p. 510), if \( y_i=\beta_0+\beta_1 x_i +\epsilon_i \) with independent errors \( \epsilon_i \sim N(0, \sigma^2)\), the MLEs of \(\beta_0 \) and \( \beta_1\) have a bivariate normal distribution with \(E(\hat{\beta}_0) =\beta_0\), \(E(\hat{\beta}_1) =\beta_1\), and the following.

\( \begin{aligned} Var (\hat{\beta}_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Var(\hat{\beta}_0) &= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } \\ Cov(\hat{\beta}_0, \hat{\beta}_1) &= -\frac{\bar{x} \sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\end{aligned} \)

As we can see, the variances of \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are the same as those in OLS. The question that remains is how to estimate \( \sigma^2\). Note that, in the context of MLE, the UMVUE of \(\sigma^2\) is the same as the unbiased estimate in OLS.

\( \tilde{\sigma}^2=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2}\)

Since we know that \( \hat{\beta}_0\) and \( \hat{\beta}_1\) follow a bivariate normal distribution, we can get the following (a simulation check follows the list).

  1. \( Z_1 = \frac{ \hat{\beta}_1- \beta_1}{\sqrt{\frac{\sigma^2 }{ \sum_{i=1}^n (x_i-\bar{x})^2 }}} =\frac{ \sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}(\hat{\beta}_1- \beta_1)}{\sigma} \sim N(0, 1)\)
  2. \( Z_0 = \frac{ \hat{\beta}_0- \beta_0}{ \sqrt{\frac{\sigma^2 \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } }} =\frac{ \hat{\beta}_0- \beta_0}{\sigma \sqrt{\frac{ \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } }} \sim N(0, 1)\)
  3. \( V=\frac{(n-2)\tilde{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)\)
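
As a hedge against algebra slips, a small Monte Carlo sketch (the simulation settings are my own choices, continuing the earlier data) can check that \( V=(n-2)\tilde{\sigma}^2/\sigma^2 \) indeed behaves like a \( \chi^2(n-2) \) variable, e.g. by comparing its sample mean and variance with the theoretical \( n-2 \) and \( 2(n-2) \).

```python
# Monte Carlo check that V = (n - 2) * sigma2_tilde / sigma^2 ~ chi-square(n - 2)
sigma_true = 1.5
v_samples = []
for _ in range(5000):
    e = rng.normal(0.0, sigma_true, size=n)
    y_sim = 2.0 + 3.0 * x + e
    b1 = np.sum((x - x.mean()) * y_sim) / sxx
    b0 = y_sim.mean() - b1 * x.mean()
    sse_sim = np.sum((y_sim - b0 - b1 * x) ** 2)
    v_samples.append(sse_sim / sigma_true ** 2)
v_samples = np.array(v_samples)
print(v_samples.mean(), n - 2)         # mean of chi-square(n - 2) is n - 2
print(v_samples.var(), 2 * (n - 2))    # variance is 2(n - 2)
```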

Note that \(Z_1 \) and \(Z_0\) are independent of \( V\). Further, we know that a t statistic is the ratio of (1) a standard normal and (2) the square root of (a chi-square divided by its degrees of freedom). Note that the degrees of freedom of the t statistic are determined by the chi-square in the denominator.

\( T=\frac{Z}{\sqrt{W/n}} \sim t(n), \quad \text{where } Z \sim N(0,1) \text{ and } W \sim \chi^2(n) \text{ are independent}\)

Thus, we can get the following t statistics. Note that the population parameter \( \sigma^2 \) cancels in the final form. This means that, even if \( \sigma^2\) is unknown, we can still compute the following (a numerical sketch follows the list).

  1. \( T_1=\frac{Z_1}{\sqrt{V/(n-2)}} = \frac{Z_1}{\sqrt{\tilde{\sigma}^2/ \sigma^2}} =\frac{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} (\hat{\beta}_1- \beta_1)} {\tilde{\sigma}} \sim t(n-2)\)
  2. \( T_0=\frac{Z_0}{\sqrt{V/(n-2)}} = \frac{Z_0}{\sqrt{\tilde{\sigma}^2/ \sigma^2}} =\frac{\hat{\beta}_0- \beta_0} {\tilde{\sigma} \sqrt{\frac{ \sum_{i=1}^n x_i^2}{ n \sum_{i=1}^n (x_i-\bar{x})^2 } } } \sim t(n-2)\)
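
As a sanity check, continuing the same simulated data, the t statistics computed from these formulas (here for the hypothetical nulls \( \beta_{10}=0 \) and \( \beta_{00}=0 \)) should match what a standard regression routine such as statsmodels reports.

```python
import statsmodels.api as sm

# t statistics from the formulas above, for the hypothetical nulls beta1 = 0 and beta0 = 0
sigma_tilde = np.sqrt(sigma2_tilde)
t1 = np.sqrt(sxx) * (beta1_hat - 0.0) / sigma_tilde
t0 = (beta0_hat - 0.0) / (sigma_tilde * np.sqrt(np.sum(x ** 2) / (n * sxx)))

# statsmodels reports the same t values (its order is [intercept, slope])
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(t0, t1)
print(fit.tvalues)
```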

There are 3 possible situations regarding inference tests here (a numerical sketch of all three follows this list).

  1. When \( \sigma^2 \) is known, we can just use the standard normal statistics (i.e., \( Z_1\) and \( Z_0\)). In particular, let \( z_1 \) and \( z_0 \) be the computed values of \( Z_1\) and \( Z_0\). If we set \( \alpha = 0.05\), then \(z_{1-\alpha/2} =1.96\).

    Slope:
    \(H_0: \beta_1 = \beta_{10} \) vs. \( H_1: \beta_1 \neq \beta_{10}\). We reject \( H_0 \) if \( |z_1| \geq z_{1-\alpha/2}=1.96\).

    Intercept:
    \(H_0: \beta_0 = \beta_{00} \) vs. \( H_1: \beta_0 \neq \beta_{00}\). We reject \( H_0 \) if \( |z_0| \geq z_{1-\alpha/2}=1.96\).
  2. However, typically, \( \sigma^2 \) is unknown, and thus we need to use a t test (i.e., \( T_1\) and \( T_0\)) to test the regression coefficients. In particular, let \( t_1 \) and \( t_0 \) be the computed values of \( T_1\) and \( T_0\).

    Slope:
    \(H_0: \beta_1 = \beta_{10} \) vs. \( H_1: \beta_1 \neq \beta_{10}\). We reject \( H_0 \) if \( |t_1| \geq t_{1-\alpha/2} (n-2)\).

    Intercept:
    \(H_0: \beta_0 = \beta_{00} \) vs. \( H_1: \beta_0 \neq \beta_{00}\). We reject \( H_0 \) if \( |t_0| \geq t_{1-\alpha/2} (n-2)\).
  3. When \( \sigma^2 \) is unknown, we can use the statistic \( V \) to test hypotheses about \( \sigma^2 \).

    \(H_0: \sigma^2 = \sigma^2_0 \) vs. \( H_1: \sigma^2 \neq \sigma^2_0\). We reject \( H_0 \) if \( v \geq \chi^2_{1-\alpha/2} (n-2)\) or \( v \leq \chi^2_{\alpha/2} (n-2)\).
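
The critical values and decisions for the three cases can be sketched with scipy.stats as follows; \( \alpha = 0.05 \) and the null value \( \sigma^2_0 = 1 \) are hypothetical choices (the coefficient nulls of 0 come from the earlier sketch), not part of the derivation above.

```python
from scipy.stats import norm, t, chi2

alpha = 0.05

# Case 1: sigma^2 known -- the critical value is the standard normal quantile
z_crit = norm.ppf(1 - alpha / 2)                 # about 1.96 for alpha = 0.05

# Case 2: sigma^2 unknown -- two-sided t tests for the coefficients
t_crit = t.ppf(1 - alpha / 2, df=n - 2)
reject_slope = abs(t1) >= t_crit
reject_intercept = abs(t0) >= t_crit

# Case 3: chi-square test for H0: sigma^2 = sigma0^2 (sigma0^2 = 1 is hypothetical)
sigma0_sq = 1.0
v = (n - 2) * sigma2_tilde / sigma0_sq
reject_sigma = v >= chi2.ppf(1 - alpha / 2, df=n - 2) or v <= chi2.ppf(alpha / 2, df=n - 2)

print(z_crit, t_crit, reject_slope, reject_intercept, reject_sigma)
```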

Comparing OLS and MLE

There are a few observations when comparing OLS and MLE in linear regression.

  1. The estimates of the regression coefficients \( \beta_0\) and \(\beta_1\) are the same across OLS and MLE. This conclusion does not depend on (1) whether \(\sigma^2\) is known or (2) how \( \sigma^2\) is estimated.
  2. The estimate of \( \sigma^2\) in OLS is (1) based on the general definition of variance (see the Wikipedia entry on variance) and (2) unbiased.

    In contrast, the estimate of \( \sigma^2\) in MLE is (1) derived from the likelihood function and (2) biased. If you use the unbiased estimate \(\tilde{\sigma}^2 \), OLS and MLE share the same \( \tilde{\sigma}^2 \).
  3. We know that OLS does not assume a distribution for \( \epsilon \) when estimating the regression coefficients. But when testing the significance of the estimated regression coefficients, we typically assume a normal distribution for \( \epsilon \), the same as MLE. Note that, as shown above, \( T_0 \) and \( T_1\) use the unbiased \( \tilde{\sigma}^2 \) in the final formulas. Thus, OLS and MLE eventually use the same t statistics for the regression coefficients.

    Look at this from another angle. We know that \( \hat{\sigma}^2_{OLS}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n-2} \) and \( \hat{\sigma}^2_{MLE}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{n} \). That is, the difference is only in the denominator. However, to construct \( V \), we have \( V = \frac{(n-2)\,\hat{\sigma}^2_{OLS}}{\sigma^2}=\frac{\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2}{\sigma^2} =\frac{n\,\hat{\sigma}^2_{MLE}}{\sigma^2} \). Thus, OLS and MLE actually get the same \(V\). Beyond that, OLS and MLE get the same \(Z_0\) and \(Z_1\). Thus, OLS and MLE will generate the same t statistics (see the short check below).
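
A short numerical check of this identity, continuing the sketches above:

```python
# The two variance estimates differ only in the denominator, so they reconstruct the same SSE
sigma2_ols = sse / (n - 2)   # unbiased estimate used for OLS-style inference
sigma2_mle = sse / n         # biased MLE
assert np.isclose((n - 2) * sigma2_ols, n * sigma2_mle)   # both equal SSE, hence the same V
```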

Reference

There is a discussion on StackExchange on this topic (“Are t-statistic formulas and values for linear regression coefficients the same across OLS and MLE?”). Further, you can read Bain and Engelhardt’s “Introduction to Probability and Mathematical Statistics” (p. 501–515).

Disclaimer

When crafting this tutorial, I made every effort to ensure the accuracy of the information provided. Nevertheless, I cannot guarantee its absolute correctness. Therefore, I strongly advise conducting your own research before reaching any conclusions based on the content presented here.
