This tutorial explains what Maximum Likelihood Estimation (MLE) is and how it can be used in linear regression.
Basics of Maximum Likelihood Estimation
Before discussing linear regression, we need a basic idea of MLE. In particular, assume that the \( y_i \) are all independently and identically distributed (i.i.d.), following a normal distribution with unknown mean \( \mu \) and variance \( \sigma^2 \).
\( y_i \sim N (\mu, \sigma^2) \)
Thus, the probability density of a single observation, conditional on the parameters, is as follows.
\( Pr(y_i | \mu, \sigma^2)=\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-\mu}{\sigma})^2} \)
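As a quick numerical sanity check, here is a minimal sketch assuming Python with NumPy and SciPy; the parameter values and the observation are made up for illustration. It evaluates the density by hand and compares it against scipy.stats.norm.pdf.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameter values and a single observation, for illustration only.
mu, sigma2 = 2.0, 1.5
y_i = 2.7

# The density written out exactly as in the formula above.
by_hand = 1.0 / np.sqrt(2 * np.pi * sigma2) * np.exp(-0.5 * (y_i - mu) ** 2 / sigma2)

# SciPy's normal density takes the standard deviation (scale), not the variance.
via_scipy = norm.pdf(y_i, loc=mu, scale=np.sqrt(sigma2))

print(by_hand, via_scipy)  # The two numbers should agree up to floating-point error.
```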
Given that we have \( n \) i.i.d. \( y_i \), we can write the joint probability of all of them, conditional on the parameters.
\( Pr( \{y_i \}_{i=1}^n | \mu, \sigma^2)=\prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-\mu}{\sigma})^2} \)
To estimate the unknown \( \mu \), we use the criterion of maximizing the joint probability shown above, viewed as a function of \( \mu \) (i.e., the likelihood). Such a process of finding model parameters (e.g., \( \mu \)) based on the criterion of maximizing the likelihood function is called maximum likelihood estimation (MLE).
To make the calculation easier, we can take the log-likelihood, which turns the product into a sum.
\( \begin{aligned} L(\mu) = log Pr( \{y_i \}_{i=1}^n | \mu, \sigma^2) &=\sum_{i=1}^n log (\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-\mu}{\sigma})^2}) \\ &= -n log \sigma - \frac{n}{2} log 2 \pi - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i-\mu)^2 \end{aligned} \)
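To see the log transformation in action, here is a small sketch assuming Python with NumPy; the simulated data and parameter values are arbitrary choices for illustration. It computes the log-likelihood both as a sum of log densities (first line of the derivation) and via the closed form (last line), confirming they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.3                 # Hypothetical parameter values.
y = rng.normal(mu, sigma, size=100)  # Simulated i.i.d. observations.
n = y.size

# Sum of log densities, as on the first line of the derivation.
sum_of_logs = np.sum(np.log(1 / np.sqrt(2 * np.pi * sigma**2)
                            * np.exp(-0.5 * ((y - mu) / sigma) ** 2)))

# Closed form from the last line of the derivation.
closed_form = (-n * np.log(sigma) - n / 2 * np.log(2 * np.pi)
               - np.sum((y - mu) ** 2) / (2 * sigma**2))

print(sum_of_logs, closed_form)  # Both should match to floating-point precision.
```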
Then, differentiate it with respect to \( \mu \) and set it to zero.
\( \frac{\partial L(\mu) }{ \partial \mu} =\frac{1}{\sigma^2} \sum_{i=1}^n (y_i-\mu)=0\)
Solving for \( \mu \), we get the following.
\( \mu=\frac{1}{n} \sum_{i=1}^n y_i\)
Thus, regardless of \( \sigma^2 \), the MLE of \( \mu \) is the sample average of the data.
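We can verify this numerically. The sketch below (assuming Python with NumPy and SciPy; the simulated data are purely illustrative) maximizes the log-likelihood over \( \mu \) with a generic optimizer and compares the result with the sample average.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=500)  # Simulated data; true mean 5.0 is arbitrary.
sigma = 2.0                         # Treat the variance as known for this check.

# Negative log-likelihood in mu; constant terms do not change the maximizer.
def neg_log_lik(mu):
    return np.sum((y - mu) ** 2) / (2 * sigma**2)

result = minimize_scalar(neg_log_lik)
print(result.x, y.mean())  # The numerical MLE of mu matches the sample average.
```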
Maximum Likelihood Estimation in Linear Regression
The linear regression model can be stated as follows.
\( Y = X \beta + \epsilon \)
where:
- Y is the response variable.
- X is the design matrix.
- \( \beta \) represents the coefficients.
- \( \epsilon \) is the error term.
Assume each element of \( \epsilon \) to follow the normal distribution \( N(0, \sigma^2)\). Thus, for each observation, we can see that \( y_i - x_i^T \beta \sim N(0, \sigma^2)\).
Thus, we can get the following.
\( Pr(y_i | x_i, \beta, \sigma^2)=\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-x_i^T \beta}{\sigma})^2} \)
We assume the errors \( \epsilon_i \) to be i.i.d., and thus we can write the joint probability as follows.
\( Pr( \{y_i \}_{i=1}^n | \{x_i \}_{i=1}^n, \beta, \sigma^2)=\prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2}(\frac{y_i-x_i^T \beta}{\sigma})^2} \)
We can rewrite it in matrix form, with the help of the multivariate normal distribution.
\( \begin{aligned} Pr( \{y_i \}_{i=1}^n | \{x_i \}_{i=1}^n, \beta, \sigma^2) &=N(Y|X \beta, \sigma^2 I) \\ &= (2 \sigma^2 \pi)^{-n/2} \exp(-\frac{1}{2\sigma^2} (X\beta-Y)^T(X\beta-Y)) \end{aligned} \)
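The equivalence between the product of univariate densities and the multivariate normal \( N(Y|X\beta, \sigma^2 I) \) can also be checked numerically. The following is a minimal sketch assuming Python with NumPy and SciPy; the simulated design matrix, coefficients, and variance are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
n, p = 20, 3                         # Small problem sizes, chosen arbitrarily.
X = rng.normal(size=(n, p))          # Simulated design matrix.
beta = np.array([1.0, -2.0, 0.5])    # Hypothetical coefficients.
sigma2 = 0.8                         # Hypothetical error variance.
Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Product of the n univariate densities, computed on the log scale for stability.
log_prod = np.sum(norm.logpdf(Y, loc=X @ beta, scale=np.sqrt(sigma2)))

# Multivariate normal with mean X*beta and covariance sigma^2 * I.
log_mvn = multivariate_normal.logpdf(Y, mean=X @ beta, cov=sigma2 * np.eye(n))

print(log_prod, log_mvn)  # The two log-densities should agree.
```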
Similar to the above, we can take the natural log first to make the calculation of the partial derivative with respect to \( \beta \) easier.
\( log Pr( \{y_i \}_{i=1}^n | \{x_i \}_{i=1}^n, \beta, \sigma^2)=-\frac{n}{2} log(2 \sigma^2 \pi)-\frac{1}{2\sigma^2} (X\beta-Y)^T(X\beta-Y) \)
Since only the second term depends on \( \beta \), maximizing with respect to \( \beta \) simplifies as follows.
\( \beta^{MLE} = \arg\max_{\beta} \{ -\frac{1}{2\sigma^2} (X\beta-Y)^T(X\beta-Y) \} \)
To maximize the value above, considering the negative sign (and noting that the constant \( \frac{1}{2\sigma^2} \) does not affect the optimizer), we need to minimize the quadratic term.
\( \beta^{MLE} = \arg\min_{\beta} \{ (X\beta-Y)^T(X\beta-Y) \} \)
We then take the partial derivative with respect to \( \beta \) and set it to zero.
\( \frac {\partial }{\partial \beta} (X\beta-Y)^T(X\beta-Y) =0 \)
Working out the partial derivative with respect to \( \beta \), we get the following.
\( \begin{aligned} (X \beta-Y)^T X + (X \beta - Y)^T X &=0 \\ 2(X \beta-Y)^T X & = 0 \\ (X \beta-Y)^T X & = 0 \\ (\beta^T X^T-Y^T) X & = 0 \\ \beta^T X^TX & =Y^TX \\ \beta^T & =Y^TX (X^TX)^{-1} \\ \beta &= (X^T X)^{-1}X^T Y \\ \hat{\beta} &= (X^T X)^{-1}X^T Y\end{aligned}\)
Thus, we can see that the estimated \( \hat{\beta} \) from MLE is exactly the same as the \( \hat{\beta} \) from ordinary least squares (OLS). For more on the calculation process, refer to the CMU link at the end of this tutorial.
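The closed-form result is easy to verify in code. The sketch below (assuming Python with NumPy; the simulated data are for illustration only) computes \( \hat{\beta} = (X^T X)^{-1} X^T Y \) directly and compares it with NumPy's least-squares solver, which should give essentially the same answer.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))               # Simulated design matrix.
beta_true = np.array([2.0, -1.0, 0.5])    # Hypothetical true coefficients.
Y = X @ beta_true + rng.normal(scale=0.7, size=n)

# Closed-form MLE / OLS estimate from the derivation above.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# The same fit via NumPy's least-squares solver (numerically more stable in practice).
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(beta_lstsq)  # Both estimates should be essentially identical.
```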
Estimating Variance in MLE
Earlier, when we estimated \( \beta \), we dropped \( \sigma^2 \) because it does not affect the maximizer. However, in order to have a full picture of the distribution of \( \epsilon \), we need to know the variance. If we have \( n \) observations of \( x_i \) and \( y_i \), we need to estimate the variance-covariance matrix of the multivariate normal distribution.
Note that, here, we assume the variance-covariance matrix is diagonal (i.e., the covariances are zero, given the i.i.d. assumption) with the same variance, namely \( \sigma^2 \), on the diagonal.
\( \sigma^{MLE}=\arg\max_{\sigma} \{-\frac{n}{2} log(2 \sigma^2 \pi)-\frac{1}{2\sigma^2} (X\beta-Y)^T(X\beta-Y) \} \)
Note that, since we already know \( \hat{\beta} \) from the estimation above, we can replace \( \beta \) with it.
\( \sigma^{MLE}=\arg\max_{\sigma} \{-\frac{n}{2} log(2 \sigma^2 \pi)-\frac{1}{2\sigma^2} (X \hat{\beta}-Y)^T(X\hat{\beta}-Y) \} \)
Since we already know \( X \), \( Y \), and \( \hat{\beta} \), to maximize the statement above, we take the partial derivative with respect to \( \sigma^2 \) and set it to zero.
\( \begin{aligned} \frac{\partial}{\partial \sigma^2} \{-\frac{n}{2} log(2 \sigma^2 \pi)-\frac{1}{2\sigma^2} (X \hat{\beta}-Y)^T(X\hat{\beta}-Y) \} &= 0 \\ -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} (X \hat{\beta}-Y)^T(X\hat{\beta}-Y) &= 0 \\ \sigma^2 &= \frac{1}{n}(X \hat{\beta}-Y)^T(X \hat{\beta}-Y) \\ \hat{\sigma}^2 &= \frac{1}{n}(X \hat{\beta}-Y)^T(X \hat{\beta}-Y) \end{aligned}\)
Thus, we can see that the estimated \( \sigma^2 \) (i.e., the variance) is the mean of the squared deviations between the predictions from \( \hat{\beta} \) and the actual data. Note that \( \hat{\sigma}^2 \) replaces \( \sigma^2 \) to indicate that it is an estimated value.
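The following sketch (again assuming Python with NumPy on simulated data) computes \( \hat{\sigma}^2 \) as derived above and, for comparison, the familiar unbiased estimator that divides by \( n - p \) instead of \( n \); the two are close for large \( n \), with the MLE slightly smaller.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 3
X = rng.normal(size=(n, p))               # Simulated design matrix.
beta_true = np.array([1.5, -0.5, 2.0])    # Hypothetical true coefficients.
Y = X @ beta_true + rng.normal(scale=1.0, size=n)  # True sigma^2 = 1.0.

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
residuals = X @ beta_hat - Y

# MLE of the variance: mean of the squared residuals, as derived above.
sigma2_mle = residuals @ residuals / n

# The familiar unbiased estimator divides by n - p instead of n.
sigma2_unbiased = residuals @ residuals / (n - p)

print(sigma2_mle, sigma2_unbiased)
```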
References and Further Reading
There are a few other tutorials related to this topic that you can refer to.
- Penn State: Multivariate Normal Distribution
- CMU: Matrix MLE for Linear Regression
- Princeton: Linear Regression via Maximization of the Likelihood
In another tutorial about linear mixed models in SPSS, I discussed the degrees of freedom for MLE. As we can see, if we use MLE, we use one more degree of freedom to estimate the variance, assuming the same value on the diagonal of the variance-covariance matrix (i.e., a scalar matrix).