This tutorial shows how to calculate p-values for linear regression coefficients. It includes the formulas and a worked data example in Python.

**Formulas for p-value in Linear Regression**

We can estimate the regression coefficient B using the following formula.

\[B =(X^TX)^{-1}X^TY\]

Where,

\[ B = \begin{bmatrix} b_0\\ b_1\\ b_2\end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \]

This calculation yields only the regression coefficients, not their p-values. To calculate a p-value, you first need the t-statistic, which is the ratio of the estimated coefficient to its standard error.

\( t = \frac{B}{SE}\)

where,

\( SE= \sqrt{\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{df} (X^TX)^{-1}} \)

Here \( \hat{Y}_i \) is the predicted value for observation *i*, and *df* is the number of observations minus the number of estimated coefficients (including the intercept). Note that the expression under the square root is a matrix: the standard errors of the coefficients, including the intercept, are the square roots of its diagonal entries. For simple linear regression with a single slope *b*_{1}, the SE of the slope reduces to the following.

\( SE_{b_1} = \sqrt{ \frac{1}{df}\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n (X_i-\bar{X})^2} } \)
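To see why the slope formula agrees with the matrix form, here is a short derivation for simple linear regression (one predictor plus an intercept). The shorthand \( S_{xx} \) is introduced here only for brevity and is not part of the formulas above.

\[ X^TX = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \quad \det(X^TX) = n\sum X_i^2 - \Big(\sum X_i\Big)^2 = n\,S_{xx}, \quad S_{xx} = \sum_{i=1}^n (X_i-\bar{X})^2 \]

\[ (X^TX)^{-1} = \frac{1}{n\,S_{xx}} \begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix} \]

The lower-right entry is \( n/(n\,S_{xx}) = 1/S_{xx} \), so multiplying by \( \sum (Y_i-\hat{Y}_i)^2/df \) and taking the square root gives exactly \( SE_{b_1} \). The upper-left entry gives the intercept in the same way: \( SE_{b_0} = \sqrt{ \frac{\sum (Y_i-\hat{Y}_i)^2}{df} \cdot \frac{\sum X_i^2}{n\,S_{xx}} } \).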

**Data Example for Calculating P-value for Linear Regression Coefficients**

Suppose we are interested in examining the effect of a teaching method (old method = 0 vs. new method = 1) on students’ test scores.

| Student ID | Teaching Method | Test Score |
|---|---|---|
| 1 | 0 | 90 |
| 2 | 0 | 90 |
| 3 | 1 | 100 |
| 4 | 0 | 70 |
| 5 | 1 | 90 |
| 6 | 1 | 95 |
| 7 | 1 | 75 |
| 8 | 1 | 75 |
| 9 | 0 | 60 |
| 10 | 0 | 65 |
| 11 | 0 | 65 |
| 12 | 1 | 70 |

Thus, the regression model is as follows.

\[ Y = b_0 + b_1 X \]

Replacing X and Y with the actual variable names,

\[ \text{Test Score} = b_0 + b_1 \times \text{Teaching Method} \]

We can run the following Python code to calculate *b*_{0} (the intercept) and *b*_{1} (the slope).

```
# Generate the X matrix
import numpy as np
X_rawdata = np.array([np.ones(12),[0,0,1,0,1,1,1,1,0,0,0,1]])
X_matrix=X_rawdata.T
# Print out X
print("X Matrix:\n", X_matrix)
# Generate the Y vector
Y_rawdata = np.array([[90,90,100,70,90,95,75,75,60,65,65,70]])
Y_vector=Y_rawdata.T
# Print out Y
print("Y Vector:\n",Y_vector)
# calculates X^T
X_matrix_T=X_matrix.transpose()
# calculates X^T X
X_T_X=np.matmul(X_matrix_T,X_matrix)
# calculates (X^T X)^(-1)
X_T_X_Inv=np.linalg.inv(X_T_X)
# calculates (X^T X)^(-1) X^T Y
B_bar=X_T_X_Inv@X_matrix_T@Y_vector
# print out Estimated B, namely B_bar
print("B_bar:\n",B_bar)
```

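The following is the output from the Python code above. In particular, *b*_{0} is 73.33 and *b*_{1} is 10.83, which is consistent with the results from another tutorial about the linear mixed effect model. Because Teaching Method is a 0/1 variable, *b*_{0} is the mean test score under the old method and *b*_{1} is the difference in mean scores between the new and old methods.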

X Matrix: [[1. 0.] [1. 0.] [1. 1.] [1. 0.] [1. 1.] [1. 1.] [1. 1.] [1. 1.] [1. 0.] [1. 0.] [1. 0.] [1. 1.]]

Y Vector: [[ 90] [ 90] [100] [ 70] [ 90] [ 95] [ 75] [ 75] [ 60] [ 65] [ 65] [ 70]]

B_bar: [[73.33333333] [10.83333333]]
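As a sanity check on the normal-equation result, the same coefficients can be obtained from NumPy's least-squares solver. This is a minimal sketch that rebuilds the data from the table above; the variable names (`x`, `y`, `X`, `B_normal`, `B_lstsq`) are chosen here for illustration and do not appear in the original code.

```python
import numpy as np

# Teaching method (0 = old, 1 = new) and test scores from the data table
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

# Design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])

# Coefficients from the normal equation B = (X^T X)^(-1) X^T Y
B_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Coefficients from the least-squares solver, which avoids forming the inverse
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Normal equation:", B_normal)
print("lstsq:          ", B_lstsq)
```

Both approaches should agree to floating-point precision. For larger or nearly collinear design matrices, the solver is generally the numerically safer choice than an explicit inverse.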

We can then calculate the p-value based on the formulas stated above. The following is the Python code.

```
# Calculate predicted y
Y_predicted = np.matmul(X_matrix,B_bar)
print("Y Predicted:\n", Y_predicted)
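# Calculate SE for regression coefficients
# df = number of observations - number of coefficients = 12 - 2 = 10
df = X_matrix.shape[0] - X_matrix.shape[1]
# Only the diagonal of the result is meaningful: the off-diagonal entries are
# square roots of negative covariances, so they print as nan (with a RuntimeWarning)
SE = np.sqrt((np.sum((Y_vector - Y_predicted) ** 2) / df) * X_T_X_Inv)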
print("SE for regression coefficients: \n", SE)
```

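The following is the output. We can see that the Standard Error (SE) for *b*_{0} (intercept) is 5.25, whereas the SE for *b*_{1} (slope) is 7.43. The off-diagonal entries are `nan` because they are square roots of negative values; they are not standard errors and can be ignored.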

Y Predicted: [[73.33333333] [73.33333333] [84.16666667] [73.33333333] [84.16666667] [84.16666667] [84.16666667] [84.16666667] [73.33333333] [73.33333333] [73.33333333] [84.16666667]]

SE for regression coefficients: [[5.25066133 nan] [ nan 7.42555647]]
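If only the standard errors are needed, one way to avoid the `nan` entries is to take the diagonal of the matrix before the square root. The following is a minimal, self-contained sketch of that variation; the names `cov_B`, `sigma2`, and `residuals` are illustrative and not from the original code.

```python
import numpy as np

# Teaching method (0 = old, 1 = new) and test scores from the data table
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones_like(x), x])

n, k = X.shape          # 12 observations, 2 coefficients
df = n - k              # residual degrees of freedom

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ y   # estimated coefficients [b0, b1]

residuals = y - X @ B
sigma2 = residuals @ residuals / df   # residual variance estimate

cov_B = sigma2 * XtX_inv              # covariance matrix of the coefficients
SE = np.sqrt(np.diag(cov_B))          # standard errors = sqrt of the diagonal

print("df:", df)
print("SE:", SE)
```

The square root is applied only to the variances on the diagonal, so no negative values are involved and no warning is raised.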

Based on that, we can calculate the t-statistic for both the intercept and the slope. The table at the end of this section summarizes all of the numbers.

\( t_{b0}= \frac{73.33}{5.25} =13.97, t_{b1}= \frac{10.83}{7.43} =1.46\)

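Note that the p-values are determined by the t-statistic and the degrees of freedom (df). Here df = 12 − 2 = 10 (12 observations minus 2 estimated coefficients), and the p-values are two-tailed. You can look them up with any t-distribution calculator or table.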

| | Estimate | SE | t | df | p-value |
|---|---|---|---|---|---|
| b_{0} | 73.33 | 5.25 | 13.97 | 10 | <.00001 |
| b_{1} | 10.83 | 7.43 | 1.46 | 10 | 0.175 |
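The t-statistics and p-values in the table can also be computed directly in Python rather than looked up. The sketch below assumes SciPy is installed and uses the survival function of the t distribution, `scipy.stats.t.sf`, doubled for a two-tailed test. The variable names are illustrative, and the `scipy.stats.linregress` call at the end is an optional cross-check for the slope only.

```python
import numpy as np
from scipy import stats

# Teaching method (0 = old, 1 = new) and test scores from the data table
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones_like(x), x])

n, k = X.shape
df = n - k                            # 12 - 2 = 10

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ y                 # [b0, b1]
residuals = y - X @ B
SE = np.sqrt(np.diag(residuals @ residuals / df * XtX_inv))

# t-statistic = estimate / standard error, computed element-wise
t_stats = B / SE

# Two-tailed p-value: twice the upper-tail probability of |t|
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

for name, b, se, t, pv in zip(["b0", "b1"], B, SE, t_stats, p_values):
    print(f"{name}: estimate={b:.2f}, SE={se:.2f}, t={t:.2f}, df={df}, p={pv:.3g}")

# Optional cross-check of the slope with SciPy's simple linear regression
slope_check = stats.linregress(x, y)
print("linregress slope, SE, p:", slope_check.slope, slope_check.stderr, slope_check.pvalue)
```

The printed values should match the table up to rounding, and the slope's standard error and p-value from `linregress` should agree with the matrix-based calculation.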

## Further Reading

The following are three references on this topic.

- OLS Summary: P-values and Confidence Intervals
- Understanding the Standard Error of a Regression Slope
- How to calculate p-value for multivariate linear regression
