This tutorial shows how to calculate p-values for linear regression coefficients. It includes the formulas and a worked data example in Python.
Formulas for p-value in Linear Regression
We can estimate the vector of regression coefficients B using the following formula.
\[B =(X^TX)^{-1}X^TY\]
Where,
\[ B = \begin{bmatrix} b_0\\ b_1\\ b_2\end{bmatrix}, \quad X= \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}, \quad Y= \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{n} \end{bmatrix} \]
This calculation yields the regression coefficients but not their p-values. To get a p-value, you first need the t-statistic, which is the ratio of an estimated coefficient to its standard error (SE).
\( t = \frac{B}{SE}\)
where,
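\( SE= \sqrt{\operatorname{diag}\left( \frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{df} (X^TX)^{-1} \right)} \)
Here \( \hat{Y}_i \) is the predicted value for observation i, and df = n − k, where k is the number of estimated coefficients including the intercept. The expression inside the square root is the covariance matrix of the coefficients. Its diagonal holds the variances, so SE contains one standard error per coefficient, including the intercept. If you only write the SE for the slope in simple linear regression (namely only one slope, b1), it reduces to the following form, which is how other tutorials that cover only the slope present it.
\( SE_{b_1} = \sqrt{ \frac{1}{df}\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n (X_i-\bar{X})^2} } \)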
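As a quick check that the matrix form and the slope-only form agree, the sketch below computes both with NumPy. It uses the teaching-method data from the example later in this tutorial; the variable names (`se_matrix`, `se_slope`, `sigma2`) are illustrative and not part of the original code.

```python
import numpy as np

# Data from this tutorial's example: teaching method (0/1) and test score.
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

n = len(y)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
df = n - X.shape[1]                   # n minus the number of coefficients

# Ordinary least squares fit and residual variance
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
residuals = y - X @ b
sigma2 = np.sum(residuals ** 2) / df

# Matrix form: square roots of the diagonal of sigma2 * (X^T X)^(-1)
se_matrix = np.sqrt(sigma2 * np.diag(XtX_inv))

# Scalar form for the slope only
se_slope = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

print("SE from matrix form (intercept, slope):", se_matrix)
print("SE for slope from scalar form:", se_slope)
```

The second element of `se_matrix` should match `se_slope` up to floating-point rounding.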
Data Example for Calculating P-value for Linear Regression Coefficients
Suppose we are interested in examining the effect of a teaching method (old method = 0 vs. new method = 1) on students’ test scores.
Student ID | Teaching Method | Test Score |
---|---|---|
1 | 0 | 90 |
2 | 0 | 90 |
3 | 1 | 100 |
4 | 0 | 70 |
5 | 1 | 90 |
6 | 1 | 95 |
7 | 1 | 75 |
8 | 1 | 75 |
9 | 0 | 60 |
10 | 0 | 65 |
11 | 0 | 65 |
12 | 1 | 70 |
Thus, the regression model is as follows.
Y=b0+b1X
Replacing X and Y with actual variable names,
Test Score=b0+b1 Teaching Method
We can run the following Python code to calculate b0 (the intercept) and b1 (the slope).
# Generate the X matrix
import numpy as np
X_rawdata = np.array([np.ones(12),[0,0,1,0,1,1,1,1,0,0,0,1]])
X_matrix=X_rawdata.T
# Print out X
print("X Matrix:\n", X_matrix)
# Generate the Y vector
Y_rawdata = np.array([[90,90,100,70,90,95,75,75,60,65,65,70]])
Y_vector=Y_rawdata.T
# Print out Y
print("Y Vector:\n",Y_vector)
# calculates X^T
X_matrix_T=X_matrix.transpose()
# calculates X^T X
X_T_X=np.matmul(X_matrix_T,X_matrix)
# calculates (X^T X)^(-1)
X_T_X_Inv=np.linalg.inv(X_T_X)
# calculates (X^T X)^(-1) X^T Y
B_bar=X_T_X_Inv@X_matrix_T@Y_vector
# print out Estimated B, namely B_bar
print("B_bar:\n",B_bar)
The following is the output from the Python code above. In particular, we can see that b0 is 73.33 and b1 is 10.83, which are consistent with the results from another tutorial about the linear mixed effect model.
X Matrix:
 [[1. 0.]
 [1. 0.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 1.]]
Y Vector:
 [[ 90]
 [ 90]
 [100]
 [ 70]
 [ 90]
 [ 95]
 [ 75]
 [ 75]
 [ 60]
 [ 65]
 [ 65]
 [ 70]]
B_bar:
 [[73.33333333]
 [10.83333333]]
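The coefficients can also be cross-checked without forming the inverse explicitly. The sketch below is one possible check, not the tutorial's own method: it uses `np.linalg.lstsq` and compares the result with the group means. With a single 0/1 predictor, b0 equals the mean score of the group coded 0 and b1 equals the difference between the two group means. The names `mean_old` and `mean_new` are illustrative.

```python
import numpy as np

# Same data as in the tutorial's example.
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones(len(x)), x])

# Least-squares solution; rcond=None selects NumPy's current default cutoff.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means for the old (0) and new (1) teaching methods
mean_old = y[x == 0].mean()
mean_new = y[x == 1].mean()

print("lstsq coefficients (b0, b1):", b)
print("Mean score, old method:", mean_old)
print("Difference in means (new - old):", mean_new - mean_old)
```

Solving the least-squares problem directly is generally more numerically stable than inverting X^T X, although for a small, well-conditioned design matrix like this one the two approaches give the same coefficients.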
We can then calculate the p-value based on the formulas stated above. The following is the Python code.
# Calculate predicted y
Y_predicted = np.matmul(X_matrix,B_bar)
print("Y Predicted:\n", Y_predicted)
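# Degrees of freedom: n minus the number of coefficients (12 - 2 = 10)
df = X_matrix.shape[0] - X_matrix.shape[1]
# Residual variance: sum of squared residuals divided by df
residual_variance = np.sum((Y_vector - Y_predicted) ** 2) / df
# SE: square roots of the diagonal of the coefficient covariance matrix
SE = np.sqrt(residual_variance * np.diag(X_T_X_Inv))
print("SE for regression coefficients: \n", SE)
The following is the output. Only the diagonal of the covariance matrix is used, because the off-diagonal entries are covariances (which can be negative) rather than variances. We can see that the Standard Error (SE) for b0 (intercept) is 5.25, whereas the SE for b1 (slope) is 7.43.
Y Predicted:
 [[73.33333333]
 [73.33333333]
 [84.16666667]
 [73.33333333]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [73.33333333]
 [73.33333333]
 [73.33333333]
 [84.16666667]]
SE for regression coefficients: 
 [5.25066133 7.42555647]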
Based on these values, we can calculate the t-statistic for both the intercept and the slope.
\( t_{b0}= \frac{73.33}{5.25} =13.97, \quad t_{b1}= \frac{10.83}{7.43} =1.46\)
The two-sided p-values are determined by the t-statistic and the degrees of freedom (df = 12 − 2 = 10), and you can obtain them from any t-distribution calculator. The following table summarizes all of the numbers.
Coefficient | Estimate | SE | t | df | p-value |
---|---|---|---|---|---|
b0 | 73.33 | 5.25 | 13.97 | 10 | <.00001 |
b1 | 10.83 | 7.43 | 1.46 | 10 | 0.175 |
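The p-values in the table can also be computed in Python instead of looked up. The sketch below assumes SciPy is installed and uses the survival function of the t-distribution, `scipy.stats.t.sf`, doubled for a two-sided test. As an independent cross-check on the slope, it also calls `scipy.stats.linregress`. Variable names such as `t_stats` and `p_values` are illustrative.

```python
import numpy as np
from scipy import stats

# Same data as in the tutorial's example.
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

n = len(y)
X = np.column_stack([np.ones(n), x])
df = n - X.shape[1]

# Coefficients and their standard errors
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2 = np.sum((y - X @ b) ** 2) / df
se = np.sqrt(sigma2 * np.diag(XtX_inv))

# t-statistics and two-sided p-values from the t-distribution with df degrees of freedom
t_stats = b / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

for name, est, s, t, p in zip(["b0", "b1"], b, se, t_stats, p_values):
    print(f"{name}: estimate={est:.2f}, SE={s:.2f}, t={t:.2f}, df={df}, p={p:.3g}")

# Cross-check for the slope using SciPy's simple linear regression helper
check = stats.linregress(x, y)
print("linregress slope, SE, p-value:", check.slope, check.stderr, check.pvalue)
```

The slope's p-value from `linregress` should agree with the manually computed one, since both use a two-sided t-test with n − 2 degrees of freedom.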
Further Reading
The following are three references on this topic.
- OLS Summary: P-values and Confidence Intervals
- Understanding the Standard Error of a Regression Slope
- How to calculate p-value for multivariate linear regression