Calculate p-value in Linear Regression

This tutorial shows how to calculate the p-value for the coefficients of a linear regression. It includes the formulas and a data example in Python.

Formulas for p-value in Linear Regression

We can estimate the vector of regression coefficients B using the following formula.

\[B =(X^TX)^{-1}X^TY\]

Where,

\[ B = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \]

This calculation yields only the regression coefficients, not their p-values. To calculate a p-value, you first need the t-statistic, which is the ratio of an estimated coefficient to its standard error (SE).

\( t = \frac{B}{SE}\)

where,

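\( SE= \sqrt{\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{df} (X^TX)^{-1}} \)

Here \( \hat{Y}_i \) is the value predicted by the regression for observation i, and df is the residual degrees of freedom, namely the number of observations minus the number of estimated coefficients.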

Note that the expression under the square root is a matrix, and the square root is taken element by element. The standard errors of the coefficients, including the intercept, are the diagonal entries of the result. If you write the SE only for the slope in simple linear regression (namely only one slope, b1), it looks as follows.

\( SE_{b_1} = \sqrt{ \frac{1}{df}\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n (X_i-\bar{X})^2} } \)
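
To see why the slope-only formula agrees with the matrix formula, here is a short derivation for simple linear regression, where X has a column of ones and a single predictor column. The symbol \( \bar{X} \) denotes the mean of the predictor values.

\[ X^TX = \begin{bmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{bmatrix}, \quad (X^TX)^{-1} = \frac{1}{n\sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i\right)^2} \begin{bmatrix} \sum_{i=1}^n X_i^2 & -\sum_{i=1}^n X_i \\ -\sum_{i=1}^n X_i & n \end{bmatrix} \]

\[ \left[(X^TX)^{-1}\right]_{22} = \frac{n}{n\sum_{i=1}^n X_i^2 - n^2\bar{X}^2} = \frac{1}{\sum_{i=1}^n X_i^2 - n\bar{X}^2} = \frac{1}{\sum_{i=1}^n (X_i-\bar{X})^2} \]

The lower-right diagonal entry is therefore exactly the denominator term that appears in the slope-only formula.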

Data Example for Calculating P-value for Linear Regression Coefficients

Suppose we are interested in examining the effect of a teaching method (old method = 0 vs. new method = 1) on students’ test scores.

Student ID   Teaching Method   Test Score
1            0                 90
2            0                 90
3            1                 100
4            0                 70
5            1                 90
6            1                 95
7            1                 75
8            1                 75
9            0                 60
10           0                 65
11           0                 65
12           1                 70
Data Example for Calculating P-value in linear regression

Thus, the regression model is as follows.

Y=b0+b1X

Replacing X and Y with actual variable names,

Test Score=b0+b1 Teaching Method
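
Because Teaching Method is a 0/1 indicator, b0 in this model is the mean score under the old method and b1 is the difference between the two group means. The following is a minimal sketch of that interpretation using the data from the table above; the variable names are illustrative and are not part of the tutorial's own code.

```python
import numpy as np

# Data from the table above: 0 = old method, 1 = new method
method = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1])
score = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70])

# Mean test score within each teaching-method group
mean_old = score[method == 0].mean()
mean_new = score[method == 1].mean()

# With a single 0/1 predictor, the least-squares intercept equals the
# mean of the group coded 0, and the slope equals the difference in means
b0_from_means = mean_old
b1_from_means = mean_new - mean_old
print("b0 from group means:", b0_from_means)
print("b1 from group means:", b1_from_means)
```

These two numbers give a quick sanity check for the coefficients produced by the matrix formula.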

We can run the following Python code to calculate b0 (the intercept) and b1 (the slope).

# Generate the X matrix
import numpy as np
X_rawdata = np.array([np.ones(12),[0,0,1,0,1,1,1,1,0,0,0,1]])
X_matrix=X_rawdata.T
# Print out X
print("X Matrix:\n", X_matrix)

# Generate the Y vector
Y_rawdata = np.array([[90,90,100,70,90,95,75,75,60,65,65,70]])
Y_vector=Y_rawdata.T
# Print out Y
print("Y Vector:\n",Y_vector)

# calculates X^T
X_matrix_T=X_matrix.transpose()
# calculates X^T X
X_T_X=np.matmul(X_matrix_T,X_matrix)
# calculates (X^T X)^(-1)
X_T_X_Inv=np.linalg.inv(X_T_X)
# calculates (X^T X)^(-1) X^T Y
B_bar=X_T_X_Inv@X_matrix_T@Y_vector
# print out Estimated B, namely B_bar
print("B_bar:\n",B_bar)

The following is the output from the Python code above. In particular, we can see that b0 is 73.33 and b1 is 10.83, which are consistent with the results from another tutorial about the linear mixed effect model.

X Matrix:
 [[1. 0.]
 [1. 0.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 1.]]

Y Vector:
 [[ 90]
 [ 90]
 [100]
 [ 70]
 [ 90]
 [ 95]
 [ 75]
 [ 75]
 [ 60]
 [ 65]
 [ 65]
 [ 70]]

B_bar:
 [[73.33333333]
 [10.83333333]]
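
Explicitly inverting X^T X mirrors the formula, but the same least-squares problem can also be handed to NumPy's solver as a cross-check. The sketch below re-enters the data so that it runs on its own; the names `B_normal` and `B_lstsq` are illustrative.

```python
import numpy as np

x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

# Design matrix: a column of ones (intercept) and the predictor
X = np.column_stack([np.ones_like(x), x])

# Normal-equation solution, as in B = (X^T X)^(-1) X^T Y
B_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; it does not form the inverse explicitly
B_lstsq, residual_ss, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("Normal equations:", B_normal)
print("lstsq:           ", B_lstsq)
print("Agree:", np.allclose(B_normal, B_lstsq))
```

For a small, well-conditioned problem like this one the two approaches agree; the solver is generally the more numerically robust choice when predictors are nearly collinear.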

We can then calculate the p-value based on the formulas stated above. The following is the Python code.

# Calculate predicted y
Y_predicted = np.matmul(X_matrix,B_bar)
print("Y Predicted:\n", Y_predicted)

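# Calculate SE for regression coefficients
# df = 12 observations - 2 coefficients = 10
df = X_matrix.shape[0] - X_matrix.shape[1]
# Element-wise sqrt; negative off-diagonal entries become nan
SE = np.sqrt((np.sum((Y_vector - Y_predicted) ** 2) / df) * X_T_X_Inv)
print("SE for regression coefficients: \n", SE)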

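The following is the output. The standard errors are on the diagonal of the SE matrix: the SE for b0 (intercept) is 5.25, whereas the SE for b1 (slope) is 7.43. The off-diagonal entries are nan because the corresponding entries under the square root are negative.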

Y Predicted:
 [[73.33333333]
 [73.33333333]
 [84.16666667]
 [73.33333333]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [73.33333333]
 [73.33333333]
 [73.33333333]
 [84.16666667]]

SE for regression coefficients: 
 [[5.25066133        nan]
 [       nan 7.42555647]]
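
The nan entries can be avoided by taking the diagonal of the matrix before the square root, so that only the coefficient variances are square-rooted. The following is a self-contained sketch of that variant; the names `cov_B` and `sigma2` are illustrative.

```python
import numpy as np

x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones_like(x), x])

n, k = X.shape                      # 12 observations, 2 coefficients
XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ y               # estimated coefficients

residuals = y - X @ B
df = n - k                          # residual degrees of freedom
sigma2 = residuals @ residuals / df # residual variance estimate

cov_B = sigma2 * XtX_inv            # covariance matrix of the estimates
SE = np.sqrt(np.diag(cov_B))        # standard errors: sqrt of the diagonal only
print("SE for b0 and b1:", SE)
```

The result is a vector with one standard error per coefficient, which can be divided into the coefficient vector directly to obtain the t-statistics.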

Based on that, we can calculate the t-statistic for both the intercept and the slope by dividing each estimate by its SE (see below).

\( t_{b_0}= \frac{73.33}{5.25} =13.97, \quad t_{b_1}= \frac{10.83}{7.43} =1.46\)

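Note that the p-values are determined by the t-statistic and the degrees of freedom (df), which here is 12 observations minus 2 estimated coefficients, namely 10. You can obtain them from any t-distribution calculator. The table below summarizes all of the numbers.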
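
The p-values can also be computed in Python instead of being looked up. The sketch below assumes SciPy is installed (it is not used elsewhere in this tutorial) and that a two-sided test is wanted; it takes the rounded t-statistics and the df from the calculation above.

```python
import numpy as np
from scipy import stats

# t-statistics for b0 and b1, and the residual degrees of freedom
t_stats = np.array([13.97, 1.46])
df = 10

# Two-sided p-value: twice the upper-tail probability of |t|
# under a Student t distribution with df degrees of freedom
p_values = 2 * stats.t.sf(np.abs(t_stats), df)
print("p-value for b0:", p_values[0])
print("p-value for b1:", p_values[1])
```

The survival function `sf` is used rather than `1 - cdf` because it stays accurate for very small tail probabilities such as the one for the intercept.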

Coefficient   Estimate   SE     t       df   p-value
b0            73.33      5.25   13.97   10   <.00001
b1            10.83      7.43   1.46    10   0.175
Estimate, SE, t-statistic, df, and p-value for linear regression
