Calculate p-value in Linear Regression

This tutorial shows how to calculate p-values for linear regression coefficients. It includes the formulas and a data example in Python.

Formulas for p-value in Linear Regression

We can estimate the regression coefficient B using the following formula.

\[B =(X^TX)^{-1}X^TY\]

Where,

\[ B = \begin{bmatrix} b_0\\ b_1\\ b_2\end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \]

This calculation yields only the regression coefficients, not their p-values. To get a p-value, you first need the t-statistic, which is the ratio of an estimated coefficient to its standard error (SE).

\( t = \frac{B}{SE}\)

where,

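\( SE= \sqrt{\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{df} (X^TX)^{-1}} \)

Here \( \hat{Y}_i \) is the predicted value for observation i, and df is the residual degrees of freedom: the number of observations minus the number of estimated coefficients (including the intercept). Note that the expression under the square root is a matrix, and the square root is taken element by element. The diagonal entries of the result are the SEs of the coefficients, including the intercept; the off-diagonal entries are not standard errors. In simple linear regression (namely only one slope, b₁), the SE for the slope alone can be written as follows.

\( SE_{b_1} = \sqrt{ \frac{1}{df}\frac{\sum_{i=1}^n (Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n (X_i-\bar{X})^2} } \)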
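As a quick check that the slope-only formula agrees with the matrix formula, the following is a minimal sketch using the teaching-method data from the example in the next section. The variable names (`se_matrix`, `se_slope`, and so on) are illustrative choices, not part of the original tutorial code.

```python
import numpy as np

# Teaching-method data from the example in this tutorial
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
df = n - p  # 12 observations - 2 coefficients = 10

# Coefficients, fitted values, and sum of squared residuals
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)

# Matrix formula: square root of the diagonal of (SSE / df) * (X^T X)^(-1)
se_matrix = np.sqrt(np.diag(sse / df * XtX_inv))

# Slope-only formula: sqrt((SSE / df) / sum((x - mean(x))^2))
se_slope = np.sqrt((sse / df) / np.sum((x - x.mean()) ** 2))

print("SE of slope (matrix formula):", se_matrix[1])
print("SE of slope (slope-only formula):", se_slope)
```

Both print statements should show the same value, about 7.43.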

Data Example for Calculating P-value for Linear Regression Coefficients

Suppose we are interested in examining the effect of a teaching method (old method = 0 vs. new method = 1) on students’ test scores.

Student ID	Teaching Method	Test Score
1	0	90
2	0	90
3	1	100
4	0	70
5	1	90
6	1	95
7	1	75
8	1	75
9	0	60
10	0	65
11	0	65
12	1	70


Thus, the regression model is as follows.

Y=b₀+b₁X

Replacing X and Y with actual variable names,

Test Score=b₀+b₁ Teaching Method

We can run the following Python code to calculate b₀ (the intercept) and b₁ (the slope).

# Generate the X matrix
import numpy as np
X_rawdata = np.array([np.ones(12),[0,0,1,0,1,1,1,1,0,0,0,1]])
X_matrix=X_rawdata.T
# Print out X
print("X Matrix:\n", X_matrix)

# Generate the Y vector
Y_rawdata = np.array([[90,90,100,70,90,95,75,75,60,65,65,70]])
Y_vector=Y_rawdata.T
# Print out Y
print("Y Vector:\n",Y_vector)

# calculates X^T
X_matrix_T=X_matrix.transpose()
# calculates X^T X
X_T_X=np.matmul(X_matrix_T,X_matrix)
# calculates (X^T X)^(-1)
X_T_X_Inv=np.linalg.inv(X_T_X)
# calculates (X^T X)^(-1) X^T Y
B_bar=X_T_X_Inv@X_matrix_T@Y_vector
# print out Estimated B, namely B_bar
print("B_bar:\n",B_bar)

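The following is the output from the Python code above. In particular, b₀ is 73.33 and b₁ is 10.83, which is consistent with the results from another tutorial about the linear mixed effect model. Because Teaching Method is coded 0/1, b₀ is the mean test score under the old method, and b₁ is the difference between the mean scores under the new and old methods.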

X Matrix:
 [[1. 0.]
 [1. 0.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 1.]]

Y Vector:
 [[ 90]
 [ 90]
 [100]
 [ 70]
 [ 90]
 [ 95]
 [ 75]
 [ 75]
 [ 60]
 [ 65]
 [ 65]
 [ 70]]

B_bar:
 [[73.33333333]
 [10.83333333]]
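As a cross-check on the coefficients, here is a small sketch that solves the same least-squares problem with `np.linalg.lstsq`, which avoids explicitly inverting XᵀX, and compares the result with the group means. This is an alternative route to the same numbers, not the method used in the code above; the variable names are illustrative.

```python
import numpy as np

# Teaching-method data from the example above
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Least-squares solve; lstsq returns (solution, residual sum of squares, rank, singular values)
b, residual_ss, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# With a 0/1 predictor, b0 is the mean of group 0 and b1 is the difference in group means
mean_old = y[x == 0].mean()
mean_new = y[x == 1].mean()

print("b0, b1 from lstsq:", b)
print("mean (old method):", mean_old)
print("mean (new) - mean (old):", mean_new - mean_old)
```

The coefficients should match B_bar above, and the group means should reproduce b₀ and b₁.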

We can then work toward the p-value based on the formulas stated above, starting with the SE. The following is the Python code.

# Calculate predicted y
Y_predicted = np.matmul(X_matrix,B_bar)
print("Y Predicted:\n", Y_predicted)

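# Calculate SE for regression coefficients
# Residual degrees of freedom: 12 observations - 2 coefficients = 10
df = X_matrix.shape[0] - X_matrix.shape[1]
# Residual variance: sum of squared residuals divided by df
residual_variance = np.sum((Y_vector - Y_predicted) ** 2) / df
# Element-wise square root; the diagonal holds the SEs, and negative
# off-diagonal entries become nan (NumPy also emits a RuntimeWarning)
SE = np.sqrt(residual_variance * X_T_X_Inv)
print("SE for regression coefficients: \n", SE)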

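The following is the output. The SEs are on the diagonal: the SE for b₀ (intercept) is 5.25, whereas the SE for b₁ (slope) is 7.43. The off-diagonal entries are nan because the corresponding elements of (XᵀX)⁻¹ are negative here, so their square roots are undefined; they are not standard errors and can be ignored.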

Y Predicted:
 [[73.33333333]
 [73.33333333]
 [84.16666667]
 [73.33333333]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [84.16666667]
 [73.33333333]
 [73.33333333]
 [73.33333333]
 [84.16666667]]

SE for regression coefficients: 
 [[5.25066133        nan]
 [       nan 7.42555647]]
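To avoid the nan entries, one option is to build the coefficient covariance matrix first and take the square root of its diagonal only. The following is a minimal sketch of that idea; `cov_b` and `se` are illustrative names, not part of the original code.

```python
import numpy as np

# Teaching-method data from the example above
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Coefficients and residuals
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
residuals = y - X @ b
df = X.shape[0] - X.shape[1]

# Covariance matrix of the coefficients: residual variance times (X^T X)^(-1)
cov_b = (residuals @ residuals / df) * XtX_inv
print("Covariance matrix of coefficients:\n", cov_b)

# Standard errors are the square roots of the diagonal entries only
se = np.sqrt(np.diag(cov_b))
print("SE for b0 and b1:", se)
```

The off-diagonal covariance is negative in this example, which is why the element-wise square root of the full matrix produced nan.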

Based on that, we can calculate the t-statistic for both the intercept and the slope.

\( t_{b_0}= \frac{73.33}{5.25} =13.97, \quad t_{b_1}= \frac{10.83}{7.43} =1.46\)

The p-values are determined by the t-statistic and the degrees of freedom (df). Here df = 12 - 2 = 10, and the p-values are two-tailed; you can calculate them with an online t-distribution calculator. The following table summarizes all of the numbers.

Coefficient	Estimate	SE	t	df	p-value
b₀	73.33	5.25	13.97	10	<.00001
b₁	10.83	7.43	1.46	10	0.175

Estimate, SE, t-statistic, df, and p-value for linear regression
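The p-values can also be computed directly in Python instead of using an online calculator. The following is a sketch that assumes SciPy is installed (it is not used elsewhere in this tutorial) and uses the survival function of the t distribution, `scipy.stats.t.sf`, doubled for a two-tailed test.

```python
import numpy as np
from scipy import stats  # assumption: SciPy is available

# Teaching-method data from the example above
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=float)
y = np.array([90, 90, 100, 70, 90, 95, 75, 75, 60, 65, 65, 70], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Coefficients, residual variance, and standard errors
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
residuals = y - X @ b
df = X.shape[0] - X.shape[1]
se = np.sqrt(np.diag((residuals @ residuals / df) * XtX_inv))

# t-statistics and two-tailed p-values from the t distribution with df degrees of freedom
t_stat = b / se
p_value = 2 * stats.t.sf(np.abs(t_stat), df)

for name, est, s, t, pv in zip(["b0", "b1"], b, se, t_stat, p_value):
    print(f"{name}: estimate={est:.2f}, SE={s:.2f}, t={t:.2f}, df={df}, p={pv:.3g}")
```

The printed values should agree with the table above up to rounding.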
