Use sklearn to Calculate SSR in Python

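This tutorial shows how to use sklearn to calculate SSR, the Sum of Squared Residuals. SSR is also known as the residual sum of squares (RSS) or the sum of squared errors (SSE). For \(n\) observations, it adds up the squared differences between each predicted value \(\hat{y}_i\) and the corresponding observed value \(y_i\):

\[ SSR=\sum_{i=1}^{n} (\hat{y}_i-y_i)^2 \]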
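
To make the formula concrete, here is a minimal sketch that applies it to three made-up observations. The arrays `y_observed` and `y_predicted` are hypothetical values chosen for illustration and have nothing to do with the penguins data used later.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_observed = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.5, 8.0])

# Residuals are predicted minus observed: -0.5, 0.5, 1.0
residuals = y_predicted - y_observed

# Square each residual and add them up: 0.25 + 0.25 + 1.0
ssr = np.sum(residuals ** 2)
print(ssr)  # 1.5
```

Because the residuals are squared, the sign convention (predicted minus observed, or the reverse) does not change the result.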

Steps of Using sklearn to Calculate SSR in Python

Step 1: Prepare data

We will use the penguins example dataset, which can be loaded through seaborn.

import seaborn as sns
penguins = sns.load_dataset("penguins")

# drop rows with missing values first
penguins = penguins.dropna()

# dummy code the sex column: 1 for Male, 0 for Female
penguins['sex_dummy'] = penguins['sex'].apply(lambda x: 1 if x == 'Male' else 0)

# print out the final data
print(penguins)

Output:

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
4    Adelie  Torgersen            36.7           19.3              193.0   
5    Adelie  Torgersen            39.3           20.6              190.0   
..      ...        ...             ...            ...                ...   
338  Gentoo     Biscoe            47.2           13.7              214.0   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  sex_dummy  
0         3750.0    Male          1  
1         3800.0  Female          0  
2         3250.0  Female          0  
4         3450.0  Female          0  
5         3650.0    Male          1  
..           ...     ...        ...  
338       4925.0  Female          0  
340       4850.0  Female          0  
341       5750.0    Male          1  
342       5200.0  Female          0  
343       5400.0    Male          1  
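
A note on the dummy coding: a rule of the form `1 if x == 'Male' else 0` assigns 0 to anything that is not `'Male'`, including a missing value, so it relies on rows with a missing sex being dropped. As a possible alternative, the sketch below uses an explicit mapping so that missing values stay missing until `dropna()` removes them. The small `toy` DataFrame is hypothetical and only stands in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value, for illustration only
toy = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Female"]})

# An explicit mapping leaves values that are not in the dict as NaN
toy["sex_dummy"] = toy["sex"].map({"Male": 1, "Female": 0})

# The row with the missing sex is removed here, not silently coded as 0
toy_clean = toy.dropna()
print(toy_clean)
```

With this approach the order of dummy coding and dropping no longer matters.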

Step 2: Determine IVs and DV

The independent variables (IVs) are flipper length and sex, and the dependent variable (DV) is the penguin's body mass.

Body Mass = b0 + b1 × flipper length + b2 × sex

# select the columns 'flipper_length_mm' and 'sex_dummy' as the IVs
IVs = penguins[['flipper_length_mm','sex_dummy']]

# select the column of 'body_mass_g' as the DV
DV = penguins['body_mass_g']

Step 3: Apply LinearRegression() from sklearn

# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

# fit the model and save the fitted estimator to result
result = lm.fit(IVs, DV)

print("Result is as follows:")
print("Intercept:\n",result.intercept_)
print("Regression Coefficients:\n", result.coef_)

Output:

Result is as follows:
Intercept:
 -5410.300224143296
Regression Coefficients:
 [ 46.98217525 347.85025373]
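
To see what these numbers mean, the sketch below plugs them into the regression equation from Step 2. The coefficients are copied from the output above; the helper `predict_body_mass` and the 200 mm flipper length are assumptions made for illustration.

```python
# Coefficients copied from the printed output above
intercept = -5410.300224143296
b_flipper = 46.98217525   # grams per mm of flipper length
b_sex = 347.85025373      # difference in grams for sex_dummy = 1 (Male)

def predict_body_mass(flipper_length_mm, sex_dummy):
    """Hypothetical helper: body mass = b0 + b1*flipper length + b2*sex."""
    return intercept + b_flipper * flipper_length_mm + b_sex * sex_dummy

# Predicted body mass for a 200 mm flipper, male versus female
male_200 = predict_body_mass(200.0, 1)
female_200 = predict_body_mass(200.0, 0)
print(round(male_200, 2), round(female_200, 2))  # 4333.99 3986.13
```

At the same flipper length, the two predictions differ by exactly the sex coefficient, about 347.85 g.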

Step 4: Calculate SSR

# Use the result to calculate the estimated Y values
Y_estimated = result.predict(IVs)

# combine observed and estimated Y into the same dataframe
df = pd.DataFrame({'Observed': DV, 'Estimated':Y_estimated})

# calculate SSR
print('SSR :\n ', np.sum(np.square(df['Estimated'] - df['Observed'])))

Output:

SSR :
  41795373.64945871

Thus, the Sum of Squared Residuals (SSR) is 41795373.65.
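
One way to sanity-check a hand-computed SSR is to compare it with sklearn's `mean_squared_error`, since the mean squared error is SSR divided by the number of observations. The sketch below does this on synthetic data generated for illustration; it does not use the penguins dataset, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data, for illustration only (not the penguins dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Fit the model and get the estimated values
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# SSR computed directly from the residuals
ssr_manual = np.sum((y_hat - y) ** 2)

# SSR recovered from the mean squared error: MSE * n
ssr_from_mse = mean_squared_error(y, y_hat) * len(y)

print(ssr_manual, ssr_from_mse)
```

The two printed values should agree up to floating-point rounding.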

