Use t-test to Analyze Financial Well-being Data

Introduction

Since we have covered the theoretical basics of the t-test (see the tutorial here), it is worth showing how to apply it to real-world data. We will use the Financial Well-Being Survey data to run an independent-samples t-test in Python and test whether subjective well-being differs by gender. You can download the CSV file using this link.

Data Explanation

We will combine the following three items to form an index of subjective well-being (SWB).

SWB_1: I am satisfied with my life.
SWB_2: I am optimistic about my future.
SWB_3: If I work hard today, I will be more successful in the future.

Before proceeding, we need to clean the data because some responses are missing. In particular, some responses are coded as “Response not written to database” (-4) or “Refused” (-1). The following is the coding of SWB_1 in the survey.

"SWB_1":{
  -4: "Response not written to database",
  -1: "Refused",
  1: "1 Strongly disagree",
  2: "2",
  3: "3",
  4: "4",
  5: "5",
  6: "6",
  7: "7 Strongly agree"},
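One way to handle such sentinel codes is to convert them to NaN and drop the incomplete rows. The sketch below is only an illustration on a small made-up DataFrame (the name toy and its values are assumptions, not survey data); it is an alternative to filtering on >= 1 and gives the same rows when the only invalid codes are -4 and -1.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the survey file (values are made up for illustration).
toy = pd.DataFrame({
    "SWB_1": [6, -1, 7, 4],
    "SWB_2": [5, 6, -4, 4],
    "SWB_3": [7, 6, 5, 3],
})

swb_items = ["SWB_1", "SWB_2", "SWB_3"]
missing_codes = [-4, -1]  # "Response not written to database", "Refused"

# Turn the sentinel codes into NaN, then keep only rows where all three items were answered.
cleaned = toy[swb_items].replace(missing_codes, np.nan).dropna()
print(cleaned)
```

Note that replacing integers with NaN converts the columns to floats, which does not affect sums or means.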

Data Cleaning

The following code checks whether such missing values are present. If so, we need to remove them before conducting the t-test.

import pandas as pd

df = pd.read_csv("./NFWBS_PUF_2016_data.csv")

# Count how often each response code occurs for the three SWB items.
for col in ["SWB_1", "SWB_2", "SWB_3"]:
    print(df[col].value_counts())

Below is the output. As we can see, there are indeed some missing values.

 6    1926
 7    1535
 5    1458
 4     803
 3     335
 1     154
 2     152
-1      30
-4       1
Name: SWB_1, dtype: int64
 6    1846
 7    1642
 5    1399
 4     839
 3     335
 2     144
 1     132
-1      56
-4       1
Name: SWB_2, dtype: int64
 7    1991
 6    1653
 5    1251
 4     862
 3     267
 1     167
 2     138
-1      64
-4       1
Name: SWB_3, dtype: int64

The following code removes those rows and then prints the counts again to check that the removal worked. As the output shows, the negative codes are gone.

# Keep only rows where all three items have a valid response (1 to 7).
# The mask is built once from df, and .copy() avoids a
# SettingWithCopyWarning when a new column is added later.
valid = (df["SWB_1"] >= 1) & (df["SWB_2"] >= 1) & (df["SWB_3"] >= 1)
rslt_df = df.loc[valid].copy()

print("after del")
for col in ["SWB_1", "SWB_2", "SWB_3"]:
    print(rslt_df[col].value_counts())

after del
6    1912
7    1515
5    1450
4     799
3     333
1     153
2     152
Name: SWB_1, dtype: int64
6    1841
7    1632
5    1394
4     838
3     333
2     144
1     132
Name: SWB_2, dtype: int64
7    1984
6    1649
5    1250
4     860
3     266
1     167
2     138
Name: SWB_3, dtype: int64

The following code creates a new column called “Combined_SWB” by summing the three items. It is followed by the output of print(rslt_df["Combined_SWB"]).

column_names = ['SWB_1', 'SWB_2', 'SWB_3']
# Sum the three items row by row to form the SWB index (possible range: 3 to 21).
rslt_df["Combined_SWB"] = rslt_df[column_names].sum(axis=1)
print(rslt_df["Combined_SWB"])

0       16
1       18
2       11
3       18
4       12
        ..
6389    20
6390    21
6391    17
6392    15
6393    14
Name: Combined_SWB, Length: 6314, dtype: int64

We also need to check whether there are missing values in the grouping variable, namely gender (PPGENDER). The following is the coding of gender in the survey.

"PPGENDER":{
  1: "Male",
  2: "Female"}

gender_count = rslt_df["PPGENDER"].value_counts()
print(gender_count)

The following is the output, which shows that there are no missing values.

1    3328
2    2986
Name: PPGENDER, dtype: int64

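The following is the key code for the t-test. Setting equal_var=False runs Welch's t-test, which does not assume that the two groups have equal variances.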

import scipy.stats

# Split the cleaned data by gender (1 = Male, 2 = Female).
data_men = rslt_df[rslt_df['PPGENDER']==1]
data_women = rslt_df[rslt_df['PPGENDER']==2]
print("Men's SWB:")
print(data_men["Combined_SWB"].mean())
print("\n")
print("Women's SWB:")
print(data_women["Combined_SWB"].mean())
print("\n")
print("t-test results:")
ttest_results=scipy.stats.ttest_ind(data_men["Combined_SWB"], data_women["Combined_SWB"], equal_var=False)
print(ttest_results)

The following is the output.

Men's SWB:
16.341346153846153

Women's SWB:
16.248827863362358

t-test results:
Ttest_indResult(statistic=1.0023437227400076, pvalue=0.3162168765314645)
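To make it clearer what ttest_ind computes when equal_var=False, here is a minimal sketch of Welch's t-test written out by hand and checked against SciPy. The helper name welch_t and the synthetic groups are assumptions made for illustration; the numbers are randomly generated, not survey results.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, degrees of freedom, and two-sided p-value (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    va, vb = a.var(ddof=1), b.var(ddof=1)  # sample variances
    se2 = va / na + vb / nb                # squared standard error of the mean difference
    t = (a.mean() - b.mean()) / np.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    p = 2 * stats.t.sf(abs(t), dof)        # two-sided p-value
    return t, dof, p

# Synthetic groups, generated only to compare the two computations.
rng = np.random.default_rng(0)
group_a = rng.normal(16.3, 3.5, size=200)
group_b = rng.normal(16.2, 3.5, size=180)

t_manual, dof, p_manual = welch_t(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed lines should agree up to floating-point error.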

Based on the p-value (p ≈ 0.32, well above the conventional 0.05 threshold), the difference is not statistically significant. The means are also very close to each other, SWB men = 16.34 versus SWB women = 16.25, so the data provide no evidence that men and women differ in subjective well-being. We can also plot the means using a bar chart.

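import matplotlib.pyplot as plt
import seaborn as sns

# barplot shows the mean of Combined_SWB for each gender (1 = Male, 2 = Female).
sns.barplot(x='PPGENDER', y="Combined_SWB", data=rslt_df)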
plt.xlabel('Gender', fontsize=18)
plt.ylabel('SWB', fontsize=18)
plt.show()
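A p-value alone does not say how large a difference is, so it can help to report an effect size alongside the t-test. The sketch below computes Cohen's d with a pooled standard deviation; the function name cohens_d is an assumption for illustration, and the example call uses toy numbers rather than the survey data.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy example: means differ by 1 and both groups have variance 1.
d = cohens_d([1, 2, 3], [2, 3, 4])
print(d)  # -1.0
```

With the data above, the same function could be called as cohens_d(data_men["Combined_SWB"], data_women["Combined_SWB"]).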