Introduction
Since we have covered the theoretical basics of the t-test (see the tutorial here), it is worth showing how to apply it to real-world data. We will use the Financial Well-Being Survey data to run an independent-samples t-test in Python, testing whether subjective well-being differs by gender. You can download the CSV file using this link.

Data Explanation
We will combine the following three items into an index of subjective well-being (SWB).
- SWB_1: I am satisfied with my life.
- SWB_2: I am optimistic about my future.
- SWB_3: If I work hard today, I will be more successful in the future.
Before proceeding, we need to clean the data because some responses are missing: they are coded as “Response not written to database” (-4) or “Refused” (-1). The coding of SWB_1 in the survey is as follows.
"SWB_1":{
-4: "Response not written to database",
-1: "Refused",
1: "1 Strongly disagree",
2: "2",
3: "3",
4: "4",
5: "5",
6: "6",
7: "7 Strongly agree"},
Data Cleaning
The following code checks whether such missing values are present. If so, we need to remove them before conducting the t-test.
import pandas as pd
df=pd.read_csv("./NFWBS_PUF_2016_data.csv")
SWB_1_count=df["SWB_1"].value_counts()
print(SWB_1_count)
SWB_2_count=df["SWB_2"].value_counts()
print(SWB_2_count)
SWB_3_count=df["SWB_3"].value_counts()
print(SWB_3_count)
Below is the output. It shows that each item does contain some missing values (the codes -1 and -4).
 6    1926
 7    1535
 5    1458
 4     803
 3     335
 1     154
 2     152
-1      30
-4       1
Name: SWB_1, dtype: int64
 6    1846
 7    1642
 5    1399
 4     839
 3     335
 2     144
 1     132
-1      56
-4       1
Name: SWB_2, dtype: int64
 7    1991
 6    1653
 5    1251
 4     862
 3     267
 1     167
 2     138
-1      64
-4       1
Name: SWB_3, dtype: int64
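The three separate value_counts() calls can also be condensed into a single check. The sketch below is a minimal illustration on a small made-up DataFrame (the values are hypothetical, not taken from the survey file). It assumes, as in the coding shown above, that -4 and -1 are the only codes that mark a missing response.

```python
import pandas as pd

# Hypothetical toy data standing in for the three survey columns.
toy = pd.DataFrame({
    "SWB_1": [6, -1, 5, 7, -4],
    "SWB_2": [7, 6, -1, 5, 4],
    "SWB_3": [5, 6, 7, -1, 3],
})

swb_columns = ["SWB_1", "SWB_2", "SWB_3"]
missing_codes = [-4, -1]  # assumed to be the only non-substantive codes

# Count, per column, how many responses carry one of the missing codes.
missing_per_column = toy[swb_columns].isin(missing_codes).sum()
print(missing_per_column)
```

Any column with a count above zero needs cleaning before the items are combined.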
The following code removes the rows with missing values and prints the counts again to confirm that the removal worked. As the output shows, the codes -1 and -4 no longer appear.
print("after del")
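# Keep only rows with a valid response (1-7) on all three items.
# Each filter uses rslt_df's own columns; .copy() avoids a SettingWithCopyWarning later.
rslt_df = df.loc[df['SWB_1'] >= 1]
rslt_df = rslt_df.loc[rslt_df['SWB_2'] >= 1]
rslt_df = rslt_df.loc[rslt_df['SWB_3'] >= 1].copy()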
SWB_1_count=rslt_df["SWB_1"].value_counts()
print(SWB_1_count)
SWB_2_count=rslt_df["SWB_2"].value_counts()
print(SWB_2_count)
SWB_3_count=rslt_df["SWB_3"].value_counts()
print(SWB_3_count)
after del
6    1912
7    1515
5    1450
4     799
3     333
1     153
2     152
Name: SWB_1, dtype: int64
6    1841
7    1632
5    1394
4     838
3     333
2     144
1     132
Name: SWB_2, dtype: int64
7    1984
6    1649
5    1250
4     860
3     266
1     167
2     138
Name: SWB_3, dtype: int64
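The three filtering steps can also be written as one row mask. The following is a minimal sketch on hypothetical toy data rather than the survey file: a row is kept only if all three items hold a valid response, which here is assumed to mean any value of 1 or higher.

```python
import pandas as pd

# Hypothetical toy data standing in for the three survey columns.
toy = pd.DataFrame({
    "SWB_1": [6, -1, 5, 7, -4],
    "SWB_2": [7, 6, -1, 5, 4],
    "SWB_3": [5, 6, 7, 3, 3],
})
swb_columns = ["SWB_1", "SWB_2", "SWB_3"]

# True only for rows where every SWB item is a valid response (>= 1).
valid_rows = (toy[swb_columns] >= 1).all(axis=1)

# .copy() returns an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
toy_clean = toy.loc[valid_rows].copy()
print(toy_clean)
```

This form scales to more items without adding a line per column, at the cost of being slightly less explicit than the step-by-step version.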
The following code creates a new column called “Combined_SWB” by summing the three items and then prints it. The output follows the code.
column_names = ['SWB_1', 'SWB_2', 'SWB_3']
# Sum the three items row by row to form the SWB index (possible range: 3 to 21).
rslt_df["Combined_SWB"] = rslt_df[column_names].sum(axis=1)
print(rslt_df["Combined_SWB"])
0 16
1 18
2 11
3 18
4 12
..
6389 20
6390 21
6391 17
6392 15
6393 14
Name: Combined_SWB, Length: 6314, dtype: int64
We also need to check whether there are missing values in the grouping variable, gender. The following is the coding of gender in the survey.
"PPGENDER":{
1: "Male",
2: "Female"}
gender_count=rslt_df["PPGENDER"].value_counts()
print(gender_count)
The following is the output, which shows that there are no missing values.
1    3328
2    2986
Name: PPGENDER, dtype: int64
The following is the key code for the t-test.
import scipy.stats

data_men = rslt_df[rslt_df['PPGENDER']==1]
data_women = rslt_df[rslt_df['PPGENDER']==2]
print("Men's SWB:")
print(data_men["Combined_SWB"].mean())
print("\n")
print("Women's SWB:")
print(data_women["Combined_SWB"].mean())
print("\n")
print("t-test results:")
ttest_results=scipy.stats.ttest_ind(data_men["Combined_SWB"], data_women["Combined_SWB"], equal_var=False)
print(ttest_results)
The following is the output.
Men's SWB:
16.341346153846153

Women's SWB:
16.248827863362358

t-test results:
Ttest_indResult(statistic=1.0023437227400076, pvalue=0.3162168765314645)
With p = 0.316, well above the conventional 0.05 threshold, the difference is not statistically significant. The means are also very close (men: 16.34, women: 16.25), so this sample provides no evidence that men and women differ in subjective well-being. We can also plot the means using a bar chart.
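Because equal_var=False is passed, scipy.stats.ttest_ind runs Welch's t-test, which does not assume equal variances in the two groups. To make that concrete, the sketch below recomputes Welch's statistic by hand and compares it with SciPy's result. The two samples are randomly generated placeholders (hypothetical, not the survey data), and the helper name welch_t is introduced here only for illustration.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Return Welch's t statistic, degrees of freedom, and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Placeholder samples with made-up means and spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=16.3, scale=3.5, size=200)
group_b = rng.normal(loc=16.2, scale=4.0, size=180)

t_manual, dof, p_manual = welch_t(group_a, group_b)
scipy_result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, p_manual)
print(scipy_result.statistic, scipy_result.pvalue)
```

The two printed lines should agree up to floating-point error, which confirms that the statistic is the difference in means divided by the combined standard error of the two group means.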
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x='PPGENDER', y="Combined_SWB", data=rslt_df)
plt.xlabel('Gender', fontsize=18)
plt.ylabel('SWB', fontsize=18)
plt.show()
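By default, sns.barplot draws the mean of Combined_SWB for each gender code, so the x-axis shows 1 and 2 rather than the labels. A sketch of how the same group means could be tabulated with readable labels is given below. It uses hypothetical toy values and the gender coding listed earlier; the column name Gender is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the survey responses.
toy = pd.DataFrame({
    "PPGENDER": [1, 2, 1, 2, 1, 2],
    "Combined_SWB": [16, 18, 12, 15, 20, 15],
})

# Map the numeric codes to the labels from the survey coding.
gender_labels = {1: "Male", 2: "Female"}
toy["Gender"] = toy["PPGENDER"].map(gender_labels)

# Mean, standard deviation, and group size for each gender.
summary = toy.groupby("Gender")["Combined_SWB"].agg(["mean", "std", "count"])
print(summary)
```

Passing the labelled column as x to the plotting call would put the same labels on the chart's axis.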
