This tutorial shows two methods of creating dummy variables in Python. The following shows the key syntax.
Method 1: Use numpy.where() to create a dummy variable
np.where(df['column_of_interest'] == 'value', 1, 0)
Method 2: Use apply() and a lambda function to create a dummy variable
df['column_of_interest'].apply(lambda x: 1 if x == 'value' else 0)
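As a quick illustration of the two syntax lines above, here is a minimal sketch on a small hand-made dataframe. The column name color and the value red are placeholders invented for this example, not part of any real dataset. Both methods should give the same 0/1 column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'color' and 'red' are placeholder names
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Method 1: vectorized comparison with numpy.where()
df["is_red_where"] = np.where(df["color"] == "red", 1, 0)

# Method 2: element-wise test with apply() and a lambda function
df["is_red_apply"] = df["color"].apply(lambda x: 1 if x == "red" else 0)

print(df)
```

numpy.where() evaluates the whole column at once, while apply() calls the lambda once per row, so the first form is usually the faster choice on large dataframes.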
Example 1: Use numpy.where() to create a dummy variable
We can use numpy.where() to dummy-code a column and create a dummy variable in Python.
Step 1: Prepare the data
We are going to use the built-in penguins dataset in Seaborn to illustrate dummy coding.
# import the seaborn and numpy modules (numpy is needed for np.where() in Step 2)
import seaborn as sns
import numpy as np
# load the penguins data and save it as a dataframe called penguins
penguins = sns.load_dataset("penguins")
# print out the dataframe penguins
print(penguins)
Output:
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0
1    Adelie  Torgersen            39.5           17.4              186.0
2    Adelie  Torgersen            40.3           18.0              195.0
3    Adelie  Torgersen             NaN            NaN                NaN
4    Adelie  Torgersen            36.7           19.3              193.0
..      ...        ...             ...            ...                ...
339  Gentoo     Biscoe             NaN            NaN                NaN
340  Gentoo     Biscoe            46.8           14.3              215.0
341  Gentoo     Biscoe            50.4           15.7              222.0
342  Gentoo     Biscoe            45.2           14.8              212.0
343  Gentoo     Biscoe            49.9           16.1              213.0

     body_mass_g     sex
0         3750.0    Male
1         3800.0  Female
2         3250.0  Female
3            NaN     NaN
4         3450.0  Female
..           ...     ...
339          NaN     NaN
340       4850.0  Female
341       5750.0    Male
342       5200.0  Female
343       5400.0    Male

[344 rows x 7 columns]
Step 2: Apply numpy.where()
As the output shows, the sex column holds the strings Male and Female, along with some missing values. We want to create a new column that dummy-codes it. The following Python code uses numpy.where() to code Male as 1 and every other value as 0.
# use np.where() to dummy-code sex and store the result in a new column called sex_dummy
penguins['sex_dummy'] = np.where(penguins['sex'] == 'Male', 1, 0)
# print out the updated dataframe penguins
print(penguins)
Output:
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0
1    Adelie  Torgersen            39.5           17.4              186.0
2    Adelie  Torgersen            40.3           18.0              195.0
3    Adelie  Torgersen             NaN            NaN                NaN
4    Adelie  Torgersen            36.7           19.3              193.0
..      ...        ...             ...            ...                ...
339  Gentoo     Biscoe             NaN            NaN                NaN
340  Gentoo     Biscoe            46.8           14.3              215.0
341  Gentoo     Biscoe            50.4           15.7              222.0
342  Gentoo     Biscoe            45.2           14.8              212.0
343  Gentoo     Biscoe            49.9           16.1              213.0

     body_mass_g     sex  sex_dummy
0         3750.0    Male          1
1         3800.0  Female          0
2         3250.0  Female          0
3            NaN     NaN          0
4         3450.0  Female          0
..           ...     ...        ...
339          NaN     NaN          0
340       4850.0  Female          0
341       5750.0    Male          1
342       5200.0  Female          0
343       5400.0    Male          1

[344 rows x 8 columns]
As we can see, a new dummy column, sex_dummy, has been added to the dataframe.
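In the output above, rows 3 and 339 have a missing sex (NaN) and still receive a 0, because comparing NaN with 'Male' evaluates to False. If missing values should stay missing instead, one possible approach is pandas' Series.map() with a dictionary. The sketch below uses a small hand-made dataframe (a stand-in for the penguins sex column, not the real data) and a hypothetical column name sex_dummy_keep_na.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins 'sex' column
df = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Male"]})

# np.where(): NaN == 'Male' is False, so the missing row is coded 0
df["sex_dummy"] = np.where(df["sex"] == "Male", 1, 0)

# Series.map(): values not found in the dictionary (including NaN) stay NaN
df["sex_dummy_keep_na"] = df["sex"].map({"Male": 1, "Female": 0})

print(df)
```

Because the mapped column contains NaN, pandas stores it as floating point rather than integer. Whether missing values should be coded 0 or left missing depends on the analysis being done.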
Example 2: Use a lambda function to create a dummy variable
We can also use an if/else expression inside a lambda function to create a dummy variable. The following two steps show how to do so.
Step 1: Prepare the data
We are going to use the built-in penguins dataset in Seaborn to illustrate dummy coding. Since this is the exact dataset shown in Example 1, we only reload it here and do not print it again.
# import the seaborn module
import seaborn as sns
# load the penguins data and save it as a dataframe called penguins
penguins = sns.load_dataset("penguins")
Step 2: Apply the lambda function
# use apply() and a lambda function to dummy-code sex: Male becomes 1, everything else 0
penguins['sex_dummy'] = penguins['sex'].apply(lambda x: 1 if x == 'Male' else 0)
# print out the updated dataframe penguins
print(penguins)
Output:
    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0
1    Adelie  Torgersen            39.5           17.4              186.0
2    Adelie  Torgersen            40.3           18.0              195.0
3    Adelie  Torgersen             NaN            NaN                NaN
4    Adelie  Torgersen            36.7           19.3              193.0
..      ...        ...             ...            ...                ...
339  Gentoo     Biscoe             NaN            NaN                NaN
340  Gentoo     Biscoe            46.8           14.3              215.0
341  Gentoo     Biscoe            50.4           15.7              222.0
342  Gentoo     Biscoe            45.2           14.8              212.0
343  Gentoo     Biscoe            49.9           16.1              213.0

     body_mass_g     sex  sex_dummy
0         3750.0    Male          1
1         3800.0  Female          0
2         3250.0  Female          0
3            NaN     NaN          0
4         3450.0  Female          0
..           ...     ...        ...
339          NaN     NaN          0
340       4850.0  Female          0
341       5750.0    Male          1
342       5200.0  Female          0
343       5400.0    Male          1

[344 rows x 8 columns]
Again, the dummy variable sex_dummy has been added to the dataframe.
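Both examples code a column with two categories. The same numpy.where() pattern can be repeated for a column with more than two categories, such as species, by creating one dummy per category and leaving one category out as the reference. The following is only a sketch under assumptions: the toy dataframe is hand-made, the choice of Adelie as the reference level is arbitrary, and the new column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the three species in the penguins dataset
df = pd.DataFrame({"species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"]})

# Adelie is treated as the reference level (an arbitrary choice for this
# sketch), so only the other two species get their own dummy column
for level in ["Chinstrap", "Gentoo"]:
    df["species_" + level] = np.where(df["species"] == level, 1, 0)

print(df)
```

A row with 0 in both new columns belongs to the reference species, Adelie.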