How to Create a Dummy Variable in Python

This tutorial shows two methods of creating dummy variables in Python. The following shows the key syntax.

Method 1: Use numpy.where() to create a dummy variable

np.where(df['column_of_interest'] == 'value', 1, 0)

Method 2: Use apply() and a lambda function to create a dummy variable

df['column_of_interest'].apply(lambda x: 1 if x == 'value' else 0)
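
To make the two syntax lines concrete, here is a minimal sketch on a small made-up dataframe. The column name color, its values, and the new column names are assumptions chosen only for illustration; they are not part of the penguins example used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "color" and its values are made up for illustration
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Method 1: vectorized comparison with numpy.where()
df["is_red_where"] = np.where(df["color"] == "red", 1, 0)

# Method 2: element-wise check with apply() and a lambda function
df["is_red_apply"] = df["color"].apply(lambda x: 1 if x == "red" else 0)

print(df)
```

Both new columns should hold 1 for the rows where color is red and 0 for all other rows.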


Example 1: Use numpy.where() to create a dummy variable

We can use numpy.where() to dummy-code a column and create a dummy variable in Python.

Step 1: Prepare the data

We are going to use the penguins example dataset available through Seaborn to illustrate dummy coding.

# import the seaborn and numpy modules
import seaborn as sns
import numpy as np

# load the penguins data and save it as a dataframe called penguins
penguins = sns.load_dataset("penguins")

# print out the dataframe penguins
print(penguins)

Output:

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    Male  
1         3800.0  Female  
2         3250.0  Female  
3            NaN     NaN  
4         3450.0  Female  
..           ...     ...  
339          NaN     NaN  
340       4850.0  Female  
341       5750.0    Male  
342       5200.0  Female  
343       5400.0    Male  

[344 rows x 7 columns]

Step 2: Apply numpy.where()

As we can see, the sex column stores the strings Male and Female. We want to create a new column that dummy-codes it. The following Python code uses numpy.where() to code Male as 1 and every other value as 0.

# use np.where() to dummy-code sex and create a new column called sex_dummy
penguins['sex_dummy'] = np.where(penguins['sex'] == 'Male', 1, 0)

# print out the updated dataframe penguins
print(penguins)

Output:

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  sex_dummy  
0         3750.0    Male          1  
1         3800.0  Female          0  
2         3250.0  Female          0  
3            NaN     NaN          0  
4         3450.0  Female          0  
..           ...     ...        ...  
339          NaN     NaN          0  
340       4850.0  Female          0  
341       5750.0    Male          1  
342       5200.0  Female          0  
343       5400.0    Male          1  

[344 rows x 8 columns]

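As we can see, a new dummy column called sex_dummy has been added to the dataframe. Note that rows where sex is missing (NaN), such as rows 3 and 339, are also coded as 0, because NaN does not equal 'Male'.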
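
In the output above, rows with a missing sex value end up with 0 in sex_dummy. If you would rather keep those rows as missing, one possible approach is to check for missing values first. The following is only a sketch on a small stand-in dataframe: the toy data and the column name sex_dummy_keep_nan are assumptions for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins "sex" column, with one missing value
toy = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Male"]})

# Plain np.where(): the missing row fails the == 'Male' test and becomes 0
toy["sex_dummy"] = np.where(toy["sex"] == "Male", 1, 0)

# Variant that keeps missing values missing: test isna() first, then compare
toy["sex_dummy_keep_nan"] = np.where(
    toy["sex"].isna(), np.nan, np.where(toy["sex"] == "Male", 1.0, 0.0)
)

print(toy)
```

Because NaN is a floating-point value, the second column is stored as floats (1.0, 0.0, NaN) rather than integers.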


Example 2: Use a lambda function to create a dummy variable

We can also use an if-else expression inside a lambda function to create a dummy variable. The following two steps show how to do so.

Step 1: Prepare the data

We are going to use the same Seaborn penguins dataset to illustrate dummy coding. Since it is the exact dataset shown in Example 1, we will not print it out again here.

# import the seaborn module
import seaborn as sns

# load the penguins data and save it as a dataframe called penguins
penguins = sns.load_dataset("penguins")

Step 2: Apply a lambda function

# use apply() and a lambda function to dummy-code sex as a new column called sex_dummy
penguins['sex_dummy'] = penguins['sex'].apply(lambda x: 1 if x == 'Male' else 0)

# print out the updated dataframe penguins
print(penguins)

Output:

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  sex_dummy  
0         3750.0    Male          1  
1         3800.0  Female          0  
2         3250.0  Female          0  
3            NaN     NaN          0  
4         3450.0  Female          0  
..           ...     ...        ...  
339          NaN     NaN          0  
340       4850.0  Female          0  
341       5750.0    Male          1  
342       5200.0  Female          0  
343       5400.0    Male          1  

[344 rows x 8 columns]

Again, the dummy variable sex_dummy has been added to the dataframe.
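
The outputs of Example 1 and Example 2 look identical. As a quick sanity check, the sketch below compares the two methods on a small made-up dataframe (the toy data and variable names are assumptions for illustration). The practical difference is that numpy.where() evaluates the whole column at once, while apply() calls the lambda function once per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the "sex" column, including a missing value
toy = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Female", "Male"]})

# Method 1: numpy.where() returns an array, wrapped here in a Series for comparison
dummy_where = pd.Series(np.where(toy["sex"] == "Male", 1, 0), index=toy.index)

# Method 2: apply() with a lambda function returns a Series directly
dummy_apply = toy["sex"].apply(lambda x: 1 if x == "Male" else 0)

# True if every row received the same code from both methods
same = bool((dummy_where == dummy_apply).all())
print(same)
```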

