How to Create a Contingency Table in Pandas

Introduction of crosstab() function

You can use the pandas.crosstab() function to create a contingency table. It computes a simple cross tabulation of two (or more) factors. The following is the sample data

# Generate a sample dataset
import pandas as pd
car_data = {'Brand': ['Brand 1', 'Brand 1','Brand 2','Brand 1','Brand 3'], 
     'Location': ['CA', 'CA', 'CA','NY','MA'],
    'Number':[200,20,300,400,500]}
car_data=pd.DataFrame(data=car_data)
print(car_data)
     Brand Location  Number
0  Brand 1       CA     200
1  Brand 1       CA      20
2  Brand 2       CA     300
3  Brand 1       NY     400
4  Brand 3       MA     500

Method 1: Create a contingency (frequency) table in Pandas

By default, pandas.crosstab() computes a frequency table of the factors.

result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'])
print(result)
Location  CA  MA  NY
Brand               
Brand 1    2   0   1
Brand 2    1   0   0
Brand 3    0   1   0

Method 2: Add margin sums to frequency tables

You can add margin sums by setting the parameter of margins=True in crosstab().

result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],margins=True, margins_name="Total")
print(result)
Location  CA  MA  NY  Total
Brand                      
Brand 1    2   0   1      3
Brand 2    1   0   0      1
Brand 3    0   1   0      1
Total      3   1   1      5

Method 3: Use “values” and “aggfunc” in crosstab()

In some situation, you might not want to count the freqncy, but rather like sum or mean of cell values. You can do that by using the values and aggfunc parameters in the crosstab() function.

result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],values=car_data['Number'],aggfunc='sum')
print(result)
Location     CA     MA     NY
Brand                        
Brand 1   220.0    NaN  400.0
Brand 2   300.0    NaN    NaN
Brand 3     NaN  500.0    NaN

Besides sum, you can also use mean for aggfunc in the crosstab() function.

result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],values=car_data['Number'],aggfunc='mean')
print(result)
Location     CA     MA     NY
Brand                        
Brand 1   110.0    NaN  400.0
Brand 2   300.0    NaN    NaN
Brand 3     NaN  500.0    NaN

Other Resource