Introduction of crosstab() function
You can use the pandas.crosstab() function to create a contingency table. It computes a simple cross tabulation of two (or more) factors. The following is the sample data
# Generate a sample dataset
import pandas as pd
car_data = {'Brand': ['Brand 1', 'Brand 1','Brand 2','Brand 1','Brand 3'],
'Location': ['CA', 'CA', 'CA','NY','MA'],
'Number':[200,20,300,400,500]}
car_data=pd.DataFrame(data=car_data)
print(car_data)
Brand Location Number 0 Brand 1 CA 200 1 Brand 1 CA 20 2 Brand 2 CA 300 3 Brand 1 NY 400 4 Brand 3 MA 500
Method 1: Create a contingency (frequency) table in Pandas
By default, pandas.crosstab()
computes a frequency table of the factors.
result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'])
print(result)
Location CA MA NY Brand Brand 1 2 0 1 Brand 2 1 0 0 Brand 3 0 1 0
Method 2: Add margin sums to frequency tables
You can add margin sums by setting the parameter of margins=True
in crosstab()
.
result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],margins=True, margins_name="Total")
print(result)
Location CA MA NY Total Brand Brand 1 2 0 1 3 Brand 2 1 0 0 1 Brand 3 0 1 0 1 Total 3 1 1 5
Method 3: Use “values” and “aggfunc” in crosstab()
In some situation, you might not want to count the freqncy, but rather like sum or mean of cell values. You can do that by using the values
and aggfunc
parameters in the crosstab()
function.
result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],values=car_data['Number'],aggfunc='sum')
print(result)
Location CA MA NY Brand Brand 1 220.0 NaN 400.0 Brand 2 300.0 NaN NaN Brand 3 NaN 500.0 NaN
Besides sum
, you can also use mean
for aggfunc
in the crosstab()
function.
result=pd.crosstab(index=car_data['Brand'], columns=car_data['Location'],values=car_data['Number'],aggfunc='mean')
print(result)
Location CA MA NY Brand Brand 1 110.0 NaN 400.0 Brand 2 300.0 NaN NaN Brand 3 NaN 500.0 NaN