Python Code to Simulate Two Categorical Independent Variables for Linear Regression

The following Python code generates a simulated dataset with two categorical independent variables (Brand and Region) and one continuous dependent variable (Sales_Millions). The dataset can be used for linear regression or two-way ANOVA.

import pandas as pd
import numpy as np

# Set seed for consistency
np.random.seed(42)

# Configuration: target mean and sample size for each Brand x Region cell
stats_config = [
    # (Brand, Region, Target_Mean,  N)
    ('Brand A', 'East Coast', 25, 50),
    ('Brand A', 'West Coast', 20, 50),
    ('Brand B', 'East Coast', 80,  50),
    ('Brand B', 'West Coast', 21,  50)
]

all_data = []

for brand, region, target_mean, n in stats_config:
    # Draw n values from a normal distribution centered on the target mean
    # (standard deviation 1.0, which is NumPy's default scale)
    sales = np.random.normal(loc=target_mean, scale=1.0, size=n)
    
    for val in sales:
        all_data.append({
            'Brand': brand,
            'Region': region,
            'Sales_Millions': round(val, 4)
        })

df = pd.DataFrame(all_data)

# Save to CSV
df.to_csv("brand_sales_data.csv", index=False)

# Display verification
print("--- Resulting Means ---")
print(df.groupby(['Brand', 'Region'])['Sales_Millions'].mean())
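
The target means in stats_config also determine which coefficients a linear regression with a Brand x Region interaction should recover. The sketch below works this out by hand, assuming dummy (treatment) coding with Brand A and East Coast as the reference levels. The function name implied_coefficients and the coefficient labels are illustrative and not part of the original script.

```python
# Target cell means copied from stats_config in the simulation script.
target_means = {
    ("Brand A", "East Coast"): 25,
    ("Brand A", "West Coast"): 20,
    ("Brand B", "East Coast"): 80,
    ("Brand B", "West Coast"): 21,
}


def implied_coefficients(means):
    """Return the dummy-coded coefficients implied by the four cell means.

    Assumes Brand A and East Coast are the reference levels (hypothetical
    choice; statistical software may pick a different reference).
    """
    # Intercept: mean of the reference cell (Brand A, East Coast).
    intercept = means[("Brand A", "East Coast")]
    # Brand effect: Brand B minus Brand A, within the East Coast.
    brand_b = means[("Brand B", "East Coast")] - intercept
    # Region effect: West Coast minus East Coast, within Brand A.
    west_coast = means[("Brand A", "West Coast")] - intercept
    # Interaction: what is left of the fourth cell after the additive effects.
    interaction = (
        means[("Brand B", "West Coast")] - intercept - brand_b - west_coast
    )
    return {
        "Intercept": intercept,
        "Brand B": brand_b,
        "West Coast": west_coast,
        "Brand B x West Coast": interaction,
    }


coefs = implied_coefficients(target_means)
for name, value in coefs.items():
    print(f"{name}: {value}")
```

Because each cell holds 50 observations with a standard deviation of 1, a regression fitted to the simulated data with the same coding should give estimates close to these values. The large negative interaction term reflects that Brand B's East Coast advantage almost disappears on the West Coast.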

The CSV and SAV files for this simulated dataset with two categorical variables are available on GitHub.

To get the SAV file, use the download button on its GitHub page.
