Central Limit Theorem: Examples and Explanations

The Central Limit Theorem (CLT) states that when you draw independent random samples of a sufficiently large size from a population with finite variance (regardless of the shape of the population’s original distribution), the sampling distribution of the sample mean will be approximately normal.

This tutorial uses two examples to show that even when the original data do not follow a normal distribution (e.g., a uniform distribution or a binomial distribution), the sample mean approximately follows one.

1. Rolling Dice (uniform distribution)

  1. Single Roll: A fair 6-sided die has outcomes {1, 2, 3, 4, 5, 6}, each equally likely. The distribution is uniform, not normal.
  2. Mean of Rolls: If we roll the die n times and take the average, the distribution of that average starts to resemble a normal distribution as n increases.
  3. Demonstration with Simulation:
    (1) Roll a die 30 times.
    (2) Record the mean of these 30 rolls.
    (3) Repeat the process many times and plot all the means in a histogram plot.
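
The three simulation steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the number of repetitions (2000), the seed, and the text-based histogram are assumptions made for this sketch, not part of the original recipe.

```python
import random
import statistics
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable (an arbitrary choice)

n_rolls = 30          # rolls per experiment, as in step (1)
n_experiments = 2000  # hypothetical value standing in for "many times" in step (3)

def mean_of_rolls(n):
    """Roll a fair six-sided die n times and return the average of the rolls."""
    return statistics.mean([random.randint(1, 6) for _ in range(n)])

# Steps (1) and (2), repeated n_experiments times
means = [mean_of_rolls(n_rolls) for _ in range(n_experiments)]

# Step (3): text histogram. Each mean is bucketed to the nearest 0.2,
# and one '#' is drawn per 10 experiments in the bucket.
buckets = Counter(round(m * 5) / 5 for m in means)
for value in sorted(buckets):
    print(f"{value:4.1f} | {'#' * (buckets[value] // 10)}")

print("average of the means:", round(statistics.mean(means), 3))
print("spread of the means: ", round(statistics.stdev(means), 3))
```

The printed bars should form a single hump centred near 3.5, the expected value of one roll. The spread should come out near 0.31, which is the standard deviation of a single roll (about 1.71) divided by the square root of 30.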

2. Example of Conversion Rate (binomial distribution)

Suppose you are working to calculate the click-through rate for an ad campaign (i.e., the percentage of people who view the ad and then click the advertising link). Specifically, suppose many people see the ad, and you have a record of whether they click on it or not.

You draw a sample of 100 people and calculate the click-through rate, finding that 4 people clicked the ad. You then draw another sample of 100 people from the same dataset (i.e., sampling with replacement) and find that 9 people clicked the ad. You repeat this process until you have 10 samples. The following is the hypothetical data.

Sample 1: 4 clicks, CTR = 4.00%
Sample 2: 9 clicks, CTR = 9.00%
Sample 3: 6 clicks, CTR = 6.00%
Sample 4: 5 clicks, CTR = 5.00%
Sample 5: 3 clicks, CTR = 3.00%
Sample 6: 3 clicks, CTR = 3.00%
Sample 7: 2 clicks, CTR = 2.00%
Sample 8: 7 clicks, CTR = 7.00%
Sample 9: 5 clicks, CTR = 5.00%
Sample 10: 6 clicks, CTR = 6.00%

We can plot a histogram of these 10 click-through rates (CTRs). With only 10 values, it resembles a normal distribution only roughly.

However, if we increase the number of samples from 10 to 200, each still with a sample size of 100, the histogram will look more like a normal distribution.
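
To see the difference between 10 and 200 samples without a plotting library, the comparison can be sketched with text histograms. The true click probability of 5% is an assumption (it matches the value used in the script at the end of this post), and the helper names `sample_ctr` and `text_histogram` are hypothetical.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed for repeatability

true_ctr = 0.05    # assumed true click probability
sample_size = 100  # people per sample

def sample_ctr():
    """Draw one sample of viewers and return its click-through rate in percent."""
    clicks = sum(random.random() < true_ctr for _ in range(sample_size))
    return 100 * clicks / sample_size

def text_histogram(ctrs):
    """Print one row per observed CTR value, with one '#' per sample."""
    counts = Counter(ctrs)
    for value in sorted(counts):
        print(f"{value:5.1f}% | {'#' * counts[value]}")

for n_samples in (10, 200):
    ctrs = [sample_ctr() for _ in range(n_samples)]
    print(f"\n{n_samples} samples of {sample_size} people each")
    text_histogram(ctrs)
```

With 10 samples the rows are usually ragged. With 200 samples they typically form a single hump around 5%, slightly skewed to the right because the click probability is small.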

3. From Binomial Distribution to Normal Distribution

Since each ad view results in either a click (success) or no click (failure), and each view is independent of the others, the number of clicks in a sample of 100 follows a binomial distribution:

X∼Bin(n=100,p)

where:

  • n = 100 (number of people, i.e., ad views, in each sample).
  • p (true probability of a click, which we estimate from data).
  • X (number of clicks observed in each sample).
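
To make the link between the click count X and the CTR explicit, the standard binomial moments can be written out. The symbol \hat{p} (the CTR of one sample) is notation introduced for this note, and p = 0.05 in the last line is an assumed value used only for the arithmetic.

```latex
\begin{align*}
E[X] &= np, \qquad \operatorname{Var}(X) = np(1-p) \\
\hat{p} &= \frac{X}{n}, \qquad E[\hat{p}] = p, \qquad \operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n} \\
\hat{p} &\overset{\text{approx.}}{\sim} \mathcal{N}\!\left(p,\ \frac{p(1-p)}{n}\right) \quad \text{for large } n \\
\sqrt{\frac{0.05 \times 0.95}{100}} &= \sqrt{0.000475} \approx 0.0218 \quad \text{(standard deviation of } \hat{p} \text{ with } n = 100,\ p = 0.05\text{)}
\end{align*}
```

Under that assumption, CTRs from samples of 100 people would typically vary by about 2.2 percentage points around the true rate, which is in line with the spread of the ten hypothetical samples listed earlier.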

When we repeatedly sample (e.g., 10 times, 200 times), we get different CTRs due to random variation. By the Central Limit Theorem (CLT), the distribution of these sample CTRs (i.e., sample means) is approximately normal because each sample is reasonably large (100 people). Drawing more samples (200 rather than 10) does not change that distribution; it only makes the histogram show its shape more clearly.

Key Takeaways:

  1. Individual Click Outcomes: Each person clicking or not follows a Bernoulli distribution (X∼Bern(p)).
  2. Total Clicks in a Sample: The number of successes in a sample of 100 follows a binomial distribution (X∼Bin(100,p)).
  3. Distribution of Sample Means: If we repeatedly take samples and compute CTRs, the sampling distribution of CTRs is approximately normal when the sample size is large, due to the CLT. Taking more samples makes this shape easier to see in a histogram.
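
The three takeaways can be traced in code as a rough check. The sketch below assumes a true click probability of p = 0.05 and uses only the standard library; the variable names are chosen for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

p = 0.05           # assumed true click probability
sample_size = 100  # viewers per sample
n_samples = 200    # number of repeated samples

ctrs = []
for _ in range(n_samples):
    # 1. Individual outcomes: one Bernoulli(p) draw per viewer (1 = click, 0 = no click)
    outcomes = [1 if random.random() < p else 0 for _ in range(sample_size)]
    # 2. Total clicks: the sum of the Bernoulli draws is one Binomial(sample_size, p) draw
    clicks = sum(outcomes)
    # 3. The CTR is the sample mean of the 0/1 outcomes (sum divided by count)
    ctrs.append(clicks / sample_size)

observed_mean = statistics.mean(ctrs)
observed_sd = statistics.stdev(ctrs)
theory_sd = (p * (1 - p) / sample_size) ** 0.5  # sqrt(p(1-p)/n)

print(f"mean of CTRs: {observed_mean:.4f} (theory: {p:.4f})")
print(f"sd of CTRs:   {observed_sd:.4f} (theory: {theory_sd:.4f})")
```

The observed mean and standard deviation of the 200 CTRs should land close to the theoretical values p and sqrt(p(1-p)/n), which are the parameters of the approximating normal distribution.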

There are a few questions that need to be answered.

  1. What exactly is the Central Limit Theorem (CLT)?
    The Central Limit Theorem (CLT) states that when you draw independent random samples of a sufficiently large size from a population with finite variance (regardless of the shape of the population’s original distribution), the sampling distribution of the sample mean will be approximately normal.
  2. We were talking about the sampling distribution of click-through rates (CTRs). Where is the mean?
    In the click-through rate example, each person's outcome can be coded as 1 (click) or 0 (no click), so the mean of the 100 outcomes in a sample is exactly that sample's click-through rate. Thus, the histogram of CTRs is a histogram of sample means, which is approximately normal.
  3. What exactly is the sampling distribution of the sample mean?
    The sampling distribution of the sample mean refers to the distribution of sample means when you repeatedly take random samples from a population and compute their means. For instance, the histogram of 10 CTRs and the histogram of 200 CTRs both approximate the same sampling distribution of the sample mean; the one built from 200 CTRs is the closer approximation.
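
The following Python script simulates 200 samples of 100 people each and plots a histogram of the resulting CTRs:

import numpy as np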
import matplotlib.pyplot as plt

# Parameters
true_ctr = 0.05 # True click-through rate (5%)
n_samples = 200 # Number of times we sample
sample_size = 100 # Sample size per draw

# Generate data
np.random.seed(42) # For reproducibility
click_counts = np.random.binomial(sample_size, true_ctr, n_samples)
click_rates = click_counts / sample_size # Convert counts to proportions

# Display results
for i, ctr in enumerate(click_rates, 1):
    print(f"Sample {i}: {ctr * 100:.2f}% CTR")

# Plot histogram
plt.hist(click_rates * 100, bins=10, edgecolor='black', alpha=0.7)
plt.xlabel('Click-Through Rate (%)', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Histogram of Click-Through Rates Across Samples', fontsize=16)
plt.show()
