This short tutorial shows how to calculate the standard deviation in Python using NumPy. First, we generate random data with a mean of 5 and a standard deviation (SD) of 1. Then we pass the data to NumPy's std() function. As you can see, the standard deviation of the sample is close to 1. Because the data are random, your exact numbers will differ from run to run.
import numpy as np
# mean and standard deviation
mu, sigma = 5, 1
y = np.random.normal(mu, sigma, 100)
print(np.std(y))
1.084308455964664
By default, np.std calculates the population standard deviation. To calculate the sample standard deviation instead, set ddof=1. (ddof stands for "delta degrees of freedom"; the divisor is N - ddof, and the default is ddof=0.)
# Reuse the sample y generated above so the two results are comparable;
# drawing a new random sample here would change the output.
print(np.std(y, ddof=1))
1.0897710016498157
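Because the sample is random, the printed values change on every run. The sketch below fixes a seed so the comparison is reproducible, and checks that the two results differ only by the factor sqrt(N/(N-1)). The seed value and the variable names are arbitrary choices for illustration, not part of the original example.

```python
import numpy as np

# Assumption: seed 0 is an arbitrary choice made only for reproducibility.
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=5, scale=1, size=100)

n = y_demo.size
population_sd = np.std(y_demo)       # ddof=0, divides by N
sample_sd = np.std(y_demo, ddof=1)   # divides by N-1

# The two estimates differ only by a constant factor that depends on N.
print(np.isclose(sample_sd, population_sd * np.sqrt(n / (n - 1))))
```

With N = 100 the factor is about 1.005, which is why the two outputs above are so close. The gap grows as the sample gets smaller.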
Why ddof=1 in NumPy np.std()
You might wonder why ddof=1 is needed to calculate the standard deviation (SD) in NumPy. To begin, the following is the formula that np.std() uses.
\[\sqrt{\frac{1}{N-ddof} \sum_{i=1}^N (x_i - \overline{x})^2}\]
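We set ddof=1 because, typically, we only have a random sample of data from the population rather than the data of the whole population. The SD we calculate is therefore an estimate of the population SD from a random sample (e.g., the one we generate from np.random.normal()). Dividing by N-1 instead of N compensates for the fact that the deviations are measured from the sample mean rather than the true population mean, which otherwise makes the estimate too small on average.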
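To see the effect of the divisor, here is a small simulation sketch. It draws many small samples from a population whose variance is known to be 1 and averages the variance (the squared SD) computed with each divisor. The seed, the sample size of 5, and the number of trials are assumptions chosen for illustration.

```python
import numpy as np

# Assumptions: seed, sample size, and number of trials are illustrative choices.
rng = np.random.default_rng(42)
n_trials, n = 100_000, 5

# Each row is one small random sample from a population with SD = 1 (variance = 1).
samples = rng.normal(loc=5, scale=1, size=(n_trials, n))

# Average the variance (squared SD) over all samples, once per divisor.
mean_var_ddof0 = np.mean(np.var(samples, axis=1, ddof=0))
mean_var_ddof1 = np.mean(np.var(samples, axis=1, ddof=1))

print(mean_var_ddof0)  # tends toward (n-1)/n = 0.8, below the true variance of 1
print(mean_var_ddof1)  # tends toward 1
```

The comparison uses the variance because that is the quantity ddof=1 makes unbiased. The SD, being its square root, is still slightly underestimated on average, but far less than with ddof=0.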
On the other hand, if you have data for the entire population, you do NOT need ddof=1. For instance, if you have the GPA of every student in the university, you have the whole population, and your calculation of SD does not need ddof=1. In this case, ddof=0, and the formula below calculates the SD of population data.
\[\sqrt{\frac{1}{N-ddof} \sum_{i=1}^N (x_i - \overline{x})^2}=\sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \overline{x})^2}\]
However, if you do not have the whole population data, you need to set ddof=1. For instance, if you only have Business School students’ GPAs and you want to estimate the SD of GPAs across the whole university from that sample, you need to set ddof=1.
\[\sqrt{\frac{1}{N-ddof} \sum_{i=1}^N (x_i - \overline{x})^2}=\sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}\]
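As a worked illustration of the two divisors, the sketch below applies both to a tiny dataset. The eight values are hypothetical, picked only because the arithmetic is easy to verify by hand.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to follow.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# The mean is 5, so the squared deviations sum to 9+1+1+1+0+0+4+16 = 32.
sum_sq = np.sum((x - x.mean()) ** 2)
print(sum_sq)               # 32.0

print(np.std(x))            # sqrt(32 / 8) = 2.0
print(np.std(x, ddof=1))    # sqrt(32 / 7), roughly 2.138
```

With only eight values, the choice of divisor changes the result by about 7%, a much larger gap than in the 100-point sample used earlier.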
Write the np.std() Formula from Scratch in Python
We can also check our understanding by writing the standard deviation (SD) formula from scratch in Python. The output of the following code (i.e., 1.084308455964664) is consistent with np.std(y, ddof=0), or equivalently np.std(y).
\[\sqrt{\frac{1}{N-ddof} \sum_{i=1}^N (x_i - \overline{x})^2}=\sqrt{\frac{1}{N-0} \sum_{i=1}^N (x_i - \overline{x})^2}\]
import numpy as np
# mean of the sample y generated earlier
mean_number = np.mean(y)
# setting ddof=0: divide the sum of squared deviations by N
sd_from_scratch1 = np.sqrt((1 / len(y)) * np.sum(np.square(y - mean_number)))
print(sd_from_scratch1)
1.084308455964664
The code below implements the following standard deviation formula, with ddof=1. As expected, the output is consistent with np.std(y, ddof=1) (i.e., 1.0897710016498157).
\[\sqrt{\frac{1}{N-ddof} \sum_{i=1}^N (x_i - \overline{x})^2}=\sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}\]
# setting ddof=1: divide the sum of squared deviations by N-1
sd_from_scratch2 = np.sqrt((1 / (len(y) - 1)) * np.sum(np.square(y - mean_number)))
print(sd_from_scratch2)
1.0897710016498157
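The two from-scratch calculations differ only in the divisor, so they can be folded into one function that takes ddof as a parameter. The following is a minimal sketch; std_from_scratch is a hypothetical helper name, and the seed and test data are assumptions for illustration. It covers 1-D data only, not the axis handling that np.std() also offers.

```python
import numpy as np

def std_from_scratch(values, ddof=0):
    """Standard deviation with a divisor of N - ddof (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    squared_deviations = np.square(values - values.mean())
    return np.sqrt(squared_deviations.sum() / (n - ddof))

# Assumption: seed 1 is arbitrary and only makes the check reproducible.
rng = np.random.default_rng(1)
data = rng.normal(5, 1, 100)

# Both comparisons should print True.
print(np.isclose(std_from_scratch(data), np.std(data)))
print(np.isclose(std_from_scratch(data, ddof=1), np.std(data, ddof=1)))
```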