Calculation of Standard Deviation in pandas and numpy
I noticed slight differences in the results when calculating the standard deviation using pandas and numpy, so I looked into it and took some notes.
GitHub
- The Jupyter notebook file is available on GitHub here
Google Colaboratory
- To run it on Google Colaboratory, use this link
Execution Environment
!sw_vers
ProductName: macOS
ProductVersion: 11.6.7
BuildVersion: 20G630
!python -V
Python 3.8.13
Running with default arguments
import pandas as pd
import numpy as np
pd.Series([i for i in range(5)]).std()
1.5811388300841898
np.std([i for i in range(5)])
1.4142135623730951
The two results differ. According to the documentation, both functions take a `ddof` ("delta degrees of freedom") argument and divide the sum of squared deviations by $n - \mathrm{ddof}$. numpy defaults to `ddof=0` and divides by $n$, while pandas defaults to `ddof=1` and divides by $n-1$.
- Divisor $n$ (`ddof=0`, the numpy default)
$$ s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$
- Divisor $n - 1$ (`ddof=1`, the pandas default)
$$ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$
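As a worked check by hand, assuming the data are the five values 0, 1, 2, 3, 4 used in these notes, the mean is $\bar{x}=2$ and the two formulas give:

$$ \sum_{i=1}^{5}\left(x_{i}-\bar{x}\right)^{2}=4+1+0+1+4=10, \qquad \sqrt{\frac{10}{5}}=\sqrt{2}\approx 1.4142, \qquad \sqrt{\frac{10}{4}}=\sqrt{2.5}\approx 1.5811 $$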
If you pass the same `ddof` value explicitly to both functions, the results match.
print(pd.Series([i for i in range(5)]).std(ddof=0))
print(np.std([i for i in range(5)], ddof=0))
1.4142135623730951
1.4142135623730951
print(pd.Series([i for i in range(5)]).std(ddof=1))
print(np.std([i for i in range(5)], ddof=1))
1.5811388300841898
1.5811388300841898
Computing the values directly from the definitions gives the same results.
x = np.arange(5)
np.sqrt(np.sum((x - x.mean()) ** 2) / len(x))
1.4142135623730951
np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
1.5811388300841898
The two results differ only by a factor of $\sqrt{n/(n-1)}$, which approaches 1 as $n$ grows, so the difference becomes negligible for larger datasets. I took these notes because I ran into slightly different results.
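The shrinking gap can be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper name `std_ratio`, the seed, and the sample sizes are my own choices for illustration, and `ddof=1` stands in for the pandas default so that only numpy is needed.

```python
import numpy as np


def std_ratio(x):
    """Ratio of the ddof=1 result (pandas default) to the ddof=0 result (numpy default)."""
    x = np.asarray(x, dtype=float)
    return np.std(x, ddof=1) / np.std(x, ddof=0)


# Fixed seed so the run is repeatable; the data itself is arbitrary.
rng = np.random.default_rng(0)

for n in (5, 50, 500, 5000):
    x = rng.normal(size=n)
    # The ratio depends only on n, not on the data: sqrt(n / (n - 1)).
    print(n, std_ratio(x), np.sqrt(n / (n - 1)))
```

For $n=5$ the ratio is about 1.118, which matches the two values seen earlier (1.5811 / 1.4142). For a few thousand points it is already within about 0.01% of 1.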