Calculation of Standard Deviation in pandas and numpy

I noticed slight differences in the results when calculating the standard deviation using pandas and numpy, so I looked into it and took some notes.

GitHub

  • The Jupyter notebook file is available on GitHub here

Execution Environment

!sw_vers
ProductName:	macOS
ProductVersion:	11.6.7
BuildVersion:	20G630
!python -V
Python 3.8.13

Running with the default arguments

import pandas as pd
import numpy as np

pd.Series([i for i in range(5)]).std()
1.5811388300841898
np.std([i for i in range(5)])
1.4142135623730951

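The two results differ. According to the documentation, both functions divide the sum of squared deviations by $n - \mathrm{ddof}$, where ddof stands for "delta degrees of freedom". numpy defaults to ddof=0 and therefore divides by $n$ (the population standard deviation), while pandas defaults to ddof=1 and divides by $n-1$ (the sample standard deviation, which has $n-1$ degrees of freedom).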

  • Divisor $n$ (ddof=0, the numpy default)

$$ s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$

  • Divisor $n - 1$ (ddof=1, the pandas default)

$$ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$
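
As a quick check by hand, here is the arithmetic for the data used above, the integers 0 to 4. The mean is $\bar{x}=2$, so the sum of squared deviations is the same in both formulas and only the divisor changes:

$$ \sum_{i=1}^{5}\left(x_{i}-\bar{x}\right)^{2}=4+1+0+1+4=10 $$

$$ \sqrt{\frac{10}{5}}=\sqrt{2}\approx 1.4142, \qquad \sqrt{\frac{10}{4}}=\sqrt{2.5}\approx 1.5811 $$

These are the two values that numpy and pandas returned.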

If you pass the ddof argument explicitly to both functions, the results match.

print(pd.Series([i for i in range(5)]).std(ddof=0))
print(np.std([i for i in range(5)], ddof=0))
1.4142135623730951
1.4142135623730951
print(pd.Series([i for i in range(5)]).std(ddof=1))
print(np.std([i for i in range(5)], ddof=1))
1.5811388300841898
1.5811388300841898
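
The five integers above happen to give bit-identical results. As a rough sketch of how the same check might look on less tidy data, the snippet below compares the two libraries on random floats. The sample (1,000 normal draws with seed 0) and the names rng, data, and results are my own choices for illustration, and the comparison uses np.isclose because I am assuming the two libraries are not guaranteed to round identically.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 1,000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

results = {}
for ddof in (0, 1):
    pandas_std = pd.Series(data).std(ddof=ddof)
    numpy_std = np.std(data, ddof=ddof)
    # Compare with a tolerance instead of ==, in case the two libraries
    # accumulate floating-point rounding differently.
    results[ddof] = bool(np.isclose(pandas_std, numpy_std))
    print(ddof, pandas_std, numpy_std, results[ddof])
```

If both entries in results are True, the two libraries agree for each ddof to within floating-point tolerance.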

Computing both values directly from the definitions above gives the same results.

x = np.arange(5)
# Divide the sum of squared deviations by n = 5 (ddof=0)
np.sqrt(np.sum((x - x.mean()) ** 2) / 5)
1.4142135623730951
# Divide the sum of squared deviations by n - 1 = 4 (ddof=1)
np.sqrt(np.sum((x - x.mean()) ** 2) / 4)
1.5811388300841898

The difference becomes negligible as the dataset grows, but it produced visibly different results on this small example, so I wrote these notes.
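
To see how quickly the gap closes: the two formulas share the same sum of squared deviations, so their ratio is $\sqrt{n/(n-1)}$, which tends to 1 as $n$ grows. The sketch below checks this numerically. The helper std_ratio and the sample sizes are made up for illustration.

```python
import numpy as np

def std_ratio(n):
    """Return np.std(ddof=1) / np.std(ddof=0) for the values 0, 1, ..., n-1."""
    x = np.arange(n, dtype=float)
    return np.std(x, ddof=1) / np.std(x, ddof=0)

for n in (5, 100, 10_000):
    # The last column is the closed-form ratio sqrt(n / (n - 1)).
    print(n, std_ratio(n), np.sqrt(n / (n - 1)))
```

For $n=5$ the ratio is $\sqrt{5/4}\approx 1.118$, which is the roughly 12% gap seen at the top of these notes. For $n=10000$ it is within 0.01% of 1.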