[python] Calculation of Standard Deviation in pandas and numpy

I noticed slight differences in the results when calculating the standard deviation using pandas and numpy, so I looked into it and took some notes.

GitHub

The Jupyter notebook file is available on GitHub here

Google Colaboratory

To run on Google Colaboratory, use this link

Execution Environment

!sw_vers

ProductName:	macOS
ProductVersion:	11.6.7
BuildVersion:	20G630

!python -V

Python 3.8.13

Running with the default arguments

import pandas as pd
import numpy as np

pd.Series([i for i in range(5)]).std()

1.5811388300841898

np.std([i for i in range(5)])

1.4142135623730951

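The two results differ. According to the documentation, this comes from different defaults: numpy uses ddof=0 and divides by $n$ (the population standard deviation), while pandas uses ddof=1 and divides by $n-1$ (the sample standard deviation). Here ddof stands for "delta degrees of freedom", and the divisor is $n - \mathrm{ddof}$.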

Divisor $n$ (ddof=0)

$$ s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$

Divisor $n - 1$ (ddof=1)

$$ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$
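
As a worked example, take the five values $0, 1, 2, 3, 4$ used in the code in this article. The mean is $\bar{x} = 2$ and the sum of squared deviations is $4 + 1 + 0 + 1 + 4 = 10$, so the two formulas give:

$$ s_{n}=\sqrt{\frac{10}{5}}=\sqrt{2}\approx 1.4142, \qquad s_{n-1}=\sqrt{\frac{10}{4}}=\sqrt{2.5}\approx 1.5811 $$

The subscripts on $s$ are labels added here only to tell the two variants apart.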

If you pass the same ddof argument explicitly to both functions, the results match.

print(pd.Series([i for i in range(5)]).std(ddof=0))
print(np.std([i for i in range(5)], ddof=0))

1.4142135623730951
1.4142135623730951

print(pd.Series([i for i in range(5)]).std(ddof=1))
print(np.std([i for i in range(5)], ddof=1))

1.5811388300841898
1.5811388300841898
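
One place where the differing defaults can be easy to miss is when a pandas object is converted to a numpy array partway through a calculation. The following is a minimal sketch, assuming a small Series named s (a name chosen here for illustration): the same data can give a different result depending on whether the pandas method or the numpy array method is called.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))  # illustrative data, same values as in the examples above

pandas_default = s.std()            # pandas method: ddof=1 by default
numpy_default = s.to_numpy().std()  # numpy array method: ddof=0 by default

print(pandas_default)
print(numpy_default)

# Passing ddof explicitly removes the ambiguity.
explicit = s.to_numpy().std(ddof=1)
print(explicit)
```

Writing ddof explicitly at every call site is one way to keep results stable when code moves between the two libraries.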

Computing the values directly from the definitions above gives the same results.

np.sqrt(np.sum((np.arange(5) - np.mean(np.arange(5))) ** 2) / 5)  # divide by n

1.4142135623730951

np.sqrt(np.sum((np.arange(5) - np.mean(np.arange(5))) ** 2) / 4)  # divide by n - 1

1.5811388300841898
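
The two definitions can also be wrapped into a single function with a ddof parameter, mirroring how both libraries expose the choice. This is a minimal sketch; std_with_ddof is a hypothetical helper written for this note, not a function from numpy or pandas.

```python
import numpy as np
import pandas as pd


def std_with_ddof(values, ddof=0):
    """Standard deviation with divisor n - ddof (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    squared_deviations = (x - x.mean()) ** 2
    return np.sqrt(squared_deviations.sum() / (len(x) - ddof))


data = list(range(5))

for ddof in (0, 1):
    manual = std_with_ddof(data, ddof=ddof)
    # Print the manual value next to what each library returns for the same ddof.
    print(ddof, manual, np.std(data, ddof=ddof), pd.Series(data).std(ddof=ddof))
```

For each ddof, the three printed values should agree up to floating-point rounding.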

The difference becomes negligible for larger datasets, but I wrote these notes because the slightly different results caught my attention.
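
To see how quickly the gap closes, the ratio of the two results can be compared against $\sqrt{n/(n-1)}$, which follows from dividing the $n-1$ formula by the $n$ formula. The sketch below uses randomly generated data with a fixed seed purely for illustration; the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

ratios = {}
for n in (5, 100, 10_000):
    x = rng.normal(size=n)
    ratio = np.std(x, ddof=1) / np.std(x, ddof=0)
    ratios[n] = ratio
    # The ratio depends only on n, not on the data itself.
    print(n, ratio, np.sqrt(n / (n - 1)))
```

The ratio approaches 1 as n grows, which is why the choice of ddof matters most for small samples.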