## Calculation of Standard Deviation in pandas and numpy

I noticed slight differences in the results when calculating the standard deviation using pandas and numpy, so I looked into it and took some notes.

### GitHub

- The Jupyter notebook file is available on GitHub here

### Google Colaboratory

- To run on Google Colaboratory, use this link

### Execution Environment

```
!sw_vers
```

```
ProductName: macOS
ProductVersion: 11.6.7
BuildVersion: 20G630
```

```
!python -V
```

```
Python 3.8.13
```

### Running with the default arguments

```
import pandas as pd
import numpy as np
pd.Series([i for i in range(5)]).std()
```

```
1.5811388300841898
```

```
np.std([i for i in range(5)])
```

```
1.4142135623730951
```

The two results differ. According to the documentation, numpy's `np.std` defaults to `ddof=0` and divides by $n$, while pandas' `Series.std` defaults to `ddof=1` and divides by $n-1$. In other words, numpy uses degrees of freedom $n$ by default and pandas uses $n-1$.

- Degrees of freedom $n$ (`ddof=0`, the numpy default)

$$ s=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$

- Degrees of freedom $n - 1$ (`ddof=1`, the pandas default)

$$ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} $$
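As a quick check by hand, the two formulas can be applied to the data used in these notes, $x = (0, 1, 2, 3, 4)$ with $n = 5$. This is only a worked example of the formulas above, rounded to four decimal places:

$$ \bar{x} = 2, \qquad \sum_{i=1}^{5}\left(x_{i}-\bar{x}\right)^{2} = 4 + 1 + 0 + 1 + 4 = 10 $$

$$ \sqrt{\frac{10}{5}} = \sqrt{2} \approx 1.4142, \qquad \sqrt{\frac{10}{4}} = \sqrt{2.5} \approx 1.5811 $$

The first value corresponds to the numpy result and the second to the pandas result.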

If you pass the `ddof` argument explicitly to both functions, the results match.

```
print(pd.Series([i for i in range(5)]).std(ddof=0))
print(np.std([i for i in range(5)], ddof=0))
```

```
1.4142135623730951
1.4142135623730951
```

```
print(pd.Series([i for i in range(5)]).std(ddof=1))
print(np.std([i for i in range(5)], ddof=1))
```

```
1.5811388300841898
1.5811388300841898
```
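One place where the different defaults can go unnoticed is when a Series is converted to a numpy array partway through a calculation. The following is a minimal sketch of that situation, using the same data as above; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))

# Series.std uses ddof=1 unless told otherwise.
pandas_default = s.std()

# After converting to an ndarray, ndarray.std uses ddof=0 unless told otherwise.
numpy_default = s.to_numpy().std()

print(pandas_default, numpy_default)

# Passing ddof explicitly removes the ambiguity.
print(np.isclose(s.std(ddof=0), numpy_default))
```

Writing `ddof` explicitly at every call site is probably the simplest way to avoid depending on either library's default.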

Computing the standard deviation by hand for each choice of degrees of freedom reproduces both values.

```
x = np.arange(5)
np.sqrt(np.sum((x - x.mean()) ** 2) / 5)  # squared deviations from the mean, divided by n
```

```
1.4142135623730951
```

```
x = np.arange(5)
np.sqrt(np.sum((x - x.mean()) ** 2) / 4)  # squared deviations from the mean, divided by n - 1
```

```
1.5811388300841898
```

The two results differ by a factor of $\sqrt{n/(n-1)}$, so the difference becomes negligible with larger datasets. With a small sample like this one it is noticeable, which is why I took these notes.
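To see how quickly the gap closes, the sketch below compares the two defaults for a few sample sizes. The helper `std_ratio` and the chosen sizes are assumptions made for this illustration; the data is simply `0, 1, ..., n-1` as in the examples above.

```python
import numpy as np
import pandas as pd


def std_ratio(n):
    """Ratio of the pandas default (ddof=1) to the numpy default (ddof=0) for 0..n-1."""
    data = np.arange(n)
    return pd.Series(data).std() / np.std(data)


for n in (5, 100, 10_000):
    # The measured ratio should agree with sqrt(n / (n - 1)) up to rounding error.
    print(n, std_ratio(n), np.sqrt(n / (n - 1)))
```

The ratio approaches 1 as `n` grows, which is why the choice of `ddof` rarely matters for large datasets.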