pandas and data analysis
pandas is an important tool that you must use in your data analysis. Whether you know how to use it or not, or whether you can move your hands quickly without having to do a Google search for what you want to do, will directly affect your ability as an engineer. In this section, I will explain the methods and how to use them that I think are important based on my experience with specific data. If I come across any other important usage, I will add it to this page.
github
- The file in jupyter notebook format is here
google colaboratory
- If you want to run it in google colaboratory here pandas_nb.ipynb)
Author’s environment
The author’s OS is macOS, and the options are different from Linux and Unix commands.
! sw_vers
```` ProductName: Mac OS X
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G6020
```python
Python -V
```python !
Python 3.7.3
Import the basic libraries and check their versions.
```python
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np
print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)
matplotlib version : 3.0.3
scipy version : 1.4.1
numpy version : 1.16.2
import and version checking
``python import pandas as pd
print(‘pandas version :’, pd.version)
pandas version : 1.0.3
## Basic operations
## Loading and displaying data
The data we will be using is from [daniel's github](https://github.com/chendaniely/pandas_for_everyone), the repository of a famous author who has written a book on how to use pandas. [Basic coding for Python data analysis/machine learning! Introduction to using the pandas library](https://www.amazon.co.jp/dp/B07NZP6V29/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1). I also have it. I think it is very informative.
I think the data is the number of Ebola outbreaks (Cases) and deaths (Deaths).
I will use `read_csv` to read the CSV and display the first 5 lines.
Using ``python
import pandas as pd
df = pd.read_csv('. /country_timeseries.csv', sep=',')
df.head()
```
<div>.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th { vertical-align: right; }
text-align: right;
}
</style
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Date</th>
<th>Day</th>.
<th>Cases_Guinea</th> <th>Cases_Guinea</th
<th>Cases_Liberia</th> <th>Cases_Sierra</th
<th>Cases_SierraLeone</th
<th>Cases_Nigeria</th
<th>Cases_Senegal</th
<th>Cases_UnitedStates</th
<th>Cases_Spain</th
<th>Cases_Mali</th
<th>Deaths_Guinea</th> <th>Deaths_Guinea</th
<th>Deaths_Liberia</th> <th>Deaths_Liberia</th
<th>Deaths_SierraLeone</th
<th>Deaths_Nigeria</th> <th>Deaths_SierraLeone</th
<th>Deaths_Senegal</th> <th>Deaths_Senegal</th
<th>Deaths_UnitedStates</th
<th>Deaths_Spain</th
<th>Deaths_Mali</th>.
</tr>.
</thead>
<tbody>.
<tr>.
<th>0</th>.
<td>1/5/2015</td>
<td>289</td>
<td>2776.0</td>
<td>NaN</td>
<td>10030.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>1786.0</td>
<td>NaN</td>
<td>2977.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1/4/2015</td>
<td>288</td>
<td>2775.0</td>
<td>NaN</td>
<td>9780.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>1781.0</td>
<td>NaN</td>
<td>2943.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>1/3/2015</td>
<td>287</td>
<td>2769.0</td>
<td>8166.0</td>
<td>9722.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>1767.0</td>
<td>3496.0</td>
<td>2915.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th> <td>1/2/2015</td>
<td>1/2/2015</td>
<td>286</td>
<td>NaN</td>
<td>8157.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>12/31/2014</td>
<td>284</td>
<td>2730.0</td>
<td>8115.0</td>
<td>9633.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>1739.0</td>
<td>3471.0</td>
<td>2827.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
Display the last 5 data.
```python
df.tail()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Date</th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Guinea</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
</thead>
<tbody>
<tr>
<th>117</th>
<td>3/27/2014</td>
<td>5</td>
<td>103.0</td>
<td>8.0</td>
<td>6.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>66.0</td>
<td>6.0</td>
<td>5.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>118</th>
<td>3/26/2014</td>
<td>4</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>62.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>119</th>
<td>3/25/2014</td>
<td>3</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>60.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>120</th>
<td>3/24/2014</td>
<td>2</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>59.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>121</th>
<td>3/22/2014</td>
<td>0</td>
<td>49.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>29.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
### Check your data.
#### Get data type and other information
```python
df.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 122 non-null object
1 Day 122 non-null int64
2 Cases_Guinea 93 non-null float64
3 Cases_Liberia 83 non-null float64
4 Cases_SierraLeone 87 non-null float64
5 Cases_Nigeria 38 non-null float64
6 Cases_Senegal 25 non-null float64
7 Cases_UnitedStates 18 non-null float64
8 Cases_Spain 16 non-null float64
9 Cases_Mali 12 non-null float64
10 Deaths_Guinea 92 non-null float64
11 Deaths_Liberia 81 non-null float64
12 Deaths_SierraLeone 87 non-null float64
13 Deaths_Nigeria 38 non-null float64
14 Deaths_Senegal 22 non-null float64
15 Deaths_UnitedStates 18 non-null float64
16 Deaths_Spain 16 non-null float64
17 Deaths_Mali 12 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.3+ KB
#### 大きさ(行数と列数)の確認
```python
df.shape
```
(122, 18)
#### インデックスの確認
```python
df.index
```
RangeIndex(start=0, stop=122, step=1)
#### カラム名の確認
```python
df.columns
```
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
#### Retrieving data for any column name
Specify a column name to display only the desired column.
```python
df[['Cases_UnitedStates','Deaths_UnitedStates']].head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Cases_UnitedStates</th>
<th>Deaths_UnitedStates</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
#### Retrieve data by specifying the number of rows and columns
```python
df.iloc[[6,7],[0,3]]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Date</th>
<th>Cases_Liberia</th>
</tr>
</thead>
<tbody>
<tr>
<th>6</th>
<td>12/27/2014</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>12/24/2014</td>
<td>7977.0</td>
</tr>
</tbody>
</table>
</div>
#### Retrieve data that satisfies certain conditions
```python
df[df['Deaths_Liberia'] > 3000][['Deaths_Liberia']]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Deaths_Liberia</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>3496.0</td>
</tr>
<tr>
<th>3</th>
<td>3496.0</td>
</tr>
<tr>
<th>4</th>
<td>3471.0</td>
</tr>
<tr>
<th>5</th>
<td>3423.0</td>
</tr>
<tr>
<th>7</th>
<td>3413.0</td>
</tr>
<tr>
<th>9</th>
<td>3384.0</td>
</tr>
<tr>
<th>10</th>
<td>3376.0</td>
</tr>
<tr>
<th>12</th>
<td>3290.0</td>
</tr>
<tr>
<th>14</th>
<td>3177.0</td>
</tr>
<tr>
<th>16</th>
<td>3145.0</td>
</tr>
<tr>
<th>18</th>
<td>3016.0</td>
</tr>
</tbody>
</table>
</div>
#### Deleting a column
We are deleting the column Deaths_Guinea.
```python
df.drop(['Deaths_Guinea'], axis=1, inplace=True)
df.columns
```
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
'Cases_Mali', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria',
'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
#### Getting statistics
Using describe(), you can get the statistics for each column. This is powerful when you want to get a quick overview.
```python
df.describe()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>122.000000</td>
<td>93.000000</td>
<td>83.000000</td>
<td>87.000000</td>
<td>38.000000</td>
<td>25.00</td>
<td>18.000000</td>
<td>16.0</td>
<td>12.000000</td>
<td>81.000000</td>
<td>87.000000</td>
<td>38.000000</td>
<td>22.0</td>
<td>18.000000</td>
<td>16.000000</td>
<td>12.000000</td>
</tr>
<tr>
<th>mean</th>
<td>144.778689</td>
<td>911.064516</td>
<td>2335.337349</td>
<td>2427.367816</td>
<td>16.736842</td>
<td>1.08</td>
<td>3.277778</td>
<td>1.0</td>
<td>3.500000</td>
<td>1101.209877</td>
<td>693.701149</td>
<td>6.131579</td>
<td>0.0</td>
<td>0.833333</td>
<td>0.187500</td>
<td>3.166667</td>
</tr>
<tr>
<th>std</th>
<td>89.316460</td>
<td>849.108801</td>
<td>2987.966721</td>
<td>3184.803996</td>
<td>5.998577</td>
<td>0.40</td>
<td>1.178511</td>
<td>0.0</td>
<td>2.746899</td>
<td>1297.208568</td>
<td>869.947073</td>
<td>2.781901</td>
<td>0.0</td>
<td>0.383482</td>
<td>0.403113</td>
<td>2.405801</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>49.000000</td>
<td>3.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>1.00</td>
<td>1.000000</td>
<td>1.0</td>
<td>1.000000</td>
<td>2.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.0</td>
<td>0.000000</td>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>25%</th>
<td>66.250000</td>
<td>236.000000</td>
<td>25.500000</td>
<td>64.500000</td>
<td>15.000000</td>
<td>1.00</td>
<td>3.000000</td>
<td>1.0</td>
<td>1.000000</td>
<td>12.000000</td>
<td>6.000000</td>
<td>4.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>50%</th>
<td>150.000000</td>
<td>495.000000</td>
<td>516.000000</td>
<td>783.000000</td>
<td>20.000000</td>
<td>1.00</td>
<td>4.000000</td>
<td>1.0</td>
<td>2.500000</td>
<td>294.000000</td>
<td>334.000000</td>
<td>8.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>0.000000</td>
<td>2.000000</td>
</tr>
<tr>
<th>75%</th>
<td>219.500000</td>
<td>1519.000000</td>
<td>4162.500000</td>
<td>3801.000000</td>
<td>20.000000</td>
<td>1.00</td>
<td>4.000000</td>
<td>1.0</td>
<td>6.250000</td>
<td>2413.000000</td>
<td>1176.000000</td>
<td>8.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>0.000000</td>
<td>6.000000</td>
</tr>
<tr>
<th>max</th>
<td>289.000000</td>
<td>2776.000000</td>
<td>8166.000000</td>
<td>10030.000000</td>
<td>22.000000</td>
<td>3.00</td>
<td>4.000000</td>
<td>1.0</td>
<td>7.000000</td>
<td>3496.000000</td>
<td>2977.000000</td>
<td>8.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>1.000000</td>
<td>6.000000</td>
</tr>
</tbody>
</table>
</div>
You can also use the value_counts method to easily find the frequency of a value. It is a little difficult to understand because the data used in this project is a continuous quantity, but we can see that the number of data for 0.0 is 15. You can see that the number of data for 0.0 is 15, and all the other data are 1.
```python
df['Deaths_Liberia'].value_counts()
```
11.0 9
12.0 4
3496.0 2
13.0 2
24.0 2
..
2963.0 1
88.0 1
3413.0 1
3177.0 1
105.0 1
Name: Deaths_Liberia, Length: 66, dtype: int64
## Change the index to datetime type
Change the index to Date and overwrite it. For time-series data, it is often easier to analyze if the index is a date. However, rather than simply indexing it as a string, converting it to datetime type, which is a standard feature of panda, will make it more convenient for aggregation.
Change the index name "Date" to YYYYMMDD. Use the rename function.
```python
df.rename(columns={'Date':'YYYYMMDD'}, inplace=True)
df.set_index('YYYYMMDD', inplace=True)
df.index
```
Index(['1/5/2015', '1/4/2015', '1/3/2015', '1/2/2015', '12/31/2014',
'12/28/2014', '12/27/2014', '12/24/2014', '12/21/2014', '12/20/2014',
...
'4/4/2014', '4/1/2014', '3/31/2014', '3/29/2014', '3/28/2014',
'3/27/2014', '3/26/2014', '3/25/2014', '3/24/2014', '3/22/2014'],
dtype='object', name='YYYYMMDD', length=122)
```python
df.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1/5/2015</th>
<td>289</td>
<td>2776.0</td>
<td>NaN</td>
<td>10030.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2977.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/4/2015</th>
<td>288</td>
<td>2775.0</td>
<td>NaN</td>
<td>9780.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2943.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/3/2015</th>
<td>287</td>
<td>2769.0</td>
<td>8166.0</td>
<td>9722.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>2915.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/2/2015</th>
<td>286</td>
<td>NaN</td>
<td>8157.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>12/31/2014</th>
<td>284</td>
<td>2730.0</td>
<td>8115.0</td>
<td>9633.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3471.0</td>
<td>2827.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
df.columns
```
Index(['Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
'Cases_Mali', 'Deaths_Liberia', 'Deaths_SierraLeone', 'Deaths_Nigeria',
'Deaths_Senegal', 'Deaths_UnitedStates', 'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
Sort by index. However, the date is a string object, so it is not sorted as I had planned.
```python
df.sort_index(ascending=True).head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1/2/2015</th>
<td>286</td>
<td>NaN</td>
<td>8157.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/3/2015</th>
<td>287</td>
<td>2769.0</td>
<td>8166.0</td>
<td>9722.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>2915.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/4/2015</th>
<td>288</td>
<td>2775.0</td>
<td>NaN</td>
<td>9780.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2943.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1/5/2015</th>
<td>289</td>
<td>2776.0</td>
<td>NaN</td>
<td>10030.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2977.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10/1/2014</th>
<td>193</td>
<td>1199.0</td>
<td>3834.0</td>
<td>2437.0</td>
<td>20.0</td>
<td>1.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>2069.0</td>
<td>623.0</td>
<td>8.0</td>
<td>0.0</td>
<td>0.0</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
Change the index to a datetime type.
```python
df.index
```
Index(['1/5/2015', '1/4/2015', '1/3/2015', '1/2/2015', '12/31/2014',
'12/28/2014', '12/27/2014', '12/24/2014', '12/21/2014', '12/20/2014',
...
'4/4/2014', '4/1/2014', '3/31/2014', '3/29/2014', '3/28/2014',
'3/27/2014', '3/26/2014', '3/25/2014', '3/24/2014', '3/22/2014'],
dtype='object', name='YYYYMMDD', length=122)
```python
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
df.index
```
DatetimeIndex(['2015-01-05', '2015-01-04', '2015-01-03', '2015-01-02',
'2014-12-31', '2014-12-28', '2014-12-27', '2014-12-24',
'2014-12-21', '2014-12-20',
...
'2014-04-04', '2014-04-01', '2014-03-31', '2014-03-29',
'2014-03-28', '2014-03-27', '2014-03-26', '2014-03-25',
'2014-03-24', '2014-03-22'],
dtype='datetime64[ns]', name='YYYYMMDD', length=122, freq=None)
We can see that the type has been changed from dtype='object' to object='datetime64', which is a datetime type. So let's sort it.
```python
df.sort_index(ascending=True, inplace=True)
df.head(10)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-22</th>
<td>0</td>
<td>49.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-24</th>
<td>2</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-25</th>
<td>3</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-26</th>
<td>4</td>
<td>86.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-27</th>
<td>5</td>
<td>103.0</td>
<td>8.0</td>
<td>6.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>6.0</td>
<td>5.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-28</th>
<td>6</td>
<td>112.0</td>
<td>3.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-29</th>
<td>7</td>
<td>112.0</td>
<td>7.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-31</th>
<td>9</td>
<td>122.0</td>
<td>8.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>4.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-04-01</th>
<td>10</td>
<td>127.0</td>
<td>8.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>5.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-04-04</th>
<td>13</td>
<td>143.0</td>
<td>18.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>7.0</td>
<td>2.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
df.tail(10)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-12-20</th>
<td>272</td>
<td>2571.0</td>
<td>7862.0</td>
<td>8939.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3384.0</td>
<td>2556.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-21</th>
<td>273</td>
<td>2597.0</td>
<td>NaN</td>
<td>9004.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2582.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-24</th>
<td>277</td>
<td>2630.0</td>
<td>7977.0</td>
<td>9203.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3413.0</td>
<td>2655.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-27</th>
<td>280</td>
<td>2695.0</td>
<td>NaN</td>
<td>9409.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2732.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-28</th>
<td>281</td>
<td>2706.0</td>
<td>8018.0</td>
<td>9446.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3423.0</td>
<td>2758.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>284</td>
<td>2730.0</td>
<td>8115.0</td>
<td>9633.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3471.0</td>
<td>2827.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>286</td>
<td>NaN</td>
<td>8157.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>287</td>
<td>2769.0</td>
<td>8166.0</td>
<td>9722.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>2915.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>288</td>
<td>2775.0</td>
<td>NaN</td>
<td>9780.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2943.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>289</td>
<td>2776.0</td>
<td>NaN</td>
<td>10030.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2977.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
This shows that the sorting is working as expected.
Also, since the datetime type is now set as the index, it is easier to handle dates.
For example, to get the data for the year 2015, we can use the following
```python
df['2015']
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2015-01-02</th>
<td>286</td>
<td>NaN</td>
<td>8157.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>287</td>
<td>2769.0</td>
<td>8166.0</td>
<td>9722.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.0</td>
<td>2915.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>288</td>
<td>2775.0</td>
<td>NaN</td>
<td>9780.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2943.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>289</td>
<td>2776.0</td>
<td>NaN</td>
<td>10030.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2977.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
df['2014-12'].sort_index(ascending=True)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-12-03</th>
<td>256</td>
<td>NaN</td>
<td>7719.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3177.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-07</th>
<td>260</td>
<td>2292.0</td>
<td>NaN</td>
<td>7897.0</td>
<td>20.0</td>
<td>1.0</td>
<td>4.0</td>
<td>1.0</td>
<td>7.0</td>
<td>NaN</td>
<td>1768.0</td>
<td>8.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>6.0</td>
</tr>
<tr>
<th>2014-12-09</th>
<td>262</td>
<td>NaN</td>
<td>7797.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3290.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-14</th>
<td>267</td>
<td>2416.0</td>
<td>NaN</td>
<td>8356.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2085.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-18</th>
<td>271</td>
<td>NaN</td>
<td>7830.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3376.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-20</th>
<td>272</td>
<td>2571.0</td>
<td>7862.0</td>
<td>8939.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3384.0</td>
<td>2556.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-21</th>
<td>273</td>
<td>2597.0</td>
<td>NaN</td>
<td>9004.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2582.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-24</th>
<td>277</td>
<td>2630.0</td>
<td>7977.0</td>
<td>9203.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3413.0</td>
<td>2655.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-27</th>
<td>280</td>
<td>2695.0</td>
<td>NaN</td>
<td>9409.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>2732.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-28</th>
<td>281</td>
<td>2706.0</td>
<td>8018.0</td>
<td>9446.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3423.0</td>
<td>2758.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>284</td>
<td>2730.0</td>
<td>8115.0</td>
<td>9633.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3471.0</td>
<td>2827.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
Furthermore, statistics such as averages and totals can be easily obtained on a yearly or monthly basis.
```python
df.resample('Y').mean()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-12-31</th>
<td>139.940678</td>
<td>848.988889</td>
<td>2191.481481</td>
<td>2162.488095</td>
<td>16.736842</td>
<td>1.08</td>
<td>3.277778</td>
<td>1.0</td>
<td>3.5</td>
<td>1040.582278</td>
<td>613.297619</td>
<td>6.131579</td>
<td>0.0</td>
<td>0.833333</td>
<td>0.1875</td>
<td>3.166667</td>
</tr>
<tr>
<th>2015-12-31</th>
<td>287.500000</td>
<td>2773.333333</td>
<td>8161.500000</td>
<td>9844.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.000000</td>
<td>2945.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
df.resample('M').mean()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-31</th>
<td>4.500000</td>
<td>94.500000</td>
<td>6.500000</td>
<td>3.333333</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3.750000</td>
<td>3.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-04-30</th>
<td>24.333333</td>
<td>177.818182</td>
<td>24.555556</td>
<td>2.200000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>9.625000</td>
<td>1.111111</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-05-31</th>
<td>51.888889</td>
<td>248.777778</td>
<td>12.555556</td>
<td>7.333333</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.111111</td>
<td>1.222222</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-06-30</th>
<td>84.636364</td>
<td>373.428571</td>
<td>35.500000</td>
<td>125.571429</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>28.000000</td>
<td>29.375000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-07-31</th>
<td>115.700000</td>
<td>423.000000</td>
<td>212.300000</td>
<td>420.500000</td>
<td>1.333333</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>121.300000</td>
<td>189.500000</td>
<td>0.666667</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-08-31</th>
<td>145.090909</td>
<td>559.818182</td>
<td>868.818182</td>
<td>844.000000</td>
<td>13.363636</td>
<td>1.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>468.454545</td>
<td>353.000000</td>
<td>3.545455</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-09-30</th>
<td>177.500000</td>
<td>967.888889</td>
<td>2815.625000</td>
<td>1726.000000</td>
<td>20.714286</td>
<td>1.285714</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>1508.000000</td>
<td>565.777778</td>
<td>8.000000</td>
<td>0.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2014-10-31</th>
<td>207.470588</td>
<td>1500.444444</td>
<td>4758.750000</td>
<td>3668.111111</td>
<td>20.000000</td>
<td>1.000000</td>
<td>2.555556</td>
<td>1.0</td>
<td>1.0</td>
<td>2419.000000</td>
<td>1151.666667</td>
<td>8.000000</td>
<td>0.0</td>
<td>0.666667</td>
<td>0.428571</td>
<td>1.000</td>
</tr>
<tr>
<th>2014-11-30</th>
<td>237.214286</td>
<td>1950.500000</td>
<td>7039.000000</td>
<td>5843.625000</td>
<td>20.000000</td>
<td>1.000000</td>
<td>4.000000</td>
<td>1.0</td>
<td>4.0</td>
<td>2928.857143</td>
<td>1256.750000</td>
<td>8.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>0.000000</td>
<td>3.625</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>271.181818</td>
<td>2579.625000</td>
<td>7902.571429</td>
<td>8985.875000</td>
<td>20.000000</td>
<td>1.000000</td>
<td>4.000000</td>
<td>1.0</td>
<td>7.0</td>
<td>3362.000000</td>
<td>2495.375000</td>
<td>8.000000</td>
<td>0.0</td>
<td>1.000000</td>
<td>0.000000</td>
<td>6.000</td>
</tr>
<tr>
<th>2015-01-31</th>
<td>287.500000</td>
<td>2773.333333</td>
<td>8161.500000</td>
<td>9844.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>3496.000000</td>
<td>2945.000000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
df.resample('Y').sum()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-12-31</th>
<td>16513</td>
<td>76409.0</td>
<td>177510.0</td>
<td>181649.0</td>
<td>636.0</td>
<td>27.0</td>
<td>59.0</td>
<td>16.0</td>
<td>42.0</td>
<td>82206.0</td>
<td>51517.0</td>
<td>233.0</td>
<td>0.0</td>
<td>15.0</td>
<td>3.0</td>
<td>38.0</td>
</tr>
<tr>
<th>2015-12-31</th>
<td>1150</td>
<td>8320.0</td>
<td>16323.0</td>
<td>29532.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>6992.0</td>
<td>8835.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div>
```python
df.resample('M').sum()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-31</th>
<td>36</td>
<td>756.0</td>
<td>26.0</td>
<td>10.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>15.0</td>
<td>9.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-04-30</th>
<td>365</td>
<td>1956.0</td>
<td>221.0</td>
<td>22.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>77.0</td>
<td>10.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-05-31</th>
<td>467</td>
<td>2239.0</td>
<td>113.0</td>
<td>66.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>100.0</td>
<td>11.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-06-30</th>
<td>931</td>
<td>2614.0</td>
<td>284.0</td>
<td>879.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>196.0</td>
<td>235.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-07-31</th>
<td>1157</td>
<td>4230.0</td>
<td>2123.0</td>
<td>4205.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1213.0</td>
<td>1895.0</td>
<td>2.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-08-31</th>
<td>1596</td>
<td>6158.0</td>
<td>9557.0</td>
<td>9284.0</td>
<td>147.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>5153.0</td>
<td>3883.0</td>
<td>39.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-09-30</th>
<td>2130</td>
<td>8711.0</td>
<td>22525.0</td>
<td>15534.0</td>
<td>145.0</td>
<td>9.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>12064.0</td>
<td>5092.0</td>
<td>56.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2014-10-31</th>
<td>3527</td>
<td>13504.0</td>
<td>38070.0</td>
<td>33013.0</td>
<td>160.0</td>
<td>8.0</td>
<td>23.0</td>
<td>7.0</td>
<td>3.0</td>
<td>19352.0</td>
<td>10365.0</td>
<td>64.0</td>
<td>0.0</td>
<td>6.0</td>
<td>3.0</td>
<td>3.0</td>
</tr>
<tr>
<th>2014-11-30</th>
<td>3321</td>
<td>15604.0</td>
<td>49273.0</td>
<td>46749.0</td>
<td>160.0</td>
<td>8.0</td>
<td>32.0</td>
<td>8.0</td>
<td>32.0</td>
<td>20502.0</td>
<td>10054.0</td>
<td>64.0</td>
<td>0.0</td>
<td>8.0</td>
<td>0.0</td>
<td>29.0</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>2983</td>
<td>20637.0</td>
<td>55318.0</td>
<td>71887.0</td>
<td>20.0</td>
<td>1.0</td>
<td>4.0</td>
<td>1.0</td>
<td>7.0</td>
<td>23534.0</td>
<td>19963.0</td>
<td>8.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>6.0</td>
</tr>
<tr>
<th>2015-01-31</th>
<td>1150</td>
<td>8320.0</td>
<td>16323.0</td>
<td>29532.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>6992.0</td>
<td>8835.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div>
It is very useful. Yet another advantage of the datetime type is that you can use methods such as `.year` and `.month` to get the year and month of the year.
```python
df.index.year
```
Int64Index([2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
...
2014, 2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
dtype='int64', name='YYYYMMDD', length=122)
```python
df.index.month
```
Int64Index([ 3, 3, 3, 3, 3, 3, 3, 3, 4, 4,
...
12, 12, 12, 12, 12, 12, 1, 1, 1, 1],
dtype='int64', name='YYYYMMDD', length=122)
```python
df.index.day
```
Int64Index([22, 24, 25, 26, 27, 28, 29, 31, 1, 4,
...
20, 21, 24, 27, 28, 31, 2, 3, 4, 5],
dtype='int64', name='YYYYMMDD', length=122)
## Cut Processing (Creating a Histogram)
When you are analyzing data, there are often situations where you want to divide the data under certain conditions and aggregate them. For example, you may want to aggregate the data by season. Here's an example of labeling by season, though the months and seasons don't match.
```python
labels = ['spring', 'summer', 'fall', 'winter'].
df['season'] = pd.cut(list(df.index.month), bins=[0,3,6,9,12], labels=labels, right=True)
df[['season']][5:10]]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>season</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-28</th>
<td>春</td>
</tr>
<tr>
<th>2014-03-29</th>
<td>春</td>
</tr>
<tr>
<th>2014-03-31</th>
<td>春</td>
</tr>
<tr>
<th>2014-04-01</th>
<td>夏</td>
</tr>
<tr>
<th>2014-04-04</th>
<td>夏</td>
</tr>
</tbody>
</table>
</div>
```python
df[['season']][73:78]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>season</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-09-21</th>
<td>秋</td>
</tr>
<tr>
<th>2014-09-23</th>
<td>秋</td>
</tr>
<tr>
<th>2014-09-28</th>
<td>秋</td>
</tr>
<tr>
<th>2014-10-01</th>
<td>冬</td>
</tr>
<tr>
<th>2014-10-04</th>
<td>冬</td>
</tr>
</tbody>
</table>
</div>
## How to use query, where, and mask (and sort)
You can use query, where, etc. in the same way as numpy. The usage is intuitively the same as numpy, so you can use it right away. Let's run a query on the number of infected and dead.
The query specifies the conditional expression to be extracted.
```python
df[['Deaths_Liberia','Cases_Liberia']].query('Deaths_Liberia > 100 and Cases_Liberia > 7000')
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Deaths_Liberia</th>
<th>Cases_Liberia</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-11-15</th>
<td>2964.0</td>
<td>7069.0</td>
</tr>
<tr>
<th>2014-11-18</th>
<td>2963.0</td>
<td>7082.0</td>
</tr>
<tr>
<th>2014-11-22</th>
<td>3016.0</td>
<td>7168.0</td>
</tr>
<tr>
<th>2014-11-28</th>
<td>3145.0</td>
<td>7635.0</td>
</tr>
<tr>
<th>2014-12-03</th>
<td>3177.0</td>
<td>7719.0</td>
</tr>
<tr>
<th>2014-12-09</th>
<td>3290.0</td>
<td>7797.0</td>
</tr>
<tr>
<th>2014-12-18</th>
<td>3376.0</td>
<td>7830.0</td>
</tr>
<tr>
<th>2014-12-20</th>
<td>3384.0</td>
<td>7862.0</td>
</tr>
<tr>
<th>2014-12-24</th>
<td>3413.0</td>
<td>7977.0</td>
</tr>
<tr>
<th>2014-12-28</th>
<td>3423.0</td>
<td>8018.0</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>3471.0</td>
<td>8115.0</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>3496.0</td>
<td>8157.0</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>3496.0</td>
<td>8166.0</td>
</tr>
</tbody>
</table>
</div>
If the where condition is also specified, data that satisfies the condition will be returned as is, and data that does not satisfy the condition will be returned with NaN stored in it.
```python
df[['Deaths_Liberia']].where(df['Deaths_Liberia'] > 1000)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Deaths_Liberia</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-22</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-24</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-25</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-26</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-27</th>
<td>NaN</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>3471.0</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>3496.0</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>3496.0</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>122 rows × 1 columns</p>
</div>
You can also put another number instead of NaN. This is the same as in numpy and is very helpful.
```python
df[['Deaths_Liberia']].where(df['Deaths_Liberia'] > 3000,0)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Deaths_Liberia</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-22</th>
<td>0.0</td>
</tr>
<tr>
<th>2014-03-24</th>
<td>0.0</td>
</tr>
<tr>
<th>2014-03-25</th>
<td>0.0</td>
</tr>
<tr>
<th>2014-03-26</th>
<td>0.0</td>
</tr>
<tr>
<th>2014-03-27</th>
<td>0.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>3471.0</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>3496.0</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>3496.0</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>0.0</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>122 rows × 1 columns</p>
</div>
The mask method is the opposite of where, and rewrites the second argument with the one that satisfies the condition.
```python
df[['Deaths_Liberia']].mask(df['Deaths_Liberia'] > 3000, 0)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Deaths_Liberia</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-22</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-24</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-25</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-26</th>
<td>NaN</td>
</tr>
<tr>
<th>2014-03-27</th>
<td>6.0</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>0.0</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>0.0</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>0.0</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>NaN</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>122 rows × 1 columns</p>
</div>
## How to use null
Data often contains nulls, so for proper data analysis, we need to know how many nulls are in the data and how much they affect the analysis.
With `isnull`, you can create a table with the null part set to false.
```python
df.isnull()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
<th>season</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-03-22</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2014-03-24</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2014-03-25</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2014-03-26</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2014-03-27</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>2014-12-31</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2015-01-02</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2015-01-03</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2015-01-04</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2015-01-05</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
</tr>
</tbody>
</table>
<p>122 rows × 17 columns</p>
</div>
You can also use the sum method to count the number of nulls in each column.
```python
df.isnull().sum()
```
Day 0
Cases_Guinea 29
Cases_Liberia 39
Cases_SierraLeone 35
Cases_Nigeria 84
Cases_Senegal 97
Cases_UnitedStates 104
Cases_Spain 106
Cases_Mali 110
Deaths_Liberia 41
Deaths_SierraLeone 35
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
season 0
dtype: int64
Similarly, you can use the mean method to get the average.
```python
df.isnull().mean()
```
Day 0.000000
Cases_Guinea 0.237705
Cases_Liberia 0.319672
Cases_SierraLeone 0.286885
Cases_Nigeria 0.688525
Cases_Senegal 0.795082
Cases_UnitedStates 0.852459
Cases_Spain 0.868852
Cases_Mali 0.901639
Deaths_Liberia 0.336066
Deaths_SierraLeone 0.286885
Deaths_Nigeria 0.688525
Deaths_Senegal 0.819672
Deaths_UnitedStates 0.852459
Deaths_Spain 0.868852
Deaths_Mali 0.901639
season 0.000000
dtype: float64
Rewrites the data of Null. Use the method `fillna`.
```python
df.fillna(value={'Cases_Liberia': 0.0, 'Deaths_Liberia': 0.0}, inplace=True)
df.isnull().sum()
```
Day 0
Cases_Guinea 29
Cases_Liberia 0
Cases_SierraLeone 35
Cases_Nigeria 84
Cases_Senegal 97
Cases_UnitedStates 104
Cases_Spain 106
Cases_Mali 110
Deaths_Liberia 0
Deaths_SierraLeone 35
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
season 0
dtype: int64
Now indeed, the number of nulls in Cases_Liberia and Death_Liberia is zero.
We can also delete rows that have nulls in certain columns, which can be seen by comparing the number of data before and after applying dropna. Before the deletion, the number of rows with
```python
df.shape
```
(122, 17)
The result is as follows. After the deletion, the following shows that the file has indeed been deleted.
```python
df.dropna(subset=['Cases_Nigeria'], axis=0).shape
```
(38, 17)
```python
df.dropna(subset=['Cases_Nigeria'], axis=0).isnull().sum()
```
Day 0
Cases_Guinea 2
Cases_Liberia 0
Cases_SierraLeone 0
Cases_Nigeria 0
Cases_Senegal 13
Cases_UnitedStates 21
Cases_Spain 23
Cases_Mali 27
Deaths_Liberia 0
Deaths_SierraLeone 0
Deaths_Nigeria 0
Deaths_Senegal 16
Deaths_UnitedStates 21
Deaths_Spain 23
Deaths_Mali 27
season 0
dtype: int64
```python
df.dropna(subset=['Cases_Nigeria'], axis=0).head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Day</th>
<th>Cases_Guinea</th>
<th>Cases_Liberia</th>
<th>Cases_SierraLeone</th>
<th>Cases_Nigeria</th>
<th>Cases_Senegal</th>
<th>Cases_UnitedStates</th>
<th>Cases_Spain</th>
<th>Cases_Mali</th>
<th>Deaths_Liberia</th>
<th>Deaths_SierraLeone</th>
<th>Deaths_Nigeria</th>
<th>Deaths_Senegal</th>
<th>Deaths_UnitedStates</th>
<th>Deaths_Spain</th>
<th>Deaths_Mali</th>
<th>season</th>
</tr>
<tr>
<th>YYYYMMDD</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2014-07-23</th>
<td>123</td>
<td>427.0</td>
<td>249.0</td>
<td>525.0</td>
<td>0.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>129.0</td>
<td>224.0</td>
<td>0.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>秋</td>
</tr>
<tr>
<th>2014-07-27</th>
<td>126</td>
<td>460.0</td>
<td>329.0</td>
<td>533.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>156.0</td>
<td>233.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>秋</td>
</tr>
<tr>
<th>2014-07-30</th>
<td>129</td>
<td>472.0</td>
<td>391.0</td>
<td>574.0</td>
<td>3.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>227.0</td>
<td>252.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>秋</td>
</tr>
<tr>
<th>2014-08-01</th>
<td>132</td>
<td>485.0</td>
<td>468.0</td>
<td>646.0</td>
<td>4.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>255.0</td>
<td>273.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>秋</td>
</tr>
<tr>
<th>2014-08-04</th>
<td>135</td>
<td>495.0</td>
<td>516.0</td>
<td>691.0</td>
<td>9.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>282.0</td>
<td>286.0</td>
<td>1.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>秋</td>
</tr>
</tbody>
</table>
</div>
## Renaming a column or index
We have already seen above that we often want to rename a column or index, so we use the rename method.
Use the ``python
df.rename(columns={'before': 'after'}, inplace=True)
df.rename(index={'before': 'after'}, inplace=True)
SQL like methods
SQL’s familiar groupby is available in pandas. I personally use this one a lot.
df.groupby(['season'])['season'].count()
season
spring 12
summer 35
autumn 33
winter 42
Name: season, dtype: int64
Output to CSV
Outputs all data stored in the memory.
df.to_csv('./out.csv')
!head -n 10 out.csv
YYYYMMDD,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali,season
2014-03-22,0,49.0,0.0,,,,,,,0.0,,,,,,,春
2014-03-24,2,86.0,0.0,,,,,,,0.0,,,,,,,春
2014-03-25,3,86.0,0.0,,,,,,,0.0,,,,,,,春
2014-03-26,4,86.0,0.0,,,,,,,0.0,,,,,,,春
2014-03-27,5,103.0,8.0,6.0,,,,,,6.0,5.0,,,,,,春
2014-03-28,6,112.0,3.0,2.0,,,,,,3.0,2.0,,,,,,春
2014-03-29,7,112.0,7.0,,,,,,,2.0,,,,,,,春
2014-03-31,9,122.0,8.0,2.0,,,,,,4.0,2.0,,,,,,春
2014-04-01,10,127.0,8.0,2.0,,,,,,5.0,2.0,,,,,,夏
df.to_csv('./out.csv', columns=['Deaths_Liberia'])
!head -n 10 out.csv
YYYYMMDD,Deaths_Liberia
2014-03-22,0.0
2014-03-24,0.0
2014-03-25,0.0
2014-03-26,0.0
2014-03-27,6.0
2014-03-28,3.0
2014-03-29,2.0
2014-03-31,4.0
2014-04-01,5.0
You can disable the header and index.
df.to_csv('./out.csv', header=False, index=False)
!head -n 10 out.csv
0,49.0,0.0,,,,,,,0.0,,,,,,,春
2,86.0,0.0,,,,,,,0.0,,,,,,,春
3,86.0,0.0,,,,,,,0.0,,,,,,,春
4,86.0,0.0,,,,,,,0.0,,,,,,,春
5,103.0,8.0,6.0,,,,,,6.0,5.0,,,,,,春
6,112.0,3.0,2.0,,,,,,3.0,2.0,,,,,,春
7,112.0,7.0,,,,,,,2.0,,,,,,,春
9,122.0,8.0,2.0,,,,,,4.0,2.0,,,,,,春
10,127.0,8.0,2.0,,,,,,5.0,2.0,,,,,,夏
13,143.0,18.0,2.0,,,,,,7.0,2.0,,,,,,夏
Other additions
Type conversion
Overwrite with a type. This is the notation for batch conversions. inplace is missing and I spent a bit of time on it, so I’m writing it down.
test = pd.DataFrame({'max':[1], 'min':[2], 'mean':[1.5]})
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 max 1 non-null int64
1 min 1 non-null int64
2 mean 1 non-null float64
dtypes: float64(1), int64(2)
memory usage: 152.0 bytes
test = test.astype({'max': float, 'min': float, 'mean': float})
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 max 1 non-null float64
1 min 1 non-null float64
2 mean 1 non-null float64
dtypes: float64(3)
memory usage: 152.0 bytes
Frequently used functions
As a final summary, here is a list of frequently used functions. It is like a personal snipet.
Change the index (to the name of an existing column)
df.set_index('xxxx')
Change column names
df.rename(columns={'before': 'after'}, inplace=True)
df.rename(index={'before': 'after'}, inplace=True)
Sort by a column
df.sort_values(by='xxx', ascending=True)
Sort by index
df.sort_index()
Type conversion for datetime types
df.to_datetime()
Number of NaN’s per column
df.isnull().sum()
References.
- Cheat sheet
- It explains all the arguments of read_csv B0%E3%82%92%E4%BD%BF%E3%81%84%E3%81%93%E3%81%AA%E3%81%99/)
- Pandas basic operations frequently encountered in data analysis This is a very good book that explains the important parts at a practical level. It is very helpful.