[python] Using pandas reindex and date_range to fill in missing time series data
library python pandas
Published : 2022-03-30   Lastmod : 2022-07-17

Use pandas reindex and date_range to fill in missing time series data

When analyzing data from e-commerce site sales, there are cases where data for holidays is missing. Since missing dates can be inconvenient when analyzing, pandas reindex and date_range are used to fill in the missing data with some values such as 0.

github

  • The file in jupyter notebook format on github is here

google colaboratory

  • If you want to run it on google colaboratory, here

Execution environment

sw_vers
ProductName: macOS
ProductVersion: 11.6.7
BuildVersion: 20G630
!python -V
Python 3.8.13
Python 3.8.13

Prepare a DataFrame with time-series data and data holes.

import pandas as pd

df = pd.DataFrame({
    'sales': [i + 1 for i in range(5)],
    'date': pd.to_datetime(['2022-07-01', '2022-07-02', '2022-07-05', '2022-07-06', '2022-07-09'])
})
df
salesdate
012022-07-01
122022-07-02
232022-07-05
342022-07-06
452022-07-09
This data is often seen when a store is closed on weekends or has regular holidays.
The purpose of this project is to fill in the holes in the dates of time-series data, which can be inconvenient when analyzing the data.

date_range

Pandas has a function called date_range that creates continuous daily data. All you have to do is to set start and end and specify frequency. If you set freq to 60min, you can create a list of datetime type every hour, and if you set 240min, you can create a list of datetime type every 4 hours.

pd.date_range('2022-07-01', '2022-07-02', freq='60min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 01:00:00',
               '2022-07-01 02:00:00', '2022-07-01 03:00:00',
               '2022-07-01 04:00:00', '2022-07-01 05:00:00',
               '2022-07-01 06:00:00', '2022-07-01 07:00:00',
               '2022-07-01 08:00:00', '2022-07-01 09:00:00',
               '2022-07-01 10:00:00', '2022-07-01 11:00:00',
               '2022-07-01 12:00:00', '2022-07-01 13:00:00',
               '2022-07-01 14:00:00', '2022-07-01 15:00:00',
               '2022-07-01 16:00:00', '2022-07-01 17:00:00',
               '2022-07-01 18:00:00', '2022-07-01 19:00:00',
               '2022-07-01 20:00:00', '2022-07-01 21:00:00',
               '2022-07-01 22:00:00', '2022-07-01 23:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='60T')
pd.date_range('2022-07-01', '2022-07-02', freq='240min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 04:00:00',
               '2022-07-01 08:00:00', '2022-07-01 12:00:00',
               '2022-07-01 16:00:00', '2022-07-01 20:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='240T')

reindex

Use date_range and reindex to fill in missing data. If there is a value in the set index, the reindex is assigned that value, and if there is no value, NaN is assigned. However, the data to be filled can also be specified by fill_value. In this case, it is filled with 0.

start_time = df['date'].tolist()[0].
end_time = df['date'].tolist()[-1].

time_list = pd.date_range(start_time, end_time, freq='1d')
df.set_index('date').reindex(time_list, fill_value=0)
sales
2022-07-011
2022-07-022
2022-07-030
2022-07-040
2022-07-053
2022-07-064
2022-07-070
2022-07-080
2022-07-095

Missing data such as 7/3 and 7/4 are filled with zeros.

It does not make sense, but the data can be reordered according to the reindex argument list, so time_list[::-1] can be used to reverse the order.

df.set_index('date').reindex(time_list[::-1], fill_value=0)
sales
2022-07-095
2022-07-080
2022-07-070
2022-07-064
2022-07-053
2022-07-040
2022-07-030
2022-07-022
2022-07-011

Neither date_range nor reindex are used very often, so don’t forget them.

Related Articles