Filling Missing Time Series Data Using pandas reindex and date_range

When analyzing sales data from an e-commerce site, there might be missing data for holidays. During analysis, it can be inconvenient to have missing dates, so we use pandas’ reindex and date_range to fill these gaps with a specific value, such as 0.

github

  • The Jupyter notebook file is available on github here

google colaboratory

Execution Environment

!sw_vers
ProductName:	macOS
ProductVersion:	11.6.7
BuildVersion:	20G630
!python -V
Python 3.8.13

Prepare a DataFrame with missing time series data.

import pandas as pd

df = pd.DataFrame({
    'sales': [i + 1 for i in range(5)],
    'date': pd.to_datetime(['2022-07-01', '2022-07-02', '2022-07-05', '2022-07-06', '2022-07-09'])
})
df
salesdate
012022-07-01
122022-07-02
232022-07-05
342022-07-06
452022-07-09

Data like this is common for stores that are closed on weekends or have specific days off. Missing dates in time series data can be inconvenient during analysis, so our goal is to fill these gaps.

date_range

pandas has a function called date_range that creates a series of continuous daily data. You can set the start and end dates and specify the frequency. For example, setting freq to ‘60min’ creates a datetime list at hourly intervals, and ‘240min’ creates one at 4-hour intervals.

pd.date_range('2022-07-01', '2022-07-02', freq='60min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 01:00:00',
               '2022-07-01 02:00:00', '2022-07-01 03:00:00',
               '2022-07-01 04:00:00', '2022-07-01 05:00:00',
               '2022-07-01 06:00:00', '2022-07-01 07:00:00',
               '2022-07-01 08:00:00', '2022-07-01 09:00:00',
               '2022-07-01 10:00:00', '2022-07-01 11:00:00',
               '2022-07-01 12:00:00', '2022-07-01 13:00:00',
               '2022-07-01 14:00:00', '2022-07-01 15:00:00',
               '2022-07-01 16:00:00', '2022-07-01 17:00:00',
               '2022-07-01 18:00:00', '2022-07-01 19:00:00',
               '2022-07-01 20:00:00', '2022-07-01 21:00:00',
               '2022-07-01 22:00:00', '2022-07-01 23:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='60T')
pd.date_range('2022-07-01', '2022-07-02', freq='240min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 04:00:00',
               '2022-07-01 08:00:00', '2022-07-01 12:00:00',
               '2022-07-01 16:00:00', '2022-07-01 20:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='240T')

reindex

Using date_range and reindex, we can fill in missing data. reindex assigns values where the index is set and assigns NaN where values are missing. You can specify the fill value for missing data using the fill_value parameter. In this case, we’ll fill with 0.

start_time = df['date'].tolist()[0]
end_time = df['date'].tolist()[-1]

time_list = pd.date_range(start_time, end_time, freq='1d')
df.set_index('date').reindex(time_list, fill_value=0)
sales
2022-07-011
2022-07-022
2022-07-030
2022-07-040
2022-07-053
2022-07-064
2022-07-070
2022-07-080
2022-07-095

Missing data for 7/3 and 7/4, etc., has been filled with 0.

Although it doesn’t make much sense, you can reorder the data based on the list provided to reindex. By using time_list[::-1], you can reverse the order.

df.set_index('date').reindex(time_list[::-1], fill_value=0)

sales

2022-07-09

5
2022-07-080
2022-07-070
2022-07-064
2022-07-053
2022-07-040
2022-07-030
2022-07-022
2022-07-011

Since both date_range and reindex are not used frequently, make sure to remember how to use them.