[python] Filling Missing Time Series Data Using pandas reindex and date

Filling Missing Time Series Data Using pandas reindex and date_range

When analyzing sales data from an e-commerce site, there might be missing data for holidays. During analysis, it can be inconvenient to have missing dates, so we use pandas’ reindex and date_range to fill these gaps with a specific value, such as 0.

github

The Jupyter notebook file is available on github here

google colaboratory

To run on Google Colaboratory, use this link

Execution Environment

!sw_vers

ProductName:	macOS
ProductVersion:	11.6.7
BuildVersion:	20G630

!python -V

Python 3.8.13

Prepare a DataFrame with missing time series data.

import pandas as pd

df = pd.DataFrame({
    'sales': [i + 1 for i in range(5)],
    'date': pd.to_datetime(['2022-07-01', '2022-07-02', '2022-07-05', '2022-07-06', '2022-07-09'])
})
df

	sales	date
0	1	2022-07-01
1	2	2022-07-02
2	3	2022-07-05
3	4	2022-07-06
4	5	2022-07-09

Data like this is common for stores that are closed on weekends or have specific days off. Missing dates in time series data can be inconvenient during analysis, so our goal is to fill these gaps.

date_range

pandas has a function called date_range that creates a series of continuous daily data. You can set the start and end dates and specify the frequency. For example, setting freq to ‘60min’ creates a datetime list at hourly intervals, and ‘240min’ creates one at 4-hour intervals.

pd.date_range('2022-07-01', '2022-07-02', freq='60min')

DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 01:00:00',
               '2022-07-01 02:00:00', '2022-07-01 03:00:00',
               '2022-07-01 04:00:00', '2022-07-01 05:00:00',
               '2022-07-01 06:00:00', '2022-07-01 07:00:00',
               '2022-07-01 08:00:00', '2022-07-01 09:00:00',
               '2022-07-01 10:00:00', '2022-07-01 11:00:00',
               '2022-07-01 12:00:00', '2022-07-01 13:00:00',
               '2022-07-01 14:00:00', '2022-07-01 15:00:00',
               '2022-07-01 16:00:00', '2022-07-01 17:00:00',
               '2022-07-01 18:00:00', '2022-07-01 19:00:00',
               '2022-07-01 20:00:00', '2022-07-01 21:00:00',
               '2022-07-01 22:00:00', '2022-07-01 23:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='60T')

pd.date_range('2022-07-01', '2022-07-02', freq='240min')

DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 04:00:00',
               '2022-07-01 08:00:00', '2022-07-01 12:00:00',
               '2022-07-01 16:00:00', '2022-07-01 20:00:00',
               '2022-07-02 00:00:00'],
              dtype='datetime64[ns]', freq='240T')

reindex

Using date_range and reindex, we can fill in missing data. reindex assigns values where the index is set and assigns NaN where values are missing. You can specify the fill value for missing data using the fill_value parameter. In this case, we’ll fill with 0.

start_time = df['date'].tolist()[0]
end_time = df['date'].tolist()[-1]

time_list = pd.date_range(start_time, end_time, freq='1d')
df.set_index('date').reindex(time_list, fill_value=0)

	sales
2022-07-01	1
2022-07-02	2
2022-07-03	0
2022-07-04	0
2022-07-05	3
2022-07-06	4
2022-07-07	0
2022-07-08	0
2022-07-09	5

Missing data for 7/3 and 7/4, etc., has been filled with 0.

Although it doesn’t make much sense, you can reorder the data based on the list provided to reindex. By using time_list[::-1], you can reverse the order.

df.set_index('date').reindex(time_list[::-1], fill_value=0)

	sales
2022-07-09	5
2022-07-08	0
2022-07-07	0
2022-07-06	4
2022-07-05	3
2022-07-04	0
2022-07-03	0
2022-07-02	2
2022-07-01	1

Since both date_range and reindex are not used frequently, make sure to remember how to use them.

[python] Filling Missing Time Series Data Using pandas reindex and date_range