Filling Missing Time Series Data Using pandas reindex and date_range
When analyzing sales data from an e-commerce site, there might be missing data for holidays. During analysis, it can be inconvenient to have missing dates, so we use pandas’ reindex and date_range to fill these gaps with a specific value, such as 0.
github
- The Jupyter notebook file is available on github here
google colaboratory
- To run on Google Colaboratory, use this link
Execution Environment
!sw_vers
ProductName: macOS
ProductVersion: 11.6.7
BuildVersion: 20G630
!python -V
Python 3.8.13
Prepare a DataFrame with missing time series data.
import pandas as pd
df = pd.DataFrame({
'sales': [i + 1 for i in range(5)],
'date': pd.to_datetime(['2022-07-01', '2022-07-02', '2022-07-05', '2022-07-06', '2022-07-09'])
})
df
sales | date | |
---|---|---|
0 | 1 | 2022-07-01 |
1 | 2 | 2022-07-02 |
2 | 3 | 2022-07-05 |
3 | 4 | 2022-07-06 |
4 | 5 | 2022-07-09 |
Data like this is common for stores that are closed on weekends or have specific days off. Missing dates in time series data can be inconvenient during analysis, so our goal is to fill these gaps.
date_range
pandas has a function called date_range that creates a series of continuous daily data. You can set the start and end dates and specify the frequency. For example, setting freq to ‘60min’ creates a datetime list at hourly intervals, and ‘240min’ creates one at 4-hour intervals.
pd.date_range('2022-07-01', '2022-07-02', freq='60min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 01:00:00',
'2022-07-01 02:00:00', '2022-07-01 03:00:00',
'2022-07-01 04:00:00', '2022-07-01 05:00:00',
'2022-07-01 06:00:00', '2022-07-01 07:00:00',
'2022-07-01 08:00:00', '2022-07-01 09:00:00',
'2022-07-01 10:00:00', '2022-07-01 11:00:00',
'2022-07-01 12:00:00', '2022-07-01 13:00:00',
'2022-07-01 14:00:00', '2022-07-01 15:00:00',
'2022-07-01 16:00:00', '2022-07-01 17:00:00',
'2022-07-01 18:00:00', '2022-07-01 19:00:00',
'2022-07-01 20:00:00', '2022-07-01 21:00:00',
'2022-07-01 22:00:00', '2022-07-01 23:00:00',
'2022-07-02 00:00:00'],
dtype='datetime64[ns]', freq='60T')
pd.date_range('2022-07-01', '2022-07-02', freq='240min')
DatetimeIndex(['2022-07-01 00:00:00', '2022-07-01 04:00:00',
'2022-07-01 08:00:00', '2022-07-01 12:00:00',
'2022-07-01 16:00:00', '2022-07-01 20:00:00',
'2022-07-02 00:00:00'],
dtype='datetime64[ns]', freq='240T')
reindex
Using date_range and reindex, we can fill in missing data. reindex assigns values where the index is set and assigns NaN where values are missing. You can specify the fill value for missing data using the fill_value parameter. In this case, we’ll fill with 0.
start_time = df['date'].tolist()[0]
end_time = df['date'].tolist()[-1]
time_list = pd.date_range(start_time, end_time, freq='1d')
df.set_index('date').reindex(time_list, fill_value=0)
sales | |
---|---|
2022-07-01 | 1 |
2022-07-02 | 2 |
2022-07-03 | 0 |
2022-07-04 | 0 |
2022-07-05 | 3 |
2022-07-06 | 4 |
2022-07-07 | 0 |
2022-07-08 | 0 |
2022-07-09 | 5 |
Missing data for 7/3 and 7/4, etc., has been filled with 0.
Although it doesn’t make much sense, you can reorder the data based on the list provided to reindex. By using time_list[::-1], you can reverse the order.
df.set_index('date').reindex(time_list[::-1], fill_value=0)
sales | |
---|---|
2022-07-09 | 5 |
2022-07-08 | 0 |
2022-07-07 | 0 |
2022-07-06 | 4 |
2022-07-05 | 3 |
2022-07-04 | 0 |
2022-07-03 | 0 |
2022-07-02 | 2 |
2022-07-01 | 1 |
Since both date_range and reindex are not used frequently, make sure to remember how to use them.