Correcting the week number of a datetime index

Introduction



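When working with time-series data in pandas, it is convenient to use a datetime index. However, when the week number is extracted from the index, dates in the final week of the year, such as 2019/12/30 and 12/31, end up as week 1 of 2019. This memo records how I looked into fixing that.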

Problem



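The last week of December and the first week of January of the following year can be the same week. There seem to be several ways of numbering weeks, but pandas follows the ISO-compliant European style: weeks start on Monday, and week 1 is the week that contains the year's first Thursday.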

Reference: 2019 weekly calendar and list of week numbers
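To make the ISO rule concrete, here is a small standard-library sketch. It relies on the fact that January 4 always falls in ISO week 1; the helper name `iso_week1_monday` is made up for this illustration and does not come from pandas.

```python
from datetime import date, timedelta


def iso_week1_monday(year: int) -> date:
    """Return the Monday that starts ISO week 1 of `year` (hypothetical helper)."""
    jan4 = date(year, 1, 4)  # January 4 is always inside ISO week 1
    return jan4 - timedelta(days=jan4.weekday())  # step back to that week's Monday


# ISO week 1 of 2020 already starts in December 2019.
start_2020 = iso_week1_monday(2020)
print(start_2020)

# Cross-check with the standard library's own ISO calendar.
iso_year, iso_week, iso_weekday = date(2019, 12, 30).isocalendar()
print(iso_year, iso_week, iso_weekday)
```

So Monday 2019-12-30 is where week 1 of 2020 begins, which is why the last two days of 2019 get week number 1.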


However, for 2019/12/30 and 12/31, df.index.year -> 2019 and df.index.week -> 1, so these dates are treated as week 1 of 2019, which is inconvenient for processing such as aggregating data by week.
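As a quick check of this behaviour, the following minimal sketch compares the calendar year with the ISO year and ISO week that pandas reports through `isocalendar()` (available since pandas 1.1). It uses three timestamps around the end of 2019; the variable names are only for illustration.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    ["2019-12-29 22:00:00", "2019-12-30 01:47:00", "2019-12-31 15:58:00"],
    tz="UTC",
    name="date",
)

iso = idx.isocalendar()  # DataFrame with the columns: year, week, day

calendar_years = idx.year.tolist()
iso_years = iso["year"].astype(int).tolist()
iso_weeks = iso["week"].astype(int).tolist()

# The ISO week belongs to the ISO year, which can differ from the calendar year.
for ts, cy, iy, iw in zip(idx, calendar_years, iso_years, iso_weeks):
    print(ts.date(), "calendar year:", cy, "ISO year:", iy, "ISO week:", iw)
```

The week number 1 is correct for ISO year 2020; the mismatch only appears when it is paired with the calendar year 2019.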

The following datetime index is used as an example.
df.index
DatetimeIndex(['2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:00:00+00:00',
               '2019-12-29 22:00:00+00:00', '2019-12-29 22:08:00+00:00',
               '2019-12-29 23:00:00+00:00', '2019-12-30 01:47:00+00:00',
               '2019-12-30 02:48:00+00:00', '2019-12-30 02:48:00+00:00',
               '2019-12-30 12:34:00+00:00', '2019-12-30 14:51:00+00:00',
               '2019-12-30 14:53:00+00:00', '2019-12-30 14:56:00+00:00',
               '2019-12-31 04:50:00+00:00', '2019-12-31 13:41:00+00:00',
               '2019-12-31 14:42:00+00:00', '2019-12-31 14:45:00+00:00',
               '2019-12-31 15:56:00+00:00', '2019-12-31 15:56:00+00:00',
               '2019-12-31 15:58:00+00:00', '2019-12-31 15:58:00+00:00'],
              dtype='datetime64[ns, UTC]', name='date', freq=None)

If a MultiIndex is set on this DataFrame as shown below, you can confirm that 12/29 falls in week 52, whereas 12/30 and 12/31 end up in week 1 of 2019.
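# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_w = df.set_index([df.index.year, df.index.month,
                     df.index.isocalendar().week, df.index])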
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index
MultiIndex([(2019, 12,  1, '2019-12-30 01:47:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 02:48:00+00:00'),
            (2019, 12,  1, '2019-12-30 12:34:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:51:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:53:00+00:00'),
            (2019, 12,  1, '2019-12-30 14:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 04:50:00+00:00'),
            (2019, 12,  1, '2019-12-31 13:41:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:42:00+00:00'),
            (2019, 12,  1, '2019-12-31 14:45:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:56:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12,  1, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00')],
           names=['year', 'month', 'week', 'date'])

Fix



The reference site had the solution I was looking for, and it fit my case.
1. Reset the index
2. Extract the week with dt.week
3. Force the week number to 52
4. Set the index again
df.reset_index(inplace=True)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
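# Series.dt.week was removed in pandas 2.0; dt.isocalendar().week is its replacement
df["week"] = df["date"].dt.isocalendar().week
# Force 2019-12-30 and 2019-12-31 into week 52 (hard-coded for this data set)
df["week"] = df["date"].apply(
    lambda x: 52 if x.year == 2019 and x.month == 12 and x.day in [30, 31] else x.week)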
df_w = df.set_index([df["year"], df["month"],
                     df["week"], df["date"]])
df_w.index.names = ['year', 'month', 'week', 'date']
df_w.sort_index(inplace=True)
df_w.index
MultiIndex([(2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:00:00+00:00'),
            (2019, 12, 52, '2019-12-29 22:08:00+00:00'),
            (2019, 12, 52, '2019-12-29 23:00:00+00:00'),
            (2019, 12, 52, '2019-12-30 01:47:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 02:48:00+00:00'),
            (2019, 12, 52, '2019-12-30 12:34:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:51:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:53:00+00:00'),
            (2019, 12, 52, '2019-12-30 14:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 04:50:00+00:00'),
            (2019, 12, 52, '2019-12-31 13:41:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:42:00+00:00'),
            (2019, 12, 52, '2019-12-31 14:45:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:56:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00'),
            (2019, 12, 52, '2019-12-31 15:58:00+00:00')],
           names=['year', 'month', 'week', 'date'])

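With this, the data can be aggregated by week within each year. For years other than 2019, the condition in the lambda has to be adjusted to the dates affected.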
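The fix shown here is tied to the specific dates of 2019. As one possible generalisation (a sketch that is not taken from the reference site), a December date that ISO assigns to week 1 can borrow the week number of the date seven days earlier. The helper name `week_in_calendar_year`, the sample timestamps, and the `value` column are made up for illustration.

```python
import pandas as pd


def week_in_calendar_year(ts: pd.Timestamp) -> int:
    """Hypothetical helper: ISO week number, except that December dates
    which ISO places in week 1 of the next year get the previous week's number."""
    iso_week = ts.isocalendar()[1]
    if ts.month == 12 and iso_week == 1:
        # Seven days earlier is still in the last ISO week of this calendar year.
        return (ts - pd.Timedelta(days=7)).isocalendar()[1]
    return iso_week


dates = pd.to_datetime(
    ["2019-12-29 22:00", "2019-12-30 01:47", "2019-12-31 15:58", "2020-01-02 09:00"],
    utc=True,
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=pd.DatetimeIndex(dates, name="date"))

df["year"] = df.index.year
df["week"] = [week_in_calendar_year(ts) for ts in df.index]

# Weekly aggregation keyed by calendar year and corrected week number.
weekly = df.groupby(["year", "week"])["value"].sum()
print(weekly)
```

This merges 12/30 and 12/31 into week 52 of 2019, as in the fix above. Early January dates that ISO assigns to week 52 or 53 of the previous year are not handled by this sketch.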

Reference site



The solution to the same problem was very helpful. Thanks!
Pandas - wrong week extracted week from date
