pandas: Change the year based on the month in a DataFrame in Python

Posted by u3r8eeie on 2023-02-14 in Python


I have a dataset called weather that contains a column "Date", which looks like this:
| Date |
| --- |
| 2020-01-01 |
| 2020-01-02 |
| 2020-02-01 |
| 2020-02-04 |
| 2020-03-01 |
| 2020-04-01 |
| 2020-04-02 |
| 2020-04-03 |
| 2020-04-04 |
| 2020-05-01 |
| 2020-06-01 |
| 2020-07-01 |
| 2020-08-01 |
| 2020-09-01 |
| 2020-10-01 |
| 2020-11-01 |
| 2020-01-01 |
| 2020-02-01 |
| 2020-04-01 |
| 2020-05-01 |
| 2020-06-01 |
| 2020-07-01 |
| 2020-08-01 |
| 2020-09-01 |
| 2020-10-01 |
| 2020-11-01 |
| 2020-12-01 |
| 2020-01-01 |
The problem is that the year is always 2020, when it should be 2020, 2021 and 2022.
The desired column looks like this:
| Date |
| --- |
| 2020-01-01 |
| 2020-01-02 |
| 2020-02-01 |
| 2020-02-04 |
| 2020-03-01 |
| 2020-04-01 |
| 2020-04-02 |
| 2020-04-03 |
| 2020-04-04 |
| 2020-05-01 |
| 2020-06-01 |
| 2020-07-01 |
| 2020-08-01 |
| 2020-09-01 |
| 2020-10-01 |
| 2020-11-01 |
| 2021-01-01 |
| 2021-02-01 |
| 2021-04-01 |
| 2021-05-01 |
| 2021-06-01 |
| 2021-07-01 |
| 2021-08-01 |
| 2021-09-01 |
| 2021-10-01 |
| 2021-11-01 |
| 2021-12-01 |
| 2022-01-01 |
The last month of a year is not necessarily 12, but every new year starts with month 01.
Here is my code:

month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in range(len(weather['Date'])):
    year = 2022
    for j in range(len(month)):
        if weather['Date'][i][5:7] == '01':
            weather['Date'][i] = weather['Date'][i].apply(lambda x: 'year' + x[5:])

Any suggestions on how to fix my code and get the desired column?

pandas

Source: https://stackoverflow.com/questions/75430130/change-the-the-year-based-on-month-in-a-datafram-in-python

2 Answers


9rygscc11#

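Here is one approach:

Convert the date strings in the Date column to datetime with pd.to_datetime, then apply Series.diff and chain Series.dt.days.
Since each *negative* value (i.e. a negative number of days) marks the start of a new year, we apply Series.lt(0) to turn all values below 0 into True and the rest into False.
At this stage we chain Series.cumsum to end up with a Series containing 0, ..., 1, ..., 2. These are the values that need to be added to the year 2020 to get the correct years.
Finally, we can create the correct dates by passing (new_year = year + addition), month, day to pd.to_datetime again (see SO answer).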

import pandas as pd

# Parse the date strings so that .diff() and the .dt accessor are available
df['Date'] = pd.to_datetime(df['Date'])

df['Date'] = pd.to_datetime(dict(year=(df['Date'].dt.year 
                                       + df['Date'].diff().dt.days.lt(0).cumsum()), 
                                 month=df['Date'].dt.month, 
                                 day=df['Date'].dt.day))

df['Date']

0    2020-01-01
1    2020-01-02
2    2020-02-01
3    2020-02-04
4    2020-03-01
5    2020-04-01
6    2020-04-02
7    2020-04-03
8    2020-04-04
9    2020-05-01
10   2020-06-01
11   2020-07-01
12   2020-08-01
13   2020-09-01
14   2020-10-01
15   2020-11-01
16   2021-01-01
17   2021-02-01
18   2021-04-01
19   2021-05-01
20   2021-06-01
21   2021-07-01
22   2021-08-01
23   2021-09-01
24   2021-10-01
25   2021-11-01
26   2021-12-01
27   2022-01-01
Name: Date, dtype: datetime64[ns]
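
To see why the chained expression works, it can help to look at each intermediate step separately. The sketch below uses a made-up five-row sample (not the data from the question) and the hypothetical variable names day_diff, new_year and years_to_add; it only illustrates the diff / lt(0) / cumsum chain described above.

```python
import pandas as pd

# Hypothetical sample: the year is always 2020, but the dates jump backwards twice.
df = pd.DataFrame(
    {"Date": ["2020-01-01", "2020-11-01", "2020-01-01", "2020-12-01", "2020-01-01"]}
)
dates = pd.to_datetime(df["Date"])

# Day difference to the previous row: NaN for the first row,
# negative wherever a date is earlier than the one before it.
day_diff = dates.diff().dt.days

# True marks the first row of a new year (NaN compares as False).
new_year = day_diff.lt(0)

# Running count of year boundaries seen so far = number of years to add.
years_to_add = new_year.cumsum()

print(years_to_add.tolist())  # [0, 0, 1, 1, 2]
```

Adding years_to_add to the original year and passing year, month and day back to pd.to_datetime then gives the corrected dates.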

Of course, converting to datetime is not strictly *needed*; you could also rebuild the date strings directly, starting from this line:

df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
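
One way that line could be extended into a string-only solution is sketched below. It assumes the strings are always formatted as YYYY-MM-DD, and it uses a made-up sample plus the hypothetical names month, years_to_add and new_year. Note that comparing months alone will miss a year boundary if the month does not decrease across it (for example, a year that ends in January).

```python
import pandas as pd

# Hypothetical sample in the assumed 'YYYY-MM-DD' string format.
df = pd.DataFrame(
    {"Date": ["2020-01-01", "2020-11-01", "2020-01-01", "2020-12-01", "2020-01-01"]}
)

# Month as an integer; a drop from one row to the next signals a new year.
month = df["Date"].str[5:7].astype(int)
years_to_add = month.diff().lt(0).cumsum()

# Rebuild each string from the corrected year plus the original '-MM-DD' tail.
new_year = df["Date"].str[:4].astype(int) + years_to_add
df["Date"] = new_year.astype(str) + df["Date"].str[4:]

print(df["Date"].tolist())
```

The column stays a plain string column here, so any later date arithmetic would still need pd.to_datetime.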


osh3o9ms2#

Similar to @ouroboros1's answer, but this uses numpy to get the number of years to add to each date, and then pd.offsets.DateOffset(years=...) to do the addition.

import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
s = df['Date'].values
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]
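
As a quick illustration of what y contains, here is the same expression applied to a made-up five-value sample (an assumption for demonstration, not the data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: dates that jump backwards twice.
s = pd.to_datetime(
    pd.Series(["2020-01-01", "2020-11-01", "2020-01-01", "2020-12-01", "2020-01-01"])
).values

# s[:-1] > s[1:] is True wherever a date is later than the one that follows it.
# cumsum counts those drops; the leading 0 covers the first row,
# which has no predecessor to compare against.
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]

print(y.tolist())  # [0, 0, 1, 1, 2]
```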

At this point, it would be tempting to simply do:

df['Date'] += y * pd.offsets.DateOffset(years=1)

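But then we would get a warning: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.
So instead, we group by the number of years to add and apply the corresponding offset to all the dates in each group.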

def add_years(g):
    return g['Date'] + pd.offsets.DateOffset(years=g['y'].iloc[0])

df['Date'] = df.assign(y=y).groupby('y', sort=False, group_keys=False).apply(add_years)

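This is reasonably fast (4.25 ms for 1000 rows and 10 distinct y values) and, for other situations, more general than @ouroboros1's answer:
1. It handles day changes caused by leap years. This does not come up in your example, since none of the dates falls on February 29, but if one of them were '2020-02-29' and we tried to add 1 year using the construct dt = df['Date'].dt; pd.to_datetime(dict(year=dt.year + y, month=dt.month, ...)), we would get ValueError: cannot assemble the datetimes: day is out of range for month.
2. It preserves any time-of-day and timezone information (again, not relevant in your case, but in general one would want to keep it).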
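
Both points can be checked on a single timestamp. The following is a minimal sketch under assumed inputs: the timestamp, its time of day and the UTC timezone are made up for illustration, and the names leap_day, shifted and raised are hypothetical.

```python
import pandas as pd

# Hypothetical leap-day timestamp with a time of day and a timezone.
leap_day = pd.Timestamp("2020-02-29 08:30", tz="UTC")

# DateOffset falls back to the last valid day of the month
# and keeps the time of day and the timezone.
shifted = leap_day + pd.offsets.DateOffset(years=1)
print(shifted)  # 2021-02-28 08:30:00+00:00

# Assembling the date from components fails, because 2021-02-29 does not exist.
raised = False
try:
    pd.to_datetime(pd.DataFrame({"year": [2021], "month": [2], "day": [29]}))
except ValueError as exc:
    raised = True
    print(exc)
```

If silently moving February 29 to February 28 is not acceptable for your data, the ValueError from the component-based construction may actually be the preferable behavior.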

