Pandas在 Dataframe 中每隔n行递增一次,条件和groupby

ndasle7k  于 2023-03-11  发布在  其他
关注(0)|答案(2)|浏览(153)

这是我当前 Dataframe (df 1)的一个简化示例,在将嵌套循环计算的结果合并回客户数据之后。我的实际 Dataframe 是1000万行以上,所以我处理的是大数据,我更喜欢最快的方法。

df1 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023','28-02-2023'],
                    "period":[2,2,2,2,6,6,6,6],})

我试图找到最有效的方法来完成以下工作;按每个用户id,id喜欢,
在列名period中每隔3行递增1在列名month_end中每隔3行递增1(到下一个month_end日期)添加一个计算编号列以标记计算(calc_num)。
我的预期输出(df 2)

df2 = pd.DataFrame({"id": ['z111','z111','z111','z111','z112','z112','z112','z112'], #customer data
                    "calc_amt": [1000,500,200,300,100,50,30,200],
                    "month_end":['28-02-2023','28-02-2023','31-03-2023','31-03-2023','28-02-2023','28-02-2023','31-03-2023','31-03-2023'],
                    "period":[2,2,3,3,6,6,7,7],
                    "calc_num":[1,2,1,2,1,2,1,2],})
zbq4xfa0

zbq4xfa01#

您可以使用groupby.cumcount来枚举每组的行,然后使用modulo或floor除法:

N = 2 # periodicity

# ensure datetime
df1['month_end'] = pd.to_datetime(df1['month_end'])

# enumerate rows
c = df1.groupby('id').cumcount()

df1['period'] += c.floordiv(N)
df1['calc_num'] = c.mod(N).add(1)
df1['month_end'] += np.array([pd.offsets.MonthEnd(x) for x in c.floordiv(2)])
  • 注意:如果要创建新的 Dataframe ,请首先运行df2 = df1.copy(),然后使用df2。*

输出:

id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z111       400 2023-04-30       4         1
5  z111       450 2023-04-30       4         2
6  z112       100 2023-02-28       6         1
7  z112        50 2023-02-28       6         2
8  z112        30 2023-03-31       7         1
9  z112       200 2023-03-31       7         2
q3qa4bjr

q3qa4bjr2#

使用GroupBy.cumcount作为计数器,整数和模除以2,最后加上下个月,并通过Serie.dt.to_period转换为月周期:

df1['month_end'] = pd.to_datetime(df1['month_end'])

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = (df1['month_end'].dt.to_period('m') + 
                              g // 2).dt.to_timestamp(how='e').dt.normalize())

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2

或者对offsets.MonthEnd使用lsit理解:

df1['month_end'] = pd.to_datetime(df1['month_end'])

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = [x + pd.offsets.MonthEnd(y) for x , y 
                              in zip(df1['month_end'], g // 2)])

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2

由于使用的是大型DataFrame,因此下面是有效添加月份的技巧-技巧是将整数除以2,然后减去一天,再添加下一个月:

df1['month_end'] = pd.to_datetime(df1['month_end'])

g = df1.groupby('id').cumcount()

df2 = df1.assign(period = df1['period'] + g // 2,
                 calc_num = g % 2 + 1,
                 month_end = df1['month_end'].values.astype('datetime64[M]') + 
                             np.array(g.to_numpy() // 2 + 1, dtype='timedelta64[M]') - 
                             np.array([1], dtype='timedelta64[D]')
                             )

print (df2)
     id  calc_amt  month_end  period  calc_num
0  z111      1000 2023-02-28       2         1
1  z111       500 2023-02-28       2         2
2  z111       200 2023-03-31       3         1
3  z111       300 2023-03-31       3         2
4  z112       100 2023-02-28       6         1
5  z112        50 2023-02-28       6         2
6  z112        30 2023-03-31       7         1
7  z112       200 2023-03-31       7         2

相关问题