pandas动态日期过滤器和计算3个新列的基础上过去

ou6hu8tu  于 2023-04-28  发布在  其他
关注(0)|答案(1)|浏览(101)

我有两个dataframe,如下所示

transaction_df = pd.DataFrame({'cust_name': ['ABC','ABC','ABC','ABC','ABC','ABC'],
                        'partner':['A','A','A','B','C','C'],
                        'part_no':['P1','P2','P1','P2','P3','P3'],
                        'transaction_date':['21/05/2021','21/05/2022','21/01/2023','21/09/2020','11/05/2022','18/05/2022'],
                        'qty':[100,100,600,150,320,410]})
transaction_df['transaction_date'] = pd.to_datetime(data_df['transaction_date'])

# Create proj_df
proj_df = pd.DataFrame({'proj_id':[1,2,3,4],
                        'part_no':['P1','P2','P3','P4'],
                        'partner':['A','A','F','C'],
                        'last_purchase_date':['11/07/2022','19/09/2021','20/04/2023','27/09/2020'],
                        'cust_name': ['ABC','ABC','ABC','ABC']})
proj_df['last_purchase_date'] = pd.to_datetime(proj_df['last_purchase_date'])

我的目标是做到以下几点
a)在proj_df中创建3个新列-cust_total_transcust_part_total_transcust_part_partner_total_trans
此列中的每一列都基于以下逻辑进行计算
cust_total_trans-计算在proj_df中为每个proj_id提供的last_purchase_date之前每个客户的交易总数
cust_part_total_trans-计算在proj_df中为每个proj_id提供的last_purchase_date之前每个客户和部件组合的事务处理总数
cust_part_partner_total_trans-计算在proj_df中为每个proj_id提供的last_purchase_date之前,每个客户、部件和合作伙伴组合的交易总数
所以,我尝试了下面的

merged_df = transaction_df.merge(proj_df, on="cust_name")

last_purchase_dates = merged_df.groupby("proj_id")["last_purchase_date"].max()

# Compute cust_total_transactions
cust_total_trans = merged_df.groupby("cust_name").size().reset_index(name="cust_total_trans")

# Compute cust_part_total_trans
cust_part_total_trans = merged_df.groupby(["cust_name", "part_no_x"]).size().reset_index(name="cust_part_total_trans")

# Compute cust_part_partner_total_trans
cust_part_partner_total_trans = merged_df.groupby(["cust_name", "part_no_x", "partner_x"]).size().reset_index(name="cust_part_partner_total_trans")

但是这并不能帮助我得到预期的输出。我如何应用before last purchase date标准。这意味着,我只需要计算它们是否在项目的最后购买日期之前。
我希望我的输出像下面这样。按日期列的升序排序。我的真实的交易数据是10万行的大数据,项目数据是10 K行。所以,任何高效和优雅的解决方案都是有帮助的

tf7tbtn2

tf7tbtn21#

我有个办法

import pandas as pd
import numpy as np

transaction_df = pd.DataFrame({'cust_name': ['ABC','ABC','ABC','ABC','ABC','ABC'],
                        'partner':['A','A','A','B','C','C'],
                        'part_no':['P1','P2','P1','P2','P3','P3'],
                        'transaction_date':['21/05/2021','21/05/2022','21/01/2023','21/09/2020','11/05/2022','18/05/2022'],
                        'qty':[100,100,600,150,320,410]})
transaction_df['transaction_date'] = pd.to_datetime(transaction_df['transaction_date'], dayfirst=True)

# Create proj_df
proj_df = pd.DataFrame({'proj_id':[1,2,3,4],
                        'part_no':['P1','P2','P3','P4'],
                        'partner':['A','A','F','C'],
                        'last_purchase_date':['11/07/2022','19/09/2021','20/04/2023','27/09/2020'],
                        'cust_name': ['ABC','ABC','ABC','ABC']})
proj_df['last_purchase_date'] = pd.to_datetime(proj_df['last_purchase_date'], dayfirst=True)

merge = pd.merge(proj_df, transaction_df, on='cust_name')
merge = merge[merge['last_purchase_date']>=merge['transaction_date']]
#setting index is necessary for value_counts() to match correctly
proj_df = proj_df.set_index('proj_id')
proj_df['cust_total_trans'] = merge['proj_id'].value_counts()

proj_df = proj_df.reset_index()
merge = pd.merge(proj_df, transaction_df, on=['cust_name', 'part_no'])
merge = merge[merge['last_purchase_date']>=merge['transaction_date']]
#setting index is necessary for value_counts() to match correctly
proj_df = proj_df.set_index('proj_id')
proj_df['cust_part_total_trans'] = merge['proj_id'].value_counts()
proj_df['cust_part_total_trans'] = proj_df['cust_part_total_trans'].replace(np.NaN, 0).astype(int)

proj_df = proj_df.reset_index()
merge = pd.merge(proj_df, transaction_df, on=['cust_name', 'part_no', 'partner'])
merge = merge[merge['last_purchase_date']>=merge['transaction_date']]
#setting index is necessary for value_counts() to match correctly
proj_df = proj_df.set_index('proj_id')
proj_df['cust_part_partner_total_trans'] = merge['proj_id'].value_counts()
proj_df['cust_part_partner_total_trans'] = proj_df['cust_part_partner_total_trans'].replace(np.NaN, 0).astype(int)
proj_df.sort_index()

正如你所看到的,这三个任务的代码非常相似。我们只是在merge/join键中添加了更多的键来解释“for each X,Y,Z”。
由于合并给了我们日期,我们可以过滤只保留符合条件的日期。

相关问题