numpy 如何比较两个大型 Dataframe 来检查新条目?

ogq8wdun  于 2023-06-29  发布在  其他
关注(0)|答案(5)|浏览(118)

我有两个pandas Dataframe (2020年上半年和2020年下半年)与客户,以及他们购买的产品。有没有一种有效的方法来找出哪些顾客在第二部分买了什么,哪些顾客在第一部分没有买?
H1_2020:
| 产品展示| Product |
| --| ------------ |
| A| A |
| B| B |
| A| A |
| A| A |
h2_2020:
| 产品展示| Product |
| --| ------------ |
| A| A |
| B| B |
| C| C |
| A| A |
因此,基于上述两个表的输出应为:
| 买了一个新的东西在半年2?||
| --| ------------ |
| 0| 0 |
| 1| 1 |
| 1| 1 |

hl0ma9xz

hl0ma9xz1#

使用pd.merge。如果您在h2_2020中有新客户,他们将出现在最终输出中:

>>> pd.merge(h1_2020, h2_2020, on=['Customer', 'Product'], indicator=True, how='right') \
      .query("_merge == 'right_only'") \
      .value_counts('Customer') \
      .reindex(set(h1_2020['Customer'].unique()).union(h2_2020['Customer'].unique()), fill_value=0)

Customer
1    0
2    1
3    1
dtype: int64
a64a0gku

a64a0gku2#

您可以使用设置的差异来比较这些值

# dfh1: Dataframe for h1 and dfh2 is the DataFrame for h2
   product_by_customer = dict()
   customers = df_h1['Customer'].unique()
   for customer in customers:
       products_in_h1 = set(dfh1[dfh1['customer']==customer]['product'].values)
       products_in_h2 = set(dfh2[dfh2['customer']==customer]['product'].values)
       product_by_customer[customer] = list(products_in_h2.difference(products_in_h1))
   print(product_by_customer)
wmomyfyw

wmomyfyw3#

加入并检查哪个记录仅来自h2_2020

h1h2 = h1_2020.merge(h2_2020, how='outer', indicator=True)

#   Customer Product      _merge
#0         1       A        both
#1         1       B   left_only
#2         2       A   left_only
#3         3       A        both
#4         2       B  right_only
#5         3       C  right_only

(h1h2.groupby('Customer')['_merge']
     .agg(lambda g: g.eq('right_only').sum())
     .rename('Newly Bought')
     .reset_index())

#   Customer  Newly Bought
#0         1             0
#1         2             1
#2         3             1
r55awzrz

r55awzrz4#

如果两个数据集的客户集相同,则可以执行以下操作:

h2_2020 = h2_2020.drop_duplicates()
h1_2020 = h1_2020.drop_duplicates()

h2_2020.groupby('Customer').apply(lambda x: len(x)) > h1_2020.groupby('Customer').apply(lambda x: len(x))
4ngedf3f

4ngedf3f5#

我更改了dataframe h2_2020,添加了客户4,以防新客户进入

# module

import pandas as pd
import numpy as np

# make a dataset

h1_2020={'customer':[1,1,2,3],
         'product':'A B A A'.split()
         }

h2_2020={'customer':[1,2,3,3,4],
         'product':'A B C A A'.split()
         }

h1_2020=pd.DataFrame(h1_2020)
print(h1_2020)
h2_2020=pd.DataFrame(h2_2020)
print(h2_2020)

所以我结果有其他列指示客户是否是半2中的新客户。

# drop_duplicates

dh1_2020=h1_2020.drop_duplicates()
dh2_2020=h2_2020.drop_duplicates()

# append and drop_duplicates to see what products customers bought during year

a=dh2_2020.append(dh1_2020).drop_duplicates().sort_values(['customer','product'])
print(a)

# check whether customers bought new product in half 2 

check=a.merge(dh1_2020, on=['customer','product'],indicator=True,how='outer')
print(check) 
# => 'left_only' means that the customer bought new product in half 2, but customer 4 is new customer in half 2

'''
#make a dummy column indicating a customer bought new product or not 
and the other dummy column indicating if a customer is new customer in half 2 or not

'''
m1=pd.DataFrame(dh1_2020.customer.drop_duplicates())
print(m1)
m2=check.loc[check['_merge']=='left_only',['customer','_merge']].drop_duplicates()
print(m2)

whatwewant=m1.merge(m2,on='customer',how='outer', indicator='new_customer_hf2')
whatwewant['_merge']=1*(whatwewant['_merge']=='left_only')
whatwewant['new_customer_hf2']=1*(whatwewant['new_customer_hf2']=='right_only')
print(whatwewant)

#   customer  _merge  new_customer_hf2
#0         1       0                 0
#1         2       1                 0
#2         3       1                 0
#3         4       1                 1

相关问题