pandas 基于第二个DataFrame的公共列对一个DataFrame执行计算

lymnna71  于 2023-11-15  发布在  其他
关注(0)|答案(2)|浏览(111)

假设两个DataFrames具有相等的列但不同的行,即使示例长度相等。我能够找到两个DataFrame的共性。我想利用这种相似性对其中一个DataFrame执行计算,并将结果放入另一个DataFrame中。总之,我想更新的耐克产品的销售价格是10%高于阿迪达斯产品的约束,他们都在同一类别和大小。

percent = 1.10

data1 = {
    'item_id': [1001, 1002, 1003, 1004],
    'brand': ['Adidas', 'Adidas', 'Adidas', 'Adidas'],
    'category_id': [241, 241, 238, 717],
    'size': [8, 7.5, 9, 10],
    'cost_price': [8.02, 4.94, 1.49, 18.44],
    'unit_price': [12.89, 7.98, 2.44, 29.53]
}

data2 = {
    'item_id': [1005, 1006, 1007, 1008],
    'brand': ['Nike', 'Nike', 'Nike', 'Nike'],
    'category_id': [512, 241, 604, 717],
    'size': [7.5, 8, 9, 10],
    'cost_price': [35.90, 11.62, 15.03, 20.53],
    'unit_price': [48.14, 16.29, 21.09, 28.20]
}

adidas = pd.DataFrame(data1)
nike = pd.DataFrame(data2)

columns = ['category_id', 'size']
common = adidas [columns].merge(nike[columns])
common
   category_id  size
0          241   8.0
1          717  10.0

df = pd.concat([adidas, nike]).merge(common)
df
   item_id   brand  category_id  size  cost_price  unit_price
0     1001  Adidas          241   8.0        8.02       12.89
1     1006    Nike          241   8.0       11.62       16.29
2     1004  Adidas          717  10.0       18.44       29.53
3     1008    Nike          717  10.0       20.53       28.20

字符串
如果你是一本字典,我会说:

cat_match = []
for catid in data2['category_id']:
    try:
        index = data1['category_id'].index(catid)
        cat_match.append(index)
    except ValueError:
        pass

size_match = []
for size in data2['size']:
    try:
        index = data1['size'].index(size)
        size_match(index)
    except ValueError:
        pass

print(cat_match)
[0, 3]
print(size_match)
[1, 0, 2, 3]


然后,我将计算data1的“unit_cost”(又名销售价格),方法是在索引0和3处设置标记= unit_cost * percentage。接下来,我将在data2中搜索相同的索引,并创建一个名为“new_sell”的新行,其中包含计算出的标记,该标记将添加到data2字典中。
我也尝试了下面的代码从另一个SO答案,但我收到了一个错误。

df2['new_sell'] = df2.apply(lambda row: row['unit_price'] * 1.10 if row['category_id'] == df1['category_id'] and row['size'] == df1['size'] else None, axis=1)
...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


这两种尝试都相当混乱和复杂。有没有更简单的方法来实现以下使用pandas?

df2
   item_id   brand  category_id  size  cost_price  unit_price  sell_price
0     1006    Nike          241   8.0       11.62       16.29       14.18 # (12.89*1.1)
1     1008    Nike          717  10.0       20.53       28.20       32.48 # (29.53*1.1)

8tntrjer

8tntrjer1#

使用左-merge并乘以1.1(例如eval):

cols = ['category_id', 'size']
nike['sell_price'] = (nike[cols]
                      .merge(adidas[cols+ ['unit_price']],
                             on=cols, how='left')
                      .eval('unit_price*1.1') # or ['unit_price'].mul(1.1)
                     )

字符串
如果你不想修改nike,而是想要一个新的DataFrame:

cols = ['category_id', 'size']
out = (nike
       .merge(adidas[cols+ ['unit_price']]
              .rename(columns={'unit_price': 'sell_price'}),
              on=cols, how='left')
       .eval('sell_price=sell_price*1.1')
      )


输出量:

item_id brand  category_id  size  cost_price  unit_price  sell_price
0     1005  Nike          512   7.5       35.90       48.14         NaN
1     1006  Nike          241   8.0       11.62       16.29      14.179
2     1007  Nike          604   9.0       15.03       21.09         NaN
3     1008  Nike          717  10.0       20.53       28.20      32.483

3ks5zfa0

3ks5zfa02#

在这种情况下,我会使用多索引,这样你就可以循环使用类别/大小组合。这是我的解决方案:

shoes_df = pd.concat([adidas, nike])

shoes_df = shoes_df.set_index(["category_id", "size", "brand"])
shoes_df["new_sell"] = None

all_df = list()
all_brands = shoes_df.index.get_level_values("brand").unique()

for brand_index, brand_df in shoes_df.groupby(["category_id", "size"]):
    available_brands_in_group = brand_df.index.get_level_values("brand")
    missing_brands = all_brands.difference(available_brands_in_group)

    if not missing_brands.any():
        # both nike and adidas are available
        adidas_price = brand_df.loc[(slice(None), slice(None), "Adidas"), "unit_price"]
        nike_price_value = percent * adidas_price.values[0]
        brand_df.loc[(slice(None), slice(None), "Nike"), "new_sell"] = nike_price_value

    # collect all new brand_df with updated prices
    all_df.append(brand_df)

new_shoes_df = pd.concat(all_df)
print(new_shoes_df)

字符串
输出如下所示:

item_id  cost_price  unit_price new_sell
category_id size brand                                           
238         9.0  Adidas     1003        1.49        2.44     None
241         7.5  Adidas     1002        4.94        7.98     None
            8.0  Adidas     1001        8.02       12.89     None
                 Nike       1006       11.62       16.29   14.179
512         7.5  Nike       1005       35.90       48.14     None
604         9.0  Nike       1007       15.03       21.09     None
717         10.0 Adidas     1004       18.44       29.53     None
                 Nike       1008       20.53       28.20   32.483


如果你只想以同样的方式开始你可以做的耐克运动鞋:

nike_df = new_shoes_df.reset_index()
nike_df = nike_df[nike_df["brand"] == "Nike"]
nike_df = nike_df[adidas.columns.to_list() + ["new_sell"]].sort_values("item_id")


其产生:

item_id brand  category_id  size  cost_price  unit_price new_sell
4     1005  Nike          512   7.5       35.90       48.14     None
3     1006  Nike          241   8.0       11.62       16.29   14.179
5     1007  Nike          604   9.0       15.03       21.09     None
7     1008  Nike          717  10.0       20.53       28.20   32.483

相关问题