使用另一个数据框架作为查找表替换Pandas数据框架中的值

wbgh16ku  于 2022-12-02  发布在  其他
关注(0)|答案(1)|浏览(154)

I'm looking to replace values in a Dataframe with the values in a second Dataframe by matching the values in the first Dataframe with the columns from the second Dataframe.
Example:

import numpy as np
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03', '2003-05-04'])
df = pd.DataFrame({'A':[1,1,3,12], 'B':[12,1,3,3], 'C':[3,12,12,1]}, index = dt_index)
df2 = pd.DataFrame({1:[1.4,4.2,1.3,5.6], 12:[2.3,7.3,9.5,0.4], 3:[8.8,0.1,8.7,2.4], 4:[9.6,9.8,5.5,1.8]}, index = dt_index)
df =             
             A   B   C
2003-05-01   1  12   3
2003-05-02   1   1  12
2003-05-03   3   3  12
2003-05-04  12   3   1

df2 = 
             1    12   3    4 
2003-05-01  1.4  2.3  8.8  9.6
2003-05-02  4.2  7.3  0.1  9.8
2003-05-03  1.3  9.5  8.7  5.5
2003-05-04  5.6  0.4  2.4  1.8

Expected output:

expect = pd.DataFrame({'A':[1.4,4.2,8.7,0.4], 'B':[2.3,4.2,8.7,2.4], 'C':[8.8,7.3,9.5,5.6]}, index = dt_index)
expect = 
              A    B    C
2003-05-01  1.4  2.3  8.8
2003-05-02  4.2  4.2  7.3
2003-05-03  8.7  8.7  9.5
2003-05-04  0.4  2.4  5.6

Attempt:

X = df.copy()
for i in np.unique(df):
    X.mask(df == i, df2[i], axis=0, inplace=True)

My attempt seems to work but I'm not sure if it has any pitfalls and how it would scale as the sizes of the Dataframe increase.
Are there better or faster solutions?

EDIT:

After cottontail's helpful answer, I realised I've made an oversimplification in my example. The values in df and columns of df and df2 cannot be assumed to be sequential.
I've now modified the example to reflect that.

ht4b089n

ht4b089n1#

一种方法是使用stack()df2重新整形为Series,并且使用df中的值将reindex()重新整形为Series;使用unstack()将其重新整形回原始形状。

tmp = df2.stack().reindex(df.stack().droplevel(-1).items())
tmp.index = pd.MultiIndex.from_arrays([tmp.index.get_level_values(0), df.columns.tolist()*len(df)])
df = tmp.unstack()

另一种方法是迭代地创建形状像df2的虚拟 Dataframe ,将其乘以df2,将其缩减为Series(使用sum()),并将其分配给形状像df的空 Dataframe 。

X = pd.DataFrame().reindex_like(df)
df['dummy'] = 1

for c in X:
    X[c] = (
        df.groupby([df.index, c])['dummy'].size()
        .unstack(fill_value=0)
        .reindex(df2.columns, axis=1, fill_value=0)
        .mul(df2)
        .sum(1)
    )

相关问题