pandas: drop duplicates in a subset of columns per row, keeping only the first copy, and only when every selected column holds the same value

Posted by uz75evzq on 2023-02-27 in Other

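This is another extension of my earlier questions, "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise" and "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if there are 3 or more duplicates".
I have the following DataFrame (the real one has about 7 million rows):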

import numpy as np
import pandas as pd

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

df = pd.DataFrame(data)

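If every column in my selection holds the same value, and only then, I want to drop the duplicates and keep a single copy.
That is, if my selection has 4 columns, all 4 must contain the same number for the row to count as duplicated.
If only 2 or 3 of the 4 match, the row does not count.
In the example above, if my selection is ['x6', 'y', 'uap', 'ubp'], the output should be: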

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [np.nan, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, 75.33, np.nan],
        'ubp': [np.nan, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

The second row should not be touched, because one of its columns differs.
How can I achieve this?


Answer 1 (bvjxkvbb)

If you want to require that all selected values match, you can use:

selection = ['x6', 'y', 'uap', 'ubp']

# compare all values to the first one
m = df[selection].eq(df[selection[0]], axis=0)

# if all are duplicates, mask them except the first
df.loc[m.all(axis=1), selection[1:]] = np.nan

Output:

         date       x1          x2       x3        x4        x5        x6  x7   v      y  ay  by  cy  gy    uap    ubp   sf
0  2023-02-22  descx1a  ALSFNHF950      NaN       NaN       NaN    404.29 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN
1  2023-02-21  descx1b  KLUGUIF615      NaN       NaN       NaN     75.21 NaN NaN  75.33 NaN NaN NaN NaN  75.33  75.33  2.0
2  2023-02-23  descx1c         NaN  24319.4  24334.15  24040.11  24220.34 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN

Intermediates:

m
     x6      y    uap    ubp
0  True   True   True   True  # all True = duplicate
1  True  False  False  False
2  True  False  False  False

m.all(axis=1)
0     True
1    False
2    False
dtype: bool
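The two steps above can be wrapped in a small helper. This is only a sketch: the function name `keep_first_if_all_equal` and the `demo` frame are made up for illustration, and it returns a copy instead of modifying the input in place, which is a design choice not taken from the answer.

```python
import numpy as np
import pandas as pd


def keep_first_if_all_equal(df, selection):
    """Return a copy of df in which rows whose `selection` values are all
    equal keep only the first selected column; the others become NaN."""
    out = df.copy()
    # True where a selected cell equals the first selected column of its row
    same_as_first = out[selection].eq(out[selection[0]], axis=0)
    # A row qualifies only if every selected column matched (NaN never matches)
    all_equal = same_as_first.all(axis=1)
    out.loc[all_equal, selection[1:]] = np.nan
    return out


# Hypothetical demo data: only the selected columns of the question's frame
demo = pd.DataFrame({
    "x6": [404.29, 75.21, 24220.34],
    "y": [404.29, 75.33, np.nan],
    "uap": [404.29, 75.33, np.nan],
    "ubp": [404.29, 75.33, np.nan],
})
result = keep_first_if_all_equal(demo, ["x6", "y", "uap", "ubp"])
print(result)
```

Only row 0 is changed here; row 1 is left alone because `x6` differs from the other three columns.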

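Precision

Note that with floating-point values, numbers that look identical may not compare equal. In that case it can be safer to compute the mask as follows. np.isclose returns a NumPy array rather than a DataFrame, but m.all(axis=1) works on it the same way: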

import numpy as np
m = np.isclose(df[selection], df[[selection[0]]])
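To illustrate why the tolerance-based mask can matter, here is a minimal sketch on made-up data (the `demo` frame and its column names are assumptions, not part of the question). The value `0.1 + 0.2` prints like `0.3` at low precision but is not exactly equal to it, so the exact comparison misses the row while `np.isclose` with its default tolerances catches it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is "all equal" only up to floating-point error
demo = pd.DataFrame({
    "a": [0.3, 1.0],
    "b": [0.1 + 0.2, 1.0],
    "c": [0.3, 2.0],
})
selection = ["a", "b", "c"]

# Exact comparison against the first selected column
exact = demo[selection].eq(demo[selection[0]], axis=0).all(axis=1)

# Tolerance-based comparison; the first column is reshaped to (n, 1)
# so that it broadcasts across all selected columns
first = demo[selection[0]].to_numpy()[:, None]
close = np.isclose(demo[selection].to_numpy(), first).all(axis=1)

# Blank out everything but the first selected column in matching rows
demo.loc[close, selection[1:]] = np.nan
print(exact.tolist(), close.tolist())
```

Whether the default tolerances of `np.isclose` are appropriate depends on the scale of the data; they can be adjusted through its `rtol` and `atol` parameters.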

Answer 2 (eoigrqb6)

You can do:

selection = ['x6', 'y', 'uap', 'ubp']

# Flag rows whose selected columns all hold the same value: a cell passes if
# its difference to the left or to the right neighbour is 0, and a row is
# flagged when every cell passes.
# Caveat: a row such as [a, a, b, b] also passes, because each cell equals
# at least one neighbour.
m = (df[selection].diff(axis='columns').eq(0) |
     df[selection].diff(-1, axis='columns').eq(0)).all(axis=1)

# Then select such rows you found by above mask and the columns other than the first one - assign them np.nan
df.loc[m, selection[1:]] = np.nan
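One thing worth checking with the two-directional `diff` mask is a row made of equal pairs, such as `[1, 1, 2, 2]`: each cell equals one of its neighbours, so the row is flagged even though the four values are not all the same. The sketch below uses made-up data (the `demo` frame and its column names are assumptions) to show this, next to a one-directional variant that ignores the first column's empty diff and requires every remaining adjacent difference to be 0. That variant is a suggested alternative, not the answer's own code.

```python
import pandas as pd

# Hypothetical rows: all equal, two equal pairs, and one odd value
demo = pd.DataFrame({
    "p": [1.0, 1.0, 1.0],
    "q": [1.0, 1.0, 2.0],
    "r": [1.0, 2.0, 2.0],
    "s": [1.0, 2.0, 2.0],
})
selection = ["p", "q", "r", "s"]

# Mask from the answer: each cell must equal its left OR right neighbour
two_sided = (demo[selection].diff(axis="columns").eq(0) |
             demo[selection].diff(-1, axis="columns").eq(0)).all(axis=1)

# One-directional variant: every cell after the first must equal the cell
# to its left, which holds only when all values in the row are equal
one_sided = demo[selection].diff(axis="columns").iloc[:, 1:].eq(0).all(axis=1)

print(two_sided.tolist())
print(one_sided.tolist())
```

For the question's sample data both masks agree; they differ only on rows with the paired pattern shown in the second row here.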
