pandas: drop duplicates in a subset of columns per row, rowwise, keeping only the first copy, only if every selected column holds the same value

Posted by uz75evzq on 2023-02-27 in Other


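This is another extension of my earlier questions, "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy" and "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if there are 3 or more duplicates".
I have the following DataFrame (the real one has about 7 million rows):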

import numpy as np
import pandas as pd

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

df = pd.DataFrame(data)

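For a selection of columns, I want to drop the duplicates and keep only one copy if, and only if, every value in the selection is a duplicate.
That is, if my selection has 4 columns, all 4 columns must hold the same number for the row to count as duplicated.
If only 2 or 3 of the 4 selected columns match, the row does not count.
In the example above, if my selection is ['x6', 'y', 'uap', 'ubp'], the output should be: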

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [np.nan, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, 75.33, np.nan],
        'ubp': [np.nan, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

The second row should not be touched, because one of its columns differs.
How can I achieve this?

pandas

Source: https://stackoverflow.com/questions/75560915/drop-duplicates-in-a-subset-of-columns-per-row-rowwise-only-keeping-the-first

2 Answers


bvjxkvbb1#

If you want to require that all selected values are duplicates, you can use:

selection = ['x6', 'y', 'uap', 'ubp']

# compare all values to the first one
m = df[selection].eq(df[selection[0]], axis=0)

# if all are duplicates, mask them except the first
df.loc[m.all(axis=1), selection[1:]] = np.nan

Output:

date       x1          x2       x3        x4        x5        x6  x7   v      y  ay  by  cy  gy    uap    ubp   sf
0  2023-02-22  descx1a  ALSFNHF950      NaN       NaN       NaN    404.29 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN
1  2023-02-21  descx1b  KLUGUIF615      NaN       NaN       NaN     75.21 NaN NaN  75.33 NaN NaN NaN NaN  75.33  75.33  2.0
2  2023-02-23  descx1c         NaN  24319.4  24334.15  24040.11  24220.34 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN

Intermediates:

m
     x6      y    uap    ubp
0  True   True   True   True  # all True = duplicate
1  True  False  False  False
2  True  False  False  False

m.all(axis=1)
0     True
1    False
2    False
dtype: bool
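
To experiment with the steps above in isolation, here is a minimal self-contained sketch on a reduced frame that holds only the four selected columns from the question. The helper name `mask_full_duplicates` is hypothetical, and returning a copy instead of modifying `df` in place is an assumption made for the example, not part of the answer.

```python
import numpy as np
import pandas as pd

def mask_full_duplicates(frame, selection):
    """Return a copy where rows whose selected values all equal the first
    selected column keep only that first value (the rest become NaN)."""
    first = frame[selection[0]]
    # True only when every selected column equals the first one in that row
    all_equal = frame[selection].eq(first, axis=0).all(axis=1)
    out = frame.copy()
    out.loc[all_equal, selection[1:]] = np.nan
    return out

# Reduced version of the question's data (selected columns only)
small = pd.DataFrame({
    'x6': [404.29, 75.21, 24220.34],
    'y': [404.29, 75.33, np.nan],
    'uap': [404.29, 75.33, np.nan],
    'ubp': [404.29, 75.33, np.nan],
})

result = mask_full_duplicates(small, ['x6', 'y', 'uap', 'ubp'])
print(result)
```

Because `NaN` never compares equal to anything, rows containing `NaN` in the selection are never flagged by this test.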

Precision

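Note that with floating-point values, numbers that look identical may not compare equal. In that case it can be safer to compute the mask with a tolerance (`np.isclose` returns a NumPy boolean array, so `m.all(axis=1)` still works):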

import numpy as np
m = np.isclose(df[selection], df[[selection[0]]])
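
To illustrate why the tolerance matters, the following sketch compares the exact and the tolerant mask on a made-up frame (`demo` and its column names are assumptions for the example). It uses the default tolerances of `np.isclose`; whether those defaults suit your data is something to verify.

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is stored as 0.30000000000000004, which is not exactly 0.3
demo = pd.DataFrame({'a': [0.3, 1.0], 'b': [0.1 + 0.2, 1.0], 'c': [0.3, 2.0]})
cols = ['a', 'b', 'c']

# Exact comparison against the first selected column
exact = demo[cols].eq(demo[cols[0]], axis=0).all(axis=1)

# Tolerant comparison; the (n, 1) first column broadcasts against (n, 3)
close = np.isclose(demo[cols].to_numpy(), demo[[cols[0]]].to_numpy()).all(axis=1)

print(exact.tolist())  # exact test misses row 0
print(close.tolist())  # tolerant test flags row 0

# Blank the duplicates using the tolerant mask
demo.loc[close, cols[1:]] = np.nan
```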

Answered 2023-02-27

eoigrqb62#

You can do:

selection = ['x6', 'y', 'uap', 'ubp']

# Check whether all values across the selected columns are the same:
# the difference between each column and its left neighbour must be 0.
# The first column has no left neighbour (its diff is NaN), so skip it.
# Note: OR-ing forward and backward diffs would also accept rows such as
# [a, a, b, b], so only the forward diff is tested here.
m = df[selection].diff(axis='columns').iloc[:, 1:].eq(0).all(axis=1)

# For the rows flagged by the mask, blank every selected column except the first
df.loc[m, selection[1:]] = np.nan
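
As a sanity check on difference-based masks, the sketch below compares a forward-only diff test with one that ORs forward and backward diffs, using a row of paired values `[1, 1, 2, 2]`. The frame `pairs` and the variable names are made up for illustration.

```python
import pandas as pd

pairs = pd.DataFrame({
    'p': [1.0, 5.0],
    'q': [1.0, 5.0],
    'r': [2.0, 5.0],
    's': [2.0, 5.0],
})

fwd = pairs.diff(axis='columns')        # each column minus its left neighbour
bwd = pairs.diff(-1, axis='columns')    # each column minus its right neighbour

# OR of both directions: every value only needs ONE equal neighbour
both_dirs = (fwd.eq(0) | bwd.eq(0)).all(axis=1)

# Forward only, skipping the first column (its diff is always NaN)
forward_only = fwd.iloc[:, 1:].eq(0).all(axis=1)

print(both_dirs.tolist())     # row 0 is flagged although 1 != 2
print(forward_only.tolist())  # only the truly constant row 1 is flagged
```

With four or more selected columns, the forward-only form is the one that matches the "every column identical" requirement from the question.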

Answered 2023-02-27
