pandas: drop duplicates in a subset of columns per row, keeping only the first copy, and only when there are 3 or more duplicates

Posted by bvjxkvbb on 2023-02-27 in Other


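This is an extension of my previous question about dropping duplicates in a subset of columns per row, row-wise, keeping only the first copy. I also have a similar question with different requirements: "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if every column has the same duplicate".
I have the DataFrame below (the real one has about 7 million rows).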

import numpy as np
import pandas as pd

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.51, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

df = pd.DataFrame(data)

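If a number occurs three or more times within a row across the columns x3, x4, x5, x6, x7, v, y, ay, by, cy, gy, uap, ubp, I want to drop the duplicates and keep a single copy: either the first column in which it appears or, if possible, a column of my choosing.
The output should look like this: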

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.51, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, 75.33, np.nan],
        'ubp': [np.nan, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

The second row should not be affected, because it holds only 2 copies of the number.
The previous question received this answer:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']
df.loc[:, check] = df.loc[:, check].mask(df.loc[:, check].apply(pd.Series.duplicated, axis=1))
print(df)

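But if I do this, one of the 75.33 values is removed, which is not what I want.
I was thinking I could loop over the rows and replace the values, but I have more than 7 million rows. Any ideas?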
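The over-deletion described above can be reproduced on a cut-down frame. This is only an illustrative sketch: `demo` is a hypothetical four-column extract of the sample data, not the full DataFrame, and the masking is the same one used in the earlier answer.

```python
import numpy as np
import pandas as pd

# Hypothetical extract: just the columns that matter for the first two sample rows
demo = pd.DataFrame({'x6': [np.nan, 75.51],
                     'y': [404.29, np.nan],
                     'uap': [404.29, 75.33],
                     'ubp': [404.29, 75.33]})

# Same idea as the earlier answer: flag every repeat after the first, per row
dup = demo.apply(pd.Series.duplicated, axis=1)
masked = demo.mask(dup)
print(masked)
```

In the second row the value 75.33 appears only twice, yet the copy in `ubp` is blanked, because `duplicated` flags any repeat regardless of how many copies exist.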

pandas

Source: https://stackoverflow.com/questions/75560780/drop-duplicates-in-a-subset-of-columns-per-row-rowwise-only-keeping-the-first

1 Answer


i1icjdpr1#

You can stack the data, handle the duplicates there, and then unstack (pivot) it back:

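# long form: one record per (row label, column name, value)
s = df.iloc[:, 3:].stack().reset_index(name='value')
groups = s.groupby(['level_0', 'value'])
# number of copies of each value in its row, and each copy's position among them
counts, cumcounts = groups['value'].transform('size'), groups.cumcount()

# verify your condition here - logic might not work as expected:
# in a group of 3+ equal values it blanks every copy after the first,
# and it blanks x6 even when x6 holds the first copy
s.loc[counts.ge(3) & (s['level_1'].eq('x6') | (s['level_1'].ne('x6') & cumcounts.gt(0))), 'value'] = np.nan
# pivot requires keyword arguments in pandas >= 2.0
out = s.pivot(index='level_0', columns='level_1', values='value').reindex(columns=df.columns, index=df.index)
# the first three columns were left out of the stack, so copy them back
out[df.columns[:3]] = df[df.columns[:3]]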
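As a variation on the same stack-and-count idea, here is a minimal sketch that leaves the non-numeric columns untouched and applies only the "three or more copies" rule. The function name `blank_repeats`, its `min_count` parameter, and the small `demo` frame are all hypothetical, introduced here for illustration. It assumes the checked columns are numeric, that none of them is named `row`, and that values are compared by exact floating-point equality.

```python
import numpy as np
import pandas as pd

def blank_repeats(df, cols, min_count=3):
    """Within each row, blank all but the first copy (in `cols` order) of any
    value that occurs at least `min_count` times across `cols`."""
    # Long form: one record per (row position, column, value); NaN cells dropped
    long = (df[cols].reset_index(drop=True).rename_axis('row').reset_index()
            .melt(id_vars='row', var_name='col', value_name='value')
            .dropna(subset=['value']))
    groups = long.groupby(['row', 'value'])
    size = groups['value'].transform('size')  # copies of this value in its row
    order = groups.cumcount()                 # 0 for the first copy, then 1, 2, ...
    drop = long[(size >= min_count) & (order > 0)]

    # Write NaN into the flagged cells of a copy of the numeric block
    block = df[cols].to_numpy(dtype=float, copy=True)
    block[drop['row'].to_numpy(), pd.Index(cols).get_indexer(drop['col'])] = np.nan
    out = df.copy()
    out[cols] = block
    return out

# Hypothetical extract of the sample data
demo = pd.DataFrame({'date': ['2023-02-22', '2023-02-21'],
                     'x6': [np.nan, 75.51],
                     'y': [404.29, np.nan],
                     'uap': [404.29, 75.33],
                     'ubp': [404.29, 75.33]})

result = blank_repeats(demo, ['x6', 'y', 'uap', 'ubp'])
# Listing a preferred column first makes it the copy that is kept
prefer_ubp = blank_repeats(demo, ['ubp', 'x6', 'y', 'uap'])
print(result)
```

Because "first" is decided by the order of `cols`, putting a column at the front of that list is one way to choose which copy is kept. No Python-level loop runs over the rows, but the long-form frame holds one record per non-NaN cell, so on millions of rows it may be worth processing the frame in chunks.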

Answered 2023-02-27
