Remove observations beyond the i-th duplicated observation in pandas [duplicate]

Posted by i7uaboj4 on 2023-05-12 in Other


This question already has answers here:

Pandas get topmost n records within each group (6 answers)
Closed 5 days ago.
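Suppose I have a DataFrame with columns a and b.

I want to allow, say, 100 duplicated (a, b) pairs. That is, if there are 200 rows with a=1 and b=2, I want to keep 100 of them.
I cannot use duplicated on a GroupBy DataFrame, so I don't know how to solve this.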

pandas

Source: https://stackoverflow.com/questions/76190481/remove-the-the-obersvations-which-is-more-than-the-ith-duplicated-observation-p

4 Answers


2w3kk1z5 #1

# n: number of duplicates to keep
df.groupby(['a', 'b'], as_index=False).head(n)
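As a minimal sketch of how this behaves, assuming a small hypothetical frame (the question's own example data is not shown) and a cap of n = 2:

```python
import pandas as pd

# Hypothetical data: the pair (1, 2) appears four times, (2, 2) twice.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

n = 2  # number of rows to keep per (a, b) pair

# head(n) on a GroupBy keeps the first n rows of each group and
# leaves the original index and row order untouched.
capped = df.groupby(["a", "b"]).head(n)
print(capped)
```

Here the third and fourth rows of the (1, 2) pair are dropped, while the (2, 2) pair is kept whole because it has only two rows.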

Posted 2023-05-12

yquaqz18 #2

I believe you can do it like this:

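import pandas as pd

max_duplicates = 100  # the question allows 100 repeats per pair
group_cols = ['a', 'b']

# True for every row after the first occurrence of its (a, b) pair
duplicates = df.duplicated(subset=group_cols, keep='first')

# get groups of duplicated rows subsets
groups = df[duplicates].groupby(group_cols)

# join the first occurrences with the allowed number of duplicated rows from each group;
# each pair keeps up to max_duplicates + 1 rows, and sort_index restores the order of a default index
df_clean = pd.concat([groups.head(max_duplicates), df[~duplicates]]).sort_index()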
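One way to check how many rows per pair this approach retains is to count group sizes afterwards. The sketch below uses hypothetical data and a cap of 2. Because `duplicated(keep='first')` does not flag the first occurrence, a pair can end up with one row more than the cap.

```python
import pandas as pd

# Hypothetical data: five rows for the pair (1, 2), two rows for (2, 2).
df = pd.DataFrame({"a": [1] * 5 + [2] * 2, "b": [2] * 7, "c": range(7)})

max_duplicates = 2
group_cols = ["a", "b"]

# Flag every row after the first occurrence of its pair.
duplicates = df.duplicated(subset=group_cols, keep="first")

# Keep the capped duplicates plus the unflagged first occurrences.
kept = pd.concat([
    df[duplicates].groupby(group_cols).head(max_duplicates),
    df[~duplicates],
]).sort_index()

# Rows retained per (a, b) pair.
sizes = kept.groupby(group_cols).size()
print(sizes)
```

If exactly `max_duplicates` rows per pair are wanted, applying `head` directly to the full frame grouped by `group_cols` avoids the extra row.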

Posted 2023-05-12

tzcvj98z #3

One option is to group by a and b, compute a cumcount, and then filter. Example:

To keep the first 3 rows of each pair:

df[df.groupby(['a', 'b']).cumcount() <= 2]
   a  b  c
0  1  2  1
1  1  2  2
2  1  2  3
4  2  2  1
5  2  2  2
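The input frame for this output is not shown. The sketch below assumes one that is consistent with it. `cumcount` numbers the rows within each group starting from 0, so `<= 2` (equivalently `< 3`) keeps the first three.

```python
import pandas as pd

# Assumed input, reconstructed to match the output shown above.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2],
    "b": [2, 2, 2, 2, 2, 2],
    "c": [1, 2, 3, 4, 1, 2],
})

# Zero-based position of each row within its (a, b) group.
position = df.groupby(["a", "b"]).cumcount()

n = 3  # rows to keep per pair
result = df[position < n]
print(result)
```

Keeping the position as a separate Series makes it easy to reuse, for example to inspect which rows would be dropped with `df[position >= n]`.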

Posted 2023-05-12

q8l4jmvw #4

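You can achieve this by combining the groupby method with the head method in pandas. Here is a solution that keeps only the first 100 occurrences of each pair of 'a' and 'b':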

import pandas as pd

# Your example DataFrame
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Set the number of duplicates you want to keep
num_dups_to_keep = 100

# Group the DataFrame by columns 'a' and 'b', and keep only the first 'num_dups_to_keep' rows for each group
result = df.groupby(['a', 'b']).head(num_dups_to_keep)

# Reset the index
result = result.reset_index(drop=True)

print(result)

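This snippet groups the DataFrame by columns 'a' and 'b' and keeps only the first 100 rows of each group. If a given pair occurs fewer than 100 times, all of its rows are kept.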
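With only five rows in the sample data, a cap of 100 leaves the frame unchanged. A smaller cap, here a hypothetical 2, makes the effect visible on the same data:

```python
import pandas as pd

# Same sample data as above: the pair (1, 2) occurs three times, (2, 3) twice.
data = {'a': [1, 1, 2, 2, 1], 'b': [2, 2, 3, 3, 2], 'c': [3, 3, 4, 4, 3]}
df = pd.DataFrame(data)

# Keep at most 2 rows per (a, b) pair, then renumber the index.
result = df.groupby(['a', 'b']).head(2).reset_index(drop=True)
print(result)
```

The third occurrence of the pair (1, 2), originally at index 4, is the only row removed.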

Posted 2023-05-12
