Deduplicate pandas dataset by index value without using `networkx`

Posted by pxq42qpu on 2023-08-01 in Other


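Please note that I have already looked at this link:

Pandas and python: deduplication of dataset by several fields

Update, July 18: My takeaway is that all of these solutions point to avoiding the index until all deduplication has been done. Thanks to everyone who has replied so far.

I would like each value of id to keep only one row per distinct value of the code field.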

import pandas as pd

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
df.index.name='id'

df:

| id | code |
| -- | ---- |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 3 | A |

My desired output is:

| id | code |
| -- | ---- |
| 1 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 3 | A |

I managed to do this as follows, *but I don't like it*.

i=df.index.name
df.reset_index().drop_duplicates().set_index(i)

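Here is why:

* It fails if the index has no name.
* I should not have to reset and then set the index again.
* This is a fairly common operation, and this spills too much ink on it.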
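
The first point can be reproduced with a small sketch. It assumes the same data as in the question but leaves the index unnamed, so `df.index.name` is `None` and the final `set_index` call has no column to look up. The exact exception type may vary between pandas versions, so both `KeyError` and `TypeError` are caught here.

```python
import pandas as pd

# Same data as in the question, but the index is deliberately left unnamed
df = pd.DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])

name = df.index.name  # None, because no name was assigned

try:
    # reset_index() creates a column called "index", but we ask for column None
    df.reset_index().drop_duplicates().set_index(name)
    failed = False
except (KeyError, TypeError) as exc:
    failed = True
    print(f"set_index({name!r}) failed: {exc!r}")
```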

What I would like to write is:
df.groupby('id').drop_duplicates()
which is not currently supported.
Is there a more Pythonic way to do this?

pandas

Source: https://stackoverflow.com/questions/76688653/deduplicate-pandas-dataset-by-index-value-without-using-networkx

6 Answers

rpppsulh1#

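To drop the duplicates efficiently with .groupby, group by both the code column and the index, and keep only the first row of each group: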

from pandas import DataFrame

df = DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])
deduped = df.groupby(by=["code", df.index]).head(1)
print(deduped)
#   code
# 1    A
# 1    B
# 2    C
# 3    D
# 3    A

This answer builds on an existing answer, which also proposes several additional alternatives.
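
One further alternative, sketched here without any claim that it appears in the referenced answer, avoids both `reset_index` and `groupby`. It temporarily appends `code` to the index and uses `Index.duplicated` to build a boolean mask, so it does not depend on the index having a name. The variable names `is_repeat` and `deduped` are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])

# Temporarily append "code" to the index so each row is identified by (index, code),
# then flag every repeat of a pair after its first occurrence.
is_repeat = df.set_index("code", append=True).index.duplicated(keep="first")

# Keep only the first occurrence of each (index, code) pair; the original index is untouched
deduped = df[~is_repeat]
print(deduped)
```

Because the mask is applied to the original frame, the row order and the index name (or lack of one) stay as they were.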


7xzttuei2#

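When you create a DataFrame and pass a list as the index, the index name is always None, which is an object of type NoneType and not a string. The name only differs if you pass a named object as the index, for example a pd.Series whose name has been set.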

import pandas as pd

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
print(df.index.name) # -> None

# You need to specify name otherwise it will default to None, <class NoneType>
index = pd.Series(data=[1,1,1,2,3,3], name='INDEX_NAME')
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)
print(df.index.name) # -> 'INDEX_NAME'

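Now back to your question. When you create a DataFrame from a CSV file, you need to specify index_col to get an index from the file, and if that column has a name, it becomes the index name. The CSV may have no name in that position, only an empty string, in which case the index has no name and it is None. If you do not specify index_col, there is again no name: it is None, and None is not a string, it is <class 'NoneType'>.
Example: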

import io

import pandas as pd

csv_string = ',A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

# Without specifying 'index_col' parameter
df = pd.read_csv(io.StringIO(csv_string))
print(df)
'''
Output:

   Unnamed: 0  A  B  C
0           0  1  2  3
1           1  4  5  6
2           2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>

# By specifying index_col
df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
'''
print(type(df.index.name)) # <class 'NoneType'>
# This is because in the first column, on the first row, there is an empty string

# Let's change that to a non-empty string
csv_string = 'index,A,B,C\n0,1,2,3\n1,4,5,6\n2,7,8,9'

df = pd.read_csv(io.StringIO(csv_string), index_col=0)
print(df)
'''
Output:

       A  B  C
index         
0      1  2  3
1      4  5  6
2      7  8  9
'''
print(df.index.name, type(df.index.name)) # index <class 'str'>

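When you create the DataFrame the way you did, or as in the examples I have shown, you will always know the index name.

How to do it without an index name:
* First approach (probably the best)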

index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code']).set_index('index')

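This is similar to yours, because when the index has no name, .reset_index() names the new column "index". There is also an inplace parameter, in case you want to modify the original variable df instead of getting a copy back.

* Second approach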

index = pd.Series(data=[1,1,1,2,3,3])
df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=index)

modified_df = df.reset_index().drop_duplicates(['index', 'code'])
modified_df.index = modified_df['index']
modified_df = modified_df.drop(columns=['index'])

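Similarly, .drop() has an inplace parameter in case you want to modify the original. With inplace=True it returns None, otherwise it returns a copy, so you should not assign the return value to anything when using inplace.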
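
A minimal sketch of the inplace behaviour, assuming the same unnamed index as above. Each call modifies `df` directly and returns `None`, which is why writing `df = df.drop(..., inplace=True)` would leave `df` set to `None`. The names `r1`, `r2` and `r3` exist only to make the return values visible.

```python
import pandas as pd

df = pd.DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])

# Each inplace call mutates df and returns None
r1 = df.reset_index(inplace=True)                       # adds an "index" column
r2 = df.drop_duplicates(["index", "code"], inplace=True)
r3 = df.set_index("index", inplace=True)                # moves "index" back into the index

print(r1, r2, r3)
print(df)
```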

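**Note:** After modifying the DataFrame this way, df.index.name will be "index" even if the index originally had no name. If you do not want an index name, you can simply assign None to it.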
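
The note can be folded into a small helper. This is only a sketch under the assumption of a single-level index; `dedupe_keep_index_name` is a hypothetical name, not a pandas function. It remembers the original index name, uses whatever column `reset_index()` creates, and restores the name afterwards.

```python
import pandas as pd

def dedupe_keep_index_name(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that repeat an (index value, row values) combination.

    Hypothetical helper; assumes a single-level index.
    """
    original_name = df.index.name        # may be None
    flat = df.reset_index()
    key = flat.columns[0]                # the column that reset_index() inserted first
    result = flat.drop_duplicates().set_index(key)
    result.index.name = original_name    # undo the automatic "index" name if there was none
    return result

df = pd.DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])
deduped = dedupe_keep_index_name(df)
print(deduped)
print(deduped.index.name)
```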


xuo3flqw3#

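To add to your current approach:
1. An unnamed index becomes a column named "index" after reset_index.
1. As a second step, you can set the index to the first column.

Here is an example: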

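import pandas as pd

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
# reset_index() returns a new DataFrame, so the result has to be assigned
df = df.reset_index()
df = df[~df.duplicated(keep="first")]
# Set the index from the first column, looked up by name
df = df.set_index(df.columns.to_list()[0])
#....or by position (this keeps the column in the frame as well):
# df = df.set_index(df.iloc[:,0])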


nhn9ugyo4#

This is a shorter version of the first option given by @luzede:

from pandas import DataFrame

df = DataFrame({"id": [1, 1, 1, 2, 3, 3], "code": ["A", "A", "B", "C", "D", "A"]})
deduped = df.drop_duplicates(subset=["id", "code"])
print(deduped)
#    id code
# 0   1    A
# 2   1    B
# 3   2    C
# 4   3    D
# 5   3    A

Note that, for simplicity, the dataframe is constructed with "id" as a separate column (this produces the same result as the index-based approach in the question's snippet).

fwzugrvs5#

Here is one way to get the result:

import pandas as pd

df = pd.DataFrame({'code':['A','A','B','C','D','A']},index=[1,1,1,2,3,3])
# Copy the index values into a regular 'id' column
df['id'] = df.reset_index()['index'].values
print(df)
#   code  id
# 1    A   1
# 1    A   1
# 1    B   1
# 2    C   2
# 3    D   3
# 3    A   3

# Grouping by both columns collapses duplicate (code, id) pairs
output = df.groupby(by=['code','id']).max().reset_index()
print(output)
#   code  id
# 0    A   1
# 1    A   3
# 2    B   1
# 3    C   2
# 4    D   3
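
The groupby call above sorts the result by its keys, which is why the rows come back in a different order than in the question. A minimal sketch, assuming the original order matters, passes `sort=False` so that groups appear in order of first occurrence. It uses `size()` to show how many rows each pair had; the column name `n_rows` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"code": ["A", "A", "B", "C", "D", "A"]}, index=[1, 1, 1, 2, 3, 3])
df["id"] = df.index  # copy the index values into a regular column

# sort=False keeps the groups in order of first appearance;
# size() counts the rows that fell into each (id, code) group
counts = (
    df.groupby(["id", "code"], sort=False)
    .size()
    .reset_index(name="n_rows")
)
print(counts)
```

Any group with `n_rows` greater than 1 corresponds to a duplicated (id, code) pair in the input.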


tpgth1q76#

import pandas as pd

df = pd.DataFrame({'code': ['A', 'A', 'B', 'C', 'D', 'A']}, index=[1, 1, 1, 2, 3, 3])
df.index.name = 'id'
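# drop_duplicates() compares column values only and ignores the index,
# so the mask is built from the index and the 'code' column together
df = df[~df.reset_index().duplicated(keep='first').to_numpy()]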

print(df)
