How can I efficiently filter duplicate objects out of a list in Python based on multiple attributes?

Posted by xriantvc on 2023-05-05 in Python


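I'm working on a Python project with a list of custom objects, and I need to filter out duplicates based on more than one attribute. Each object has three attributes: id, name, and timestamp. I want to treat an object as a duplicate if both its id and its name match those of another object in the list. The timestamp attribute should not be considered when deciding what counts as a duplicate.
Here is an example of the custom object class: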

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

And an example list of objects:

data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

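In this example, I want to remove the duplicates and, for each group of duplicates, keep the object with the earliest timestamp.
The expected output is: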

[
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(3, "Eve", "2023-01-04"),
]

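I know I could use a loop to compare every object in the list against every other object, but I'm worried about performance, especially as the list grows. Is there a more efficient way to do this in Python, perhaps with built-in functions or a library?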

python

Source: https://stackoverflow.com/questions/76158635/how-to-efficiently-filter-out-duplicate-objects-in-a-list-based-on-multiple-prop

4 Answers


q43xntqr1#

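import pandas as pd


class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp


data2 = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

# vars(obj) returns the instance's attribute dict, giving one row per object
df = pd.DataFrame([vars(f) for f in data2])

# sort_values returns a new DataFrame, so the result must be assigned back;
# sorting by timestamp puts the earliest row of each (id, name) group first
df = df.sort_values(['id', 'name', 'timestamp'])
# keep='first' then retains the earliest row of each group
df = df.drop_duplicates(subset=['id', 'name'], keep='first')
print(df)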

   id   name   timestamp
0   1  Alice  2023-01-01
1   2    Bob  2023-01-02
3   3    Eve  2023-01-04

Posted 2023-05-05

h9vpoimq2#

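You can use a dictionary keyed on the (id, name) pair to track the unique objects, replacing the stored object whenever one with an earlier timestamp turns up. This needs only a single pass over the list, so it should be more efficient than nested loops: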

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __repr__(self):
        return f"CustomObject({self.id}, {self.name}, {self.timestamp})"

data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

unique_objects = {}
for obj in data:
    key = (obj.id, obj.name)
    if key not in unique_objects or obj.timestamp < unique_objects[key].timestamp:
        unique_objects[key] = obj

filtered_data = list(unique_objects.values())

print(filtered_data)
# Output: [CustomObject(1, Alice, 2023-01-01), CustomObject(2, Bob, 2023-01-02), CustomObject(3, Eve, 2023-01-04)]
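
The same dictionary pattern can be wrapped in a reusable helper. The following is only a sketch: the function name `dedupe_keep_min` and its `key` and `rank` parameters are hypothetical names chosen for illustration, not part of the answer above. It relies on dictionaries preserving insertion order (Python 3.7+), so the result follows the order in which each key first appears.

```python
from operator import attrgetter


class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __repr__(self):
        return f"CustomObject({self.id}, {self.name}, {self.timestamp})"


def dedupe_keep_min(items, key, rank):
    """For each distinct key(item), keep the item with the smallest rank(item).

    Hypothetical helper: `key` picks the attributes that define a duplicate,
    `rank` picks the value used to choose which duplicate survives.
    """
    best = {}
    for item in items:
        k = key(item)
        # Store the first item seen for a key, or replace it with a lower-ranked one
        if k not in best or rank(item) < rank(best[k]):
            best[k] = item
    return list(best.values())


data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

# attrgetter("id", "name") returns an (id, name) tuple, which is hashable
filtered = dedupe_keep_min(
    data,
    key=attrgetter("id", "name"),
    rank=attrgetter("timestamp"),
)
print(filtered)
```

Passing the key and rank as functions keeps the class itself untouched, which may be preferable when the class is shared with code that expects default equality.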

Posted 2023-05-05

ibps3vxo3#

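Modify the class so that its instances can be used in a set. Equality and hashing must both be based on the same attributes, here the (id, name) pair:

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp
    def __eq__(self, other):
        # Two objects are duplicates when both id and name match
        if not isinstance(other, CustomObject):
            return NotImplemented
        return (self.id, self.name) == (other.id, other.name)
    def __hash__(self):
        # Hash the same attributes that __eq__ compares
        return hash((self.id, self.name))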

Now you can create a set from the list:

set(data)

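A set keeps the first of any equal objects it encounters, so if the data is not already sorted by date, sort it by timestamp first. Note that a set does not preserve the order of the list.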
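
As a rough sketch of the sort-then-set idea, assuming `__eq__` and `__hash__` are both defined on the (id, name) pair and that timestamps are YYYY-MM-DD strings, the example below starts from deliberately unsorted data. It depends on CPython's behaviour of leaving an existing element in place when an equal one is added to a set.

```python
class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __eq__(self, other):
        if not isinstance(other, CustomObject):
            return NotImplemented
        return (self.id, self.name) == (other.id, other.name)

    def __hash__(self):
        return hash((self.id, self.name))


# Deliberately unsorted: the later duplicates come first
data = [
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(2, "Bob", "2023-01-05"),
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-02"),
]

# Sorting ascending by timestamp places the earliest object of each
# duplicate group first, and the set keeps that first occurrence
unique = set(sorted(data, key=lambda obj: obj.timestamp))

# Sets are unordered, so sort again to get a stable order for display
result = sorted(unique, key=lambda obj: obj.id)
print([(o.id, o.name, o.timestamp) for o in result])
```

The sort makes this O(n log n) rather than the single O(n) pass of a dictionary-based approach, which is worth weighing for large lists.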

Posted 2023-05-05

g52tjvyc4#

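Implementing __gt__ in the class makes the code more concise. Objects are compared by their timestamp value, which is assumed to be in YYYY-MM-DD format. This will not work for other date/time formats, because it is just a lexicographic string comparison: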

class CustomObject:
    def __init__(self, _id, name, timestamp):
        self._id = _id
        self._name = name
        self._timestamp = timestamp
    def key(self):
        return self._id, self._name
    def __gt__(self, other):
        return isinstance(other, type(self)) and self._timestamp > other._timestamp
    def __str__(self):
        return f'ID={self._id}, name={self._name}, timestamp={self._timestamp}'

data = [
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(2, "Bob", "2023-01-05"),
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-02"),
]

results = dict()

for obj in data:
    if (co := results.get(key := obj.key())) is None or co > obj:
        results[key] = obj

print(*results.values(), sep='\n')

Output:

ID=1, name=Alice, timestamp=2023-01-01
ID=2, name=Bob, timestamp=2023-01-02
ID=3, name=Eve, timestamp=2023-01-04
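
For timestamps that do not sort correctly as plain strings, one option is to parse them before comparing. The sketch below assumes a DD/MM/YYYY format purely for illustration; `DATE_FORMAT` and the `parsed` helper are hypothetical names, and the sample data is made up to show a case where string comparison would pick the wrong object.

```python
from datetime import datetime


class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp


DATE_FORMAT = "%d/%m/%Y"  # assumed input format, adjust to the real data


def parsed(obj):
    """Return the object's timestamp as a datetime for chronological comparison."""
    return datetime.strptime(obj.timestamp, DATE_FORMAT)


data = [
    CustomObject(1, "Alice", "03/02/2023"),  # 3 February 2023
    CustomObject(1, "Alice", "15/01/2023"),  # 15 January 2023: earlier, but larger as a string
    CustomObject(2, "Bob", "05/01/2023"),
]

earliest = {}
for obj in data:
    key = (obj.id, obj.name)
    # Compare parsed dates rather than raw strings
    if key not in earliest or parsed(obj) < parsed(earliest[key]):
        earliest[key] = obj

print([(o.id, o.name, o.timestamp) for o in earliest.values()])
```

Here "03/02/2023" is smaller than "15/01/2023" as a string, so a lexicographic comparison would keep the February object, while the parsed comparison keeps the January one.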

Posted 2023-05-05
