How can I efficiently filter duplicate objects out of a list in Python based on multiple attributes?

Posted by xriantvc on 2023-05-05 in Python

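I'm working on a Python project with a list of custom objects, and I need to filter out duplicates based on several of their attributes. Each object has three attributes: id, name, and timestamp. I want to treat an object as a duplicate when both its id and its name match those of another object in the list. The timestamp attribute should not be considered when identifying duplicates.
Here is an example of the custom object class: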

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

And a sample list of objects:

data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

In this example, I want to remove the duplicates and keep the object with the earliest timestamp in each group.
The expected output is:

[
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(3, "Eve", "2023-01-04"),
]

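I know I could use a loop to compare every object against every other object in the list, but I'm worried about performance, especially as the list grows. Is there a more efficient way to do this in Python, perhaps with built-in functions or a library?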

q43xntqr1#

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

import pandas as pd

data2 = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

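# Build one row per object from its attribute dict (id, name, timestamp).
df = pd.DataFrame([vars(obj) for obj in data2])

# sort_values returns a new DataFrame, so assign the result back; sorting puts
# the earliest timestamp first within each (id, name) group.
df = df.sort_values(['id', 'name', 'timestamp'])
# Keep only the first (earliest) row of each (id, name) group.
df = df.drop_duplicates(subset=['id', 'name'], keep='first')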
print(df)

Output:

   id   name   timestamp
0   1  Alice  2023-01-01
1   2    Bob  2023-01-02
3   3    Eve  2023-01-04

h9vpoimq2#

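You can use a dictionary keyed on the id and name attributes to track the unique objects, replacing the stored object whenever you encounter one with an earlier timestamp. Here is a solution that should be more efficient than nested loops, since it makes a single pass over the list: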

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __repr__(self):
        return f"CustomObject({self.id}, {self.name}, {self.timestamp})"

data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

unique_objects = {}
for obj in data:
    key = (obj.id, obj.name)
    if key not in unique_objects or obj.timestamp < unique_objects[key].timestamp:
        unique_objects[key] = obj

filtered_data = list(unique_objects.values())

print(filtered_data)
# Output: [CustomObject(1, Alice, 2023-01-01), CustomObject(2, Bob, 2023-01-02), CustomObject(3, Eve, 2023-01-04)]
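
The same dictionary pass can be wrapped in a reusable helper. The following is only a minimal sketch: `dedupe_keep_min` and its `key` and `order` parameters are hypothetical names introduced here for illustration, not part of the answer above.

```python
from operator import attrgetter


class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __repr__(self):
        return f"CustomObject({self.id}, {self.name}, {self.timestamp})"


def dedupe_keep_min(items, key, order):
    """Keep one item per key(item): the one with the smallest order(item).

    Hypothetical helper for illustration. It makes a single pass over the
    input and returns survivors in the order their keys were first seen.
    """
    best = {}
    for item in items:
        k = key(item)
        # Store the item if its key is new or it sorts before the stored one.
        if k not in best or order(item) < order(best[k]):
            best[k] = item
    return list(best.values())


data = [
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(2, "Bob", "2023-01-02"),
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-05"),
]

# attrgetter("id", "name") returns an (id, name) tuple, which is hashable.
filtered = dedupe_keep_min(
    data,
    key=attrgetter("id", "name"),
    order=attrgetter("timestamp"),
)
print(filtered)
```

Passing the key and ordering as functions keeps the class itself untouched, which may be convenient when the same objects need to be deduplicated on different attributes elsewhere.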

ibps3vxo3#

Modify the class slightly so that its instances can be used in a set:

class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp
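    def __eq__(self, other):
        # Two objects are duplicates when both id and name match.
        if not isinstance(other, CustomObject):
            return NotImplemented
        return (self.id, self.name) == (other.id, other.name)
    def __hash__(self):
        # Hash the same fields used by __eq__ so equal objects hash alike.
        return hash((self.id, self.name))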

Now you can create a set from the list:

set(data)

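If the data is not already sorted by date, sort it by timestamp first. When a set is built from a list, the first of several equal elements is the one that is kept, so sorting ensures that the earliest object in each group survives. Note that a set does not preserve the order of the original list.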
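
A minimal sketch of the sort-then-deduplicate step, assuming `__eq__` and `__hash__` are both based on the `(id, name)` pair (an assumption made for this example). It swaps `set` for `dict.fromkeys`, which deduplicates the same way but also keeps the sorted order:

```python
class CustomObject:
    def __init__(self, id, name, timestamp):
        self.id = id
        self.name = name
        self.timestamp = timestamp

    def __eq__(self, other):
        # Assumption for this sketch: equality means same id and same name.
        if not isinstance(other, CustomObject):
            return NotImplemented
        return (self.id, self.name) == (other.id, other.name)

    def __hash__(self):
        # Must be built from the same fields as __eq__.
        return hash((self.id, self.name))


# Deliberately out of date order.
data = [
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(2, "Bob", "2023-01-05"),
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-02"),
]

# Sort by timestamp so the earliest object of each group is seen first.
by_time = sorted(data, key=lambda obj: obj.timestamp)

# dict.fromkeys keeps the first of several equal keys and preserves
# insertion order, so the result is ordered by earliest timestamp.
unique = list(dict.fromkeys(by_time))

for obj in unique:
    print(obj.id, obj.name, obj.timestamp)
```

Using `set(by_time)` instead would keep the same three objects but return them in no particular order.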

g52tjvyc4#

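You can make the code more concise by implementing __gt__ in the class. Objects are compared by their timestamp value, which is assumed to be in YYYY-MM-DD format. Because this is a plain lexicographic string comparison, it only works for formats whose lexical order matches chronological order (such as zero-padded YYYY-MM-DD); other date formats would need to be parsed first: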

class CustomObject:
    def __init__(self, _id, name, timestamp):
        self._id = _id
        self._name = name
        self._timestamp = timestamp
    def key(self):
        return self._id, self._name
    def __gt__(self, other):
        return isinstance(other, type(self)) and self._timestamp > other._timestamp
    def __str__(self):
        return f'ID={self._id}, name={self._name}, timestamp={self._timestamp}'

data = [
    CustomObject(1, "Alice", "2023-01-03"),
    CustomObject(2, "Bob", "2023-01-05"),
    CustomObject(1, "Alice", "2023-01-01"),
    CustomObject(3, "Eve", "2023-01-04"),
    CustomObject(2, "Bob", "2023-01-02"),
]

results = dict()

for obj in data:
    if (co := results.get(key := obj.key())) is None or co > obj:
        results[key] = obj

print(*results.values(), sep='\n')

Output:

ID=1, name=Alice, timestamp=2023-01-01
ID=2, name=Bob, timestamp=2023-01-02
ID=3, name=Eve, timestamp=2023-01-04
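
If the timestamps are not in a lexically sortable format, one option is to parse them before comparing. The sketch below assumes day/month/year strings; the `DATE_FORMAT` constant and the `when` method are hypothetical additions for illustration and are not part of the answer above.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumed day/month/year input, for illustration


class CustomObject:
    def __init__(self, _id, name, timestamp):
        self._id = _id
        self._name = name
        self._timestamp = timestamp

    def key(self):
        return self._id, self._name

    def when(self):
        # Hypothetical helper: parse the string so comparisons are chronological.
        return datetime.strptime(self._timestamp, DATE_FORMAT)

    def __gt__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.when() > other.when()


data = [
    CustomObject(1, "Alice", "03/01/2023"),  # 3 January
    CustomObject(1, "Alice", "01/02/2023"),  # 1 February: lexically smaller, but later
    CustomObject(2, "Bob", "05/01/2023"),
]

results = {}
for obj in data:
    current = results.get(obj.key())
    # Replace the stored object only if it is chronologically later than obj.
    if current is None or current > obj:
        results[obj.key()] = obj

kept = [obj._timestamp for obj in results.values()]
print(kept)
```

A plain string comparison would have picked "01/02/2023" for Alice, because it sorts before "03/01/2023" as text even though it is the later date.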
