Efficient filtering of lists in a Python dict of lists

643ylb08 posted on 2023-03-11 in Python

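I am working with some fairly large datasets (500,000 data points, each with 30 variables) and want to find the most efficient way to filter them.
For compatibility with existing code, the data is structured as a dict of lists. It cannot be converted (for example, to a pandas DataFrame) and must be filtered in place.
Working example: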

data = {'Param0':['x1','x2','x3','x4','x5','x6'],
        'Param1':['A','A','A','B','B','C'],
        'Param2': [100,200,150,80,90,50],
        'Param3': [20,60,40,30,30,5]}

# Param0 keys to keep
keep = ['x2', 'x4']

filtered = {k: [x for i, x in enumerate(v) if data['Param0'][i] in keep] for k, v in data.items()}

The resulting filtered dict gives the desired output, but this is very slow at scale.
Is there a faster way?

b1zrtrql1#

I would do it like this:

keep_idx = [i for i, v in enumerate(data['Param0']) if v in keep]
filtered = {k: [v[i] for i in keep_idx] for k, v in data.items()}
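
One further tweak that may help when keep itself is long, offered as an untimed sketch: converting keep to a set first turns each "v in keep" test into a hash lookup instead of a scan of the list. The helper name filter_by_keys and its key_column parameter are made up for this illustration and are not part of the answer above.

```python
def filter_by_keys(data, keep, key_column="Param0"):
    """Return a new dict of lists with only the rows whose key is in keep.

    filter_by_keys is a hypothetical helper used for illustration only.
    """
    keep_set = set(keep)  # hash-based membership test instead of a list scan
    # Walk the key column once and remember the positions to keep.
    keep_idx = [i for i, v in enumerate(data[key_column]) if v in keep_set]
    # Index every column with the same precomputed positions.
    return {k: [v[i] for i in keep_idx] for k, v in data.items()}


data = {
    "Param0": ["x1", "x2", "x3", "x4", "x5", "x6"],
    "Param1": ["A", "A", "A", "B", "B", "C"],
    "Param2": [100, 200, 150, 80, 90, 50],
    "Param3": [20, 60, 40, 30, 30, 5],
}
filtered = filter_by_keys(data, ["x2", "x4"])
print(filtered)
```

The order of the kept rows follows the order in the key column, not the order of keep.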

Timing

import numpy as np
from timeit import timeit

# Solution in question
def test_1(data, keep):
    return {
        k: [x for i, x in enumerate(v) if data['Param0'][i] in keep]
        for k, v in data.items()
    }

# First solution from @I'mahdi
def test_2(data, keep):
    keep_idx = [i for i, v in enumerate(data['Param0']) if v in keep]
    return {
        k: [val for i, val in enumerate(v) if i in keep_idx]
        for k, v in data.items()
    }

# Second solution from @I'mahdi
def test_3(data, keep):
    keep_idx = [i for i, v in enumerate(data['Param0']) if v in keep]
    return {k: list(np.asarray(v)[keep_idx]) for k, v in data.items()}

# Solution in this answer
def test_4(data, keep):
    keep_idx = [i for i, v in enumerate(data['Param0']) if v in keep]
    return {k: [v[i] for i in keep_idx] for k, v in data.items()}

data = {f"Param{i}": list(range(10_000)) for i in range(20)}
keep = list(range(0, 10_000, 100))

print(test_1(data, keep) == test_2(data, keep))
print(test_2(data, keep) == test_3(data, keep))
print(test_3(data, keep) == test_4(data, keep))

for i in range(1, 5):
    t = timeit(f"test_{i}(data, keep)", globals=globals(), number=10)
    print(f"Solution {i}: {t:.3f}")

Results (timeit total for 10 calls; presumably seconds, timeit's default unit):

Solution 1: 4.571
Solution 2: 4.220
Solution 3: 0.298
Solution 4: 0.219
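
The gap is most likely explained by repeated membership tests: Solutions 1 and 2 run an "in" check against a list for every element of every column, while Solutions 3 and 4 compute the index list once and then only index. Another way to reuse a single pass over the key column, shown here as an untimed sketch, is to build a boolean mask once and apply it with itertools.compress. The name filter_with_mask is hypothetical.

```python
from itertools import compress


def filter_with_mask(data, keep, key_column="Param0"):
    """Filter every column with one boolean mask built from key_column.

    filter_with_mask is a hypothetical helper used for illustration only.
    """
    keep_set = set(keep)
    # One True/False flag per row, computed a single time.
    mask = [v in keep_set for v in data[key_column]]
    # compress() yields the items of v whose mask entry is true.
    return {k: list(compress(v, mask)) for k, v in data.items()}


# Smaller version of the benchmark data used above.
data = {f"Param{i}": list(range(1_000)) for i in range(5)}
keep = list(range(0, 1_000, 100))
result = filter_with_mask(data, keep)
print(result["Param0"])
```

Whether this beats plain indexing depends on how many rows are kept, so it is worth timing on the real data before choosing.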

qcuzuvrc2#

Creating a look_up_idx first may be a better idea:

look_up_idx = [idx for idx, v in enumerate(data['Param0']) if v in keep]
filtered = {k: [val for idx, val in enumerate(v) if idx in look_up_idx] for k, v in data.items()}
print(filtered)

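Or, using numpy:

import numpy as np

look_up_idx = [idx for idx, v in enumerate(data['Param0']) if v in keep]
filtered = {k: list(np.asarray(v)[look_up_idx]) for k, v in data.items()}
print(filtered)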
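
Since the question asks for filtering in place, here is a minimal sketch of one way to read that requirement, assuming "in place" means the existing dict and list objects must be kept rather than replaced. Slice assignment (column[:] = ...) overwrites the contents of each list, so any other reference to that list sees the filtered data. The name filter_in_place is hypothetical.

```python
def filter_in_place(data, keep, key_column="Param0"):
    """Filter every list in data in place, keeping the same list objects.

    filter_in_place is a hypothetical helper used for illustration only.
    """
    keep_set = set(keep)
    # Compute the positions before any column is modified.
    keep_idx = [i for i, v in enumerate(data[key_column]) if v in keep_set]
    for column in data.values():
        # Slice assignment mutates the existing list instead of rebinding it.
        column[:] = [column[i] for i in keep_idx]


data = {
    "Param0": ["x1", "x2", "x3", "x4", "x5", "x6"],
    "Param1": ["A", "A", "A", "B", "B", "C"],
    "Param2": [100, 200, 150, 80, 90, 50],
    "Param3": [20, 60, 40, 30, 30, 5],
}
param2 = data["Param2"]  # an outside reference to one of the lists
filter_in_place(data, ["x2", "x4"])
print(data)
print(param2 is data["Param2"])
```

A temporary list is still built for each column, so this preserves object identity but does not reduce peak memory to zero extra.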
