python 如何更快地获得两个大列表之间的索引关系列表?

lokaqttq  于 2023-04-28  发布在  Python
关注(0)|答案(4)|浏览(159)

问题给出了以下两个列表。

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

在两个具有巨大大小的列表之间(假设引用列表是a),使用列表(atob)解析方法创建了一个列表,该列表指示相同元素在相对数组(b)中的位置。

atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b: {atob}")

当列表大小很小时,它不会花费很长时间。然而,我意识到获取列表atob的过程相当耗时。
有没有更快获取列表atob的方法?还是目前没有办法?
(回答后编辑。这次修订的目的是为了将来的读者。)非常感谢您的回复!根据答案进行代码分析。

  • 检查输出

使用num = 20进行结果比较。

import numpy as np
import random as rnd
import time

# set lists
num = 20
a = list(range(num))
# b = [rnd.randint(0, num) for r in range(num)] # Duplicate numbers occur among the elements in the list
b = rnd.sample(range(0, num), num)
print(f"list a: {a}")
print(f"list b: {b}\n")

# set array as same as lists
arr_a = np.array(range(num))
arr_b = np.array(rnd.sample(range(0, num), num))

# --------------------------------------------------------- #
# existing method
ck_time = time.time()
atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b (existed): {atob}, type: {type(atob)}")
print(f"running time (existed): {time.time() - ck_time}\n")
ck_time = time.time()

# dankal444 method
dk = {val: idx for idx, val in enumerate(b)}
atob_dk = [dk.get(n) for n in a] # same as atob_dk = [d.get(n) for n in range(num)]
print(f"index a to b (dankal): {atob_dk}, type: {type(atob_dk)}")
print(f"running time (dankal): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_dk)}\n")
ck_time = time.time()

# smp55 method
comb = np.array([a, b]).transpose()
atob_smp = comb[comb[:, 1].argsort()][:, 0]
print(f"index a to b (smp55): {atob_smp}, type: {type(atob_smp)}")
print(f"running time (smp55): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_smp)}\n")
ck_time = time.time()

# Roman method
from numba import jit
@jit(nopython=True)
def find_argmin(_a, _b):
    out = np.empty_like(_a)  # allocating result array
    for i in range(_a.shape[0]):
        out[i] = np.abs(np.subtract(_b, _a[i])).argmin()
    return out

atob_rom = find_argmin(arr_a, arr_b)
print(f"index a to b (Roman): {atob_rom}, type: {type(atob_rom)}")
print(f"running time (Roman): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_rom)}\n")
ck_time = time.time()

# Alain method
from bisect import bisect_left
ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b))}  # unique first pos
sb   = sorted(ub.items())                                 # sorted to bisect
ib   = (bisect_left(sb,(n,0)) for n in a)                 # index of >= val
rb   = ((sb[i-1],sb[min(i,len(sb)-1)]) for i in ib)       # low and high pairs
atob_ala = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j      # closest index
               for ((lo,i),(hi,j)),n in zip(rb,a) ]
print(f"index a to b (Alain): {atob_ala}, type: {type(atob_ala)}")
print(f"running time (Alain): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ala)}\n")
ck_time = time.time()

# ken method
b_sorted, b_sort_indices = np.unique(b, return_index=True)
def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]

atob_ken = [find_nearest(ai) for ai in a]
print(f"index a to b (ken): {atob_ken}, type: {type(atob_ken)}")
print(f"running time (ken): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ken)}\n")
ck_time = time.time()

上面的代码结果是

list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
list b: [9, 12, 0, 2, 3, 15, 4, 16, 13, 6, 7, 18, 14, 10, 1, 8, 5, 17, 11, 19]

index a to b (existed): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (existed): 0.00024008750915527344

index a to b (dankal): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (dankal): 1.5497207641601562e-05
equal to exist method: True

index a to b (smp55): [ 2 14  3  4  6 16  9 10 15  0 13 18  1  8 12  5  7 17 11 19], type: <class 'numpy.ndarray'>
running time (smp55): 0.00020551681518554688
equal to exist method: True

index a to b (Roman): [17 11  1  6 16 14  9  4  8  3  5 12  7  2 19 15 18 13  0 10], type: <class 'numpy.ndarray'>
running time (Roman): 0.5710980892181396
equal to exist method: False

index a to b (Alain): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (Alain): 3.552436828613281e-05
equal to exist method: True

index a to b (ken): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (ken): 0.00011754035949707031
equal to exist method: True
  • 增加列表大小的运行时间

如果我用num = 1000000运行代码

running time (dankal): 0.45094847679138184

running time (smp55): 0.36011743545532227

running time (Alain): 2.178112030029297

running time (ken): 2.663684368133545

(With Roman的方法,很难检查尺寸增加的时间。)
从内存的Angular 来看,也需要检查,但首先,@smp55方法是根据回复所需时间获得列表的最快方法。(我相信还有其他更好的方法。)
再次感谢大家的关注和回复!!!
(欢迎以后的答复和评论。如果有人有好主意,分享一下就好了!)

zengzsys

zengzsys1#

我可以给予一个非常快速的具体答案,因为你的第一个列表只是一个索引。如果你将它们组合在一个2D数组中,你可以按第二个列表排序,这会把第一个列表(第二个列表的索引)按你想要的结果顺序排列:

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

comb = np.array([a, b]).transpose()
atob = comb[comb[:, 1].argsort()][:,0]

耗时约0.08秒。现在,atob中的第一个项目是b中的索引,a中的第一个项目出现在这里。atob中的第二项是a的第二项在b中的索引,以此类推。

gdx19jrr

gdx19jrr2#

正如@dankal444在评论部分所建议的那样,排序是一个很好的方法。下面的代码完美地复制了代码的结果,但是执行时间大约为0。在我的电脑上25秒。

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = np.array([rnd.randint(0, num) for _ in range(num)])

# You can also create them with numpy like this.
# a = np.arange(num)
# b = np.random.randint(0, num, size=num)

# This will sort and remove duplicate items at the same time.
b_sorted, b_sort_indices = np.unique(b, return_index=True)

def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]

atob = [find_nearest(ai) for ai in a]

也许我们还可以通过对a进行排序来使它更快,但我不知道你希望它有多快,所以我现在就把这个作为我的答案。

xkftehaa

xkftehaa3#

您可以使用字典来构建一个b值列表,这些值与它们第一次出现的索引相关联。然后对这些进行排序以允许二分搜索(使用二等分)
对已排序的数据使用二分搜索,找到a的每个值的位置。这将是b中下一个更高的值。
排序列表中的前一项给出了前一个较低值的值和位置。
最后,根据a中每个数字的差值选择上一个或下一个值的索引:

import random as rnd
from bisect import bisect_left
    
num = 10 # 0000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b)) } # unique first pos
sb   = sorted(ub.items())                                 # allow bisect
ib   = ( bisect_left(sb,(n,0)) for n in a )               # index of >= val
lhb  = ( (sb[i-1],sb[min(i,len(sb)-1)]) for i in ib )     # low and high pairs
atob = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j          # closest index
           for ((lo,i),(hi,j)),n in zip(lhb,a) ]

print(a)
print(b)
print(atob)

输出:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[6, 1, 7, 9, 4, 5, 7, 4, 6, 10]
[1, 1, 1, 4, 4, 5, 0, 2, 2, 3]

num=100000的时间为0。19秒
时间复杂度为O(NlogN)

wkyowqbh

wkyowqbh4#

在你的例子中,我建议使用Numba(Python的即时编译器)来加速numpy的计算:

import numpy as np
import numba
from numba import jit
import random

num = 100000
a = np.array(range(num))
b = np.array([random.randint(0, num) for r in range(num)])

@jit(nopython=True)
def find_argmins(a, b):
    out = np.empty_like(a)  # allocating result array
    for i in range(a.shape[0]):
        out[i] = np.abs(np.subtract(b, a[i])).argmin()
    return out

运行:

find_argmins(a, b)
array([69772, 69772, 32964, ...,  7648, 92904,  4006])

时间性能(在具有2 Gb内存的小型虚拟机上):

%timeit find_argmins(a, b)
12 s ± 739 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

相关问题