numpy: Why are np.hypot and np.subtract.outer so much faster than vanilla broadcasting? Is there a faster way to compute a distance matrix?

Posted by weylhg0b 12 months ago in Other


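I have two large sets of 2D points and need to compute the distance matrix between them. I need it to be fast, so I use NumPy broadcasting. I have two ways of computing the distance matrix, and I don't understand why one is better than the other.
I looked here and got contradictory results. Cells [3, 4, 5] and [8, 9] both compute the distance matrix, but Cells 3+4, which use subtract.outer, are much faster than Cell 8, which uses plain broadcasting, and Cell 5, which uses hypot, is much faster than Cell 9, which is the straightforward way. I did not try a Python loop, assuming it would never finish.
1. Is there a faster way to compute the distance matrix?
2. Why are hypot and subtract.outer faster?
Code (I changed the seed to prevent cache reuse):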

### Cell 1
import numpy as np

np.random.seed(858442)

### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms

### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])

CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s

### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])

CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s

### Cell 5
%%time
h = np.hypot(d0, d1)

CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s

### Cell 6
np.random.seed(773228)

### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms

### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]

CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s

### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)

CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s
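As a small-scale sanity check, the two approaches timed above can be compared on arrays that fit comfortably in memory. This is only a sketch: the sizes (500 and 300 points) and the `default_rng` seed are illustrative assumptions, not values from the question. It shows that both paths produce the same matrix, and that the slices `d[:, :, 0]` and `d[:, :, 1]` used in Cells 8 and 9 are strided views rather than contiguous arrays, which may be one more reason those cells read memory less efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed, not from the question
obs = rng.random((500, 2))       # small stand-in for the 50000 x 2 array
interp = rng.random((300, 2))    # small stand-in for the 30000 x 2 array

# Cells 3-5: one outer difference per coordinate, then hypot.
h_outer = np.hypot(
    np.subtract.outer(obs[:, 0], interp[:, 0]),
    np.subtract.outer(obs[:, 1], interp[:, 1]),
)

# Cells 8-9: one broadcast difference of shape (500, 300, 2), then sqrt of squares.
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]
h_broadcast = np.sqrt(d0**2 + d1**2)

print(h_outer.shape)                       # (500, 300)
print(np.allclose(h_outer, h_broadcast))   # True
print(d0.flags['C_CONTIGUOUS'])            # False: d0 is a strided view into d
```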


Source: https://stackoverflow.com/questions/68591676/why-are-np-hypot-and-np-subtract-outer-very-fast-compared-to-vanilla-broadcast-a

2 answers


6jjcrrmo1#

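First of all, d0 and d1 each take 50000 x 30000 x 8 bytes = 12 GB, which is quite big. Make sure you have more than 100 GB of memory, because that is what the whole script requires! This is a huge amount of memory. If you do not have enough, the operating system will use a storage device (e.g. swap) to hold the excess data, and that is much slower. In fact, there is no reason for Cell 4 to be slower than Cell 3. My guess is that you did not have enough memory left to store d1 (entirely) in RAM, while d0 seems to have fit (mostly) in memory. On my machine, when both fit in RAM, there is no difference (you can also reverse the order of the operations to check this). This also explains why the later operations tend to get slower.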
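The arithmetic above can be reproduced in a few lines. This is a rough sketch that counts only the full-size float64 matrices held at once in Cells 3 to 5 (d0, d1 and h); the variable names are illustrative, and it makes no attempt to model the rest of the script or the interpreter's own overhead.

```python
import numpy as np

n_obs, n_interp = 50_000, 30_000
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element

# One full obs-by-interp matrix of float64 values, in GB (10**9 bytes).
one_matrix_gb = n_obs * n_interp * itemsize / 1e9
print(one_matrix_gb)        # 12.0

# Cells 3-5 keep d0, d1 and h alive at the same time.
outer_path_gb = 3 * one_matrix_gb
print(outer_path_gb)        # 36.0
```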
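That said, Cells 8+9 are also slower because they create temporary arrays and need more passes over memory to compute the result. The expression np.sqrt(d0**2 + d1**2) first computes d0**2 in memory, producing a new 12 GB temporary array, then computes d1**2, producing another 12 GB temporary array, then sums the two temporaries into yet another 12 GB temporary array, and finally computes the square root, producing one more 12 GB array. This can require up to 48 GB of memory and takes 4 memory-bound read/write passes. This is inefficient and makes poor use of the CPU and RAM (e.g. the CPU cache).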
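One way to avoid the temporaries while staying in plain NumPy is to pass `out=` to each ufunc so that results overwrite existing buffers. The helper below is a sketch, not something from the answer: `hypot_inplace` is a hypothetical name, it destroys the contents of both inputs, and it still makes four passes over memory. It only removes the extra allocations.

```python
import numpy as np

def hypot_inplace(d0, d1):
    """Overwrite d0 with sqrt(d0**2 + d1**2). d1 is overwritten with d1**2."""
    np.square(d0, out=d0)     # d0 <- d0**2, no new array
    np.square(d1, out=d1)     # d1 <- d1**2, no new array
    np.add(d0, d1, out=d0)    # d0 <- d0 + d1
    np.sqrt(d0, out=d0)       # d0 <- sqrt(d0)
    return d0

# Tiny usage example with values whose distances are exact.
d0 = np.array([[3.0, 5.0]])
d1 = np.array([[4.0, 12.0]])
h = hypot_inplace(d0, d1)     # h holds 5.0 and 13.0, and is the same object as d0
```

np.hypot(d0, d1, out=d0) achieves the same result in a single call, which fits the question's observation that hypot is the faster route.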
There is a faster implementation that does the whole computation in one pass and runs in parallel using Numba's JIT. Here is an example:

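import numba as nb
import numpy as np

@nb.njit(parallel=True)
def distanceMatrix(a, b):
    # Single pass: each distance is computed and written once, with no temporary arrays.
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):  # rows are distributed across threads
        for j in range(b.shape[0]):
            res[i, j] = np.sqrt((a[i, 0] - b[j, 0])**2 + (a[i, 1] - b[j, 1])**2)
    return res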

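This implementation uses 3 times less memory (only 12 GB) and is much faster than the one using subtract.outer. In fact, because of swapping, Cells 3+4+5 take several minutes, while this one takes 1.3 seconds!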

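The takeaway is that memory accesses are expensive, and so are temporary arrays. When working with huge buffers, you need to avoid making multiple passes over memory, and to take advantage of the CPU cache when the computation is not trivial (e.g. by working on chunks of the arrays).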
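The chunking idea mentioned above can be sketched in plain NumPy as follows. This is not the Numba implementation from this answer: `distance_matrix_chunked` and its `chunk_rows` default are hypothetical, and the chunk size is untuned. The point is only that each block of rows is differenced and passed through hypot while it is still small, so the only full-size array ever allocated is the result.

```python
import numpy as np

def distance_matrix_chunked(a, b, chunk_rows=1024):
    """Pairwise Euclidean distances between 2D point sets a (N, 2) and b (M, 2)."""
    out = np.empty((a.shape[0], b.shape[0]), dtype=np.float64)
    bx, by = b[:, 0], b[:, 1]
    for start in range(0, a.shape[0], chunk_rows):
        stop = min(start + chunk_rows, a.shape[0])
        block = out[start:stop]                       # view into the result rows
        # x differences go straight into the result block.
        np.subtract(a[start:stop, 0, np.newaxis], bx, out=block)
        # y differences need one chunk-sized temporary.
        dy = a[start:stop, 1, np.newaxis] - by
        np.hypot(block, dy, out=block)                # in-place on the block
    return out

# Usage on small random inputs; 7 rows per chunk does not divide 50 evenly,
# so the last, shorter chunk is exercised too.
rng = np.random.default_rng(1)
a = rng.random((50, 2))
b = rng.random((30, 2))
dist = distance_matrix_chunked(a, b, chunk_rows=7)
```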


9rbhqvlz2#

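Update, thanks to Jérôme Richard here

Stack Overflow never disappoints.
There is a faster way using Numba.
It has a just-in-time compiler that turns Python snippets into fast machine code. The first call is a bit slower than later calls because it includes compilation. But even on the first call, njit with parallel beats hypot + subtract.outer by a 9x margin on a (49000, 12000) matrix.

Performance of various methods

Make sure to use a different seed each time you run the script.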

import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))

d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

# Lazy compilation: the first call includes the compile time.
@nb.njit
def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in range(d0.shape[0]):  # plain range: no parallel loop in this variant
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

# Eager compilation: with an explicit signature, Numba compiles when the function is defined
@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

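def f3(d0, d1):
    print('hypot + subtract.outer')
    return np.hypot(
        np.subtract.outer(d0[:,0], d1[:,0]),
        np.subtract.outer(d0[:,1], d1[:,1])
    )

if __name__ == '__main__':
    # Look the function up by name instead of calling eval on a command-line argument.
    func = {'f1': f1, 'f2': f2, 'f3': f3}[sys.argv[2]]
    s1 = time.perf_counter()
    func(d0, d1)
    print(time.perf_counter() - s1)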

(base) ~/xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521

I will update this post with further developments if I find a faster method.
