numpy: Why is a jitted Numba function slower than the original function?

kjthegm6 posted 12 months ago in Other

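I have written a function that creates uniformly spaced points on a disk. Because it is run often and on relatively large arrays, I expected that applying numba would increase the speed significantly. However, in a quick test the numba function turned out to be more than twice as slow.
Is there a way to find out what is slowing down the numba function?
Here is the function: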

@njit(cache=True)
def generate_points_turbo(centre_point, radius, num_rings, x_axis=np.array([-1, 0, 0]), y_axis=np.array([0, 1, 0])):
    """
    Generate uniformly spaced points inside a circle
    Based on algorithm from:
    http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/
    
    Parameters
    ----------
    centre_point : np.ndarray (1, 3)
    radius : float/int
    num_rings : int
    x_axis : np.ndarray
    y_axis : np.ndarray

    Returns
    -------
    points : np.ndarray (n, 3)

    """
    if num_rings > 0:
        delta_R = 1 / num_rings
        ring_radii = np.linspace(delta_R, 1, int(num_rings)) * radius
        k = np.arange(num_rings) + 1
        points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
        num_points = points_per_ring.sum() + 1
        ring_indices = np.zeros(int(num_rings)+1)
        ring_indices[1:] = points_per_ring.cumsum()
        ring_indices += 1
        points = np.zeros((num_points, 3))

        points[0, :] = centre_point

        for indx in range(len(ring_radii)):
            theta = np.linspace(0, 2 * np.pi, points_per_ring[indx]+1)
            points[ring_indices[indx]:ring_indices[indx+1], :] = ((ring_radii[indx] * np.cos(theta[1:]) * x_axis[:, None]).T
                     + (ring_radii[indx] * np.sin(theta[1:]) * y_axis[:, None]).T)
        return points + centre_point

It is called like this:

centre_point = np.array([0,0,0])
radius = 1
num_rings = 15

generate_points_turbo(centre_point, radius, num_rings )

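If anyone knows why the function becomes slower when compiled with numba, or how to find out what the bottleneck of the numba function is, that would be great.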
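One thing worth ruling out before looking for a bottleneck is the measurement itself: the first call to an `@njit` function also pays for compilation (or for loading the cached machine code), so it should be timed separately from later calls. The helper below is a minimal sketch using only the standard library; `time_first_and_steady` is a hypothetical name introduced here, not part of Numba, and it works on any callable.

```python
import statistics
import time


def time_first_and_steady(fn, *args, repeats=50):
    """Return (first_call_seconds, median_later_call_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)  # for a jitted function, this call also pays for compilation
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - start)
    return first, statistics.median(later)


if __name__ == "__main__":
    # Stand-in workload; pass the jitted function and its arguments instead.
    first, steady = time_first_and_steady(sum, range(10_000))
    print(f"first call: {first * 1e6:.1f} µs, steady state: {steady * 1e6:.1f} µs")
```

If the steady-state time is still slower than the plain NumPy version, the cost is in the function body or in per-call overhead rather than in compilation.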

Update: possible machine-specific size dependence

It looks like the numba function is working, but the crossover point between it being slower and being faster may be hardware-specific.

%timeit generate_points(centre_point, 1, 2)
99.5 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 2)
213 µs ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit generate_points(centre_point, 1, 20)
647 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit generate_points_turbo(centre_point, 1, 20)
314 µs ± 8.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit generate_points(centre_point, 1, 200)
11.9 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit generate_points_turbo(centre_point, 1, 200)
7.9 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

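After about 12-15 rings, the numba function (*_turbo) becomes similar in speed or faster on my machine, but the performance gain at larger sizes is smaller than expected. So it does seem to work; it is just that some part of the function depends heavily on size.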
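To put the three timings in context, it helps to see how much work each ring count represents. The sketch below reuses the points-per-ring formula from the function above to count the output points; `points_per_ring` and `total_points` are helper names introduced here for illustration only. Two rings give just 19 points, so at that size a call is likely dominated by fixed per-call costs rather than by arithmetic.

```python
import numpy as np


def points_per_ring(num_rings):
    """Number of points on rings 1..num_rings, as in the function above."""
    k = np.arange(1, num_rings + 1)
    return np.rint(np.pi / np.arcsin(1 / (2 * k))).astype(np.int64)


def total_points(num_rings):
    """All ring points plus the single centre point."""
    return int(points_per_ring(num_rings).sum()) + 1


if __name__ == "__main__":
    for rings in (2, 20, 200):
        print(rings, "rings ->", total_points(rings), "points")
```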

Answer #1 (ctzwtxfj)

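I got rid of all the transposition / newaxis / 3D stuff that you were not using, and got a 20x speed-up compared to your original solution. I also replaced range with prange, since you do not care in which order your points are computed.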

# Imports.
import matplotlib.pyplot as plt
from numba import njit, prange
import numpy as np

# "Turbo" function.
@njit(cache=True)
def generate_points_turbo(centre_point, radius, num_rings):
    """
    Generate uniformly spaced points inside a circle
    Based on algorithm from:
    http://www.holoborodko.com/pavel/2015/07/23/generating-equidistant-points-on-unit-disk/

    Parameters
    ----------
    centre_point : np.ndarray (2,)
    radius : float/int
    num_rings : int

    Returns
    -------
    points : np.ndarray (n, 2)

    """
    if not num_rings > 0:
        return

    delta_R = 1 / num_rings
    ring_radii = np.linspace(delta_R, 1, num_rings) # Use a unit circle that we will scale only at the end.
    k = np.arange(num_rings) + 1
    points_per_ring = np.rint(np.pi / np.arcsin(1 / (2*k))).astype(np.int32)
    num_points = points_per_ring.sum() + 1

    points = np.zeros((num_points, 2))
    n = 1 # n == 0 is the central point by design.

    # Without parallel=True in the decorator, prange behaves like range.
    for ring_number in prange(len(ring_radii)):
        r = ring_radii[ring_number] # The local radius between 1/num_rings and 1.
        points_on_this_ring = points_per_ring[ring_number]
        # Take one extra sample and drop theta = 0, so that 0 and 2*pi are not both emitted.
        theta = np.linspace(0, 2 * np.pi, points_on_this_ring + 1)[1:]
        points[n: n+points_on_this_ring, 0] = r * np.cos(theta)
        points[n: n+points_on_this_ring, 1] = r * np.sin(theta)
        n += points_on_this_ring

    return points * radius + centre_point


# Test that the result is accurate.
if __name__ == "__main__":

    centre_point = np.array([0, 0])
    radius = 3.14159
    num_rings = 10

    p = generate_points_turbo(centre_point, radius, num_rings)
    fig, ax = plt.subplots()
    ax.set_aspect(1)
    ax.scatter(*p.T)
    fig.show()

# Test time taken (in IPython).
from initial_code import generate_points_turbo as generate_points_turbo_stackoverflow

%timeit generate_points_turbo(centre_point, radius, num_rings)
13.5 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit generate_points_turbo_stackoverflow(np.array([0, 0, 0]), radius, num_rings)
261 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
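The loop in the answer carries the running offset `n` from one iteration to the next, so its iterations are not independent as written. If you later want to experiment with `@njit(parallel=True)`, one option is to precompute each ring's start row with a cumulative sum first. The following plain-NumPy sketch shows only that indexing idea; `generate_points_offsets` is a hypothetical name, it has not been benchmarked, and it is not the answerer's code.

```python
import numpy as np


def generate_points_offsets(centre_point, radius, num_rings):
    """Equidistant points on a disk, with per-ring start offsets precomputed."""
    k = np.arange(1, num_rings + 1)
    points_per_ring = np.rint(np.pi / np.arcsin(1 / (2 * k))).astype(np.int64)

    # starts[i] is the row where ring i begins; row 0 is the centre point.
    starts = np.empty(num_rings + 1, dtype=np.int64)
    starts[0] = 1
    starts[1:] = 1 + np.cumsum(points_per_ring)

    points = np.zeros((starts[-1], 2))
    for i in range(num_rings):
        r = (i + 1) / num_rings  # unit-disk radius of ring i
        # Drop the first angle so 0 and 2*pi are not both emitted.
        theta = np.linspace(0, 2 * np.pi, points_per_ring[i] + 1)[1:]
        points[starts[i]:starts[i + 1], 0] = r * np.cos(theta)
        points[starts[i]:starts[i + 1], 1] = r * np.sin(theta)

    return points * radius + centre_point
```

Because every iteration writes to a slice that depends only on `i`, the rings can be filled in any order without changing the result.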
