numpy 矢量化花费的时间长于循环

lb3vh1jj  于 2023-10-19  发布在  其他
关注(0)|答案(1)|浏览(153)

我的函数计算洛伦兹给定频率,fwhm,amp。我想对它进行矢量化,以便它计算频率,fwhm和amps的列表:

def lorz1(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)

def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)

def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor))
    return main_peak

series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

时间:

%timeit lorz1(series, freq, fwhm, amps)

38.4 ms ± 1.7 ms/循环(平均值±标准差)运行7次,每次循环10次)

%timeit lorz2(series, freq, fwhm, amps)

29.8 ms ± 1.8 ms/循环(平均值±标准差)运行7次,每次循环10次)

%timeit np.sum(np.array([lorz3(series, item1, item2, item3)
                         for (item1,item2,item3) in zip(freq, fwhm, amps)]), axis=0)

24.1 ms ± 5.02 ms/循环(平均值±标准差)运行7次,每次循环10次)
我在lorz1lorz2中的向量化哪里出错了?不是应该比lorz3快吗?

6ljaweal

6ljaweal1#

我使用两个版本做了一些进一步的分析:
版本1:

#!/usr/bin/env python3

import numpy as np

def lorz2(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor         = numerator / denominator
    main_peak   = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)

series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

for _ in range(100):
    lorz2(series, freq, fwhm, amps)

版本2:

#!/usr/bin/env python3

import numpy as np

def lorz3(freq_series, freq, fwhm, amp):
    numerator   = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor         = numerator / denominator
    main_peak   = amp*(lor/np.linalg.norm(lor))
    return main_peak

series = np.linspace(0,100,50000)
freq   = np.random.uniform(5,50,50)
fwhm   = np.random.uniform(0.01,0.05,50)
amps   = np.random.uniform(5,500,50)

for _ in range(100):
    sum(lorz3(series, item1, item2, item3)
        for (item1,item2,item3) in zip(freq, fwhm, amps))

请注意,我是如何将lorz3的求和调整为普通的Python sum的。这在我的测试中更快,因为它避免了临时数组的构造。
以下是我做的一些分析的结果:

perf stat -ddd ./lorz2.py

 Performance counter stats for './lorz2.py':

           2729.16 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
            217114      page-faults:u                    #   79.554 K/sec                     
        8192141440      cycles:u                         #    3.002 GHz                         (38.43%)
        3178961202      instructions:u                   #    0.39  insn per cycle              (46.12%)
         426575242      branches:u                       #  156.303 M/sec                       (53.81%)
           2177628      branch-misses:u                  #    0.51% of all branches             (61.51%)
       42020185035      slots:u                          #   15.397 G/sec                       (69.20%)
         323473974      topdown-retiring:u               #      0.6% Retiring                   (69.20%)
       33616148028      topdown-bad-spec:u               #     67.1% Bad Speculation            (69.20%)
         371211166      topdown-fe-bound:u               #      0.7% Frontend Bound             (69.20%)
       15767347418      topdown-be-bound:u               #     31.5% Backend Bound              (69.20%)
         813550722      L1-dcache-loads:u                #  298.096 M/sec                       (69.19%)
         546814255      L1-dcache-load-misses:u          #   67.21% of all L1-dcache accesses   (69.21%)
          82889242      LLC-loads:u                      #   30.372 M/sec                       (69.22%)
          67633317      LLC-load-misses:u                #   81.59% of all LL-cache accesses    (69.24%)
   <not supported>      L1-icache-loads:u                                                     
           9705348      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.81%)
         864895659      dTLB-loads:u                     #  316.909 M/sec                       (30.79%)
            117310      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.78%)
   <not supported>      iTLB-loads:u                                                          
             85530      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.76%)
   <not supported>      L1-dcache-prefetches:u                                                
   <not supported>      L1-dcache-prefetch-misses:u                                           

       2.729696014 seconds time elapsed

       1.932708000 seconds user
       0.796504000 seconds sys

这里是更快的版本:

Performance counter stats for './lorz3.py':

            878.49 msec task-clock:u                     #    0.999 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             52869      page-faults:u                    #   60.182 K/sec                     
        3704170220      cycles:u                         #    4.217 GHz                         (38.22%)
        3735225800      instructions:u                   #    1.01  insn per cycle              (45.96%)
         568575253      branches:u                       #  647.221 M/sec                       (53.70%)
           2580294      branch-misses:u                  #    0.45% of all branches             (61.43%)
       18355798588      slots:u                          #   20.895 G/sec                       (69.17%)
        3328525030      topdown-retiring:u               #     17.5% Retiring                   (69.17%)
        6982401815      topdown-bad-spec:u               #     36.6% Bad Speculation            (69.17%)
        1297505291      topdown-fe-bound:u               #      6.8% Frontend Bound             (69.17%)
        7459691283      topdown-be-bound:u               #     39.1% Backend Bound              (69.17%)
         858082535      L1-dcache-loads:u                #  976.773 M/sec                       (69.28%)
         430569310      L1-dcache-load-misses:u          #   50.18% of all L1-dcache accesses   (69.40%)
          15723297      LLC-loads:u                      #   17.898 M/sec                       (69.49%)
             73709      LLC-load-misses:u                #    0.47% of all LL-cache accesses    (69.50%)
   <not supported>      L1-icache-loads:u                                                     
          38705486      L1-icache-load-misses:u          #    0.00% of all L1-icache accesses   (30.72%)
         860276161      dTLB-loads:u                     #  979.270 M/sec                       (30.60%)
             86213      dTLB-load-misses:u               #    0.01% of all dTLB cache accesses  (30.51%)
   <not supported>      iTLB-loads:u                                                          
             91069      iTLB-load-misses:u               #    0.00% of all iTLB cache accesses  (30.50%)
   <not supported>      L1-dcache-prefetches:u                                                
   <not supported>      L1-dcache-prefetch-misses:u                                           

       0.878946776 seconds time elapsed

       0.852205000 seconds user
       0.026744000 seconds sys

请注意,在更快的代码中,指令的数量实际上略高,这是有道理的,因为它的向量化程度较低,但每个周期的指令数量更高,使其整体速度更快。在较慢的版本中有两倍多的LLC加载,其中大多数未命中,而这里几乎所有命中。我不知道如何解释topdown-bad-spec计数器。也许其他人可以对此发表评论。
CPU甚至时钟下降(这是可重复的),这支持了它只是在等待内存的想法。
此外,请注意最后一行中的sys时间。lorz2将28%的运行时间花在内核空间上。因为它不做任何与IO相关的事情,所以这是所有的内存分配和释放开销。
我们可以再进一步看一下失速的原因:

perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz2.py

 Performance counter stats for './lorz2.py':

        8446540078      cycles:u                                                              
        1953955881      l1d_pend_miss.l2_stall:u                                              
        3050292324      cycle_activity.stalls_l3_miss:u                                       

       2.748141433 seconds time elapsed

       1.959570000 seconds user
       0.788443000 seconds sys
perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz3.py

 Performance counter stats for './lorz3.py':

        3674547216      cycles:u                                                              
         303870088      l1d_pend_miss.l2_stall:u                                              
          16939496      cycle_activity.stalls_l3_miss:u                                       

       0.869909182 seconds time elapsed

       0.848122000 seconds user
       0.021752000 seconds sys

因此,lorz2版本只是在2级或3级缓存未命中时不断停顿。
我们可以进一步看一个简单的perf report

perf record ./lorz2.py
perf report

# Overhead  Command  Shared Object                                      Symbol                                         
# ........  .......  .................................................  ...............................................
#
    32.44%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
    27.03%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
    17.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
     6.93%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
     6.12%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_pairwise_sum
     5.32%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
     0.51%  python3  [unknown]                                          [k] 0xffffffffb2800fe7
     0.46%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
     0.19%  python3  libpython3.11.so.1.0                               [.] unicodekeys_lookup_unicode
     0.18%  python3  libpython3.11.so.1.0                               [.] gc_collect_main
...
perf record ./lorz3.py
perf report

# Overhead  Command  Shared Object                                      Symbol                                       
# ........  .......  .................................................  .............................................
#
    27.56%  python3  libblas.so.3.11.0                                  [.] ddot_
    27.47%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_divide
     8.64%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_subtract
     8.34%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_add
     5.84%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_multiply
     3.70%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] DOUBLE_square
     1.47%  python3  libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
     1.38%  python3  libgcc_s.so.1                                      [.] execute_cfa_program
     0.89%  python3  libgcc_s.so.1                                      [.] uw_update_context_1
     0.88%  python3  libgcc_s.so.1                                      [.] _Unwind_Find_FDE
     0.64%  python3  libgcc_s.so.1                                      [.] uw_frame_state_for
     0.61%  python3  _multiarray_umath.cpython-311-x86_64-linux-gnu.so  [.] ufunc_generic_fastcall
...

嗯,有意思。点积从何而来?我假设这就是linalg.norm对简单向量的实现方式。
顺便说一句,我们可以通过3种措施稍微加快lorz2lorz3版本的速度:
1.将乘法和求和合并为一个矩阵乘法
1.重新排序一些操作以在较小的数组(或标量)上执行它们
1.将除法替换为乘法,

def lorz2a(freq_series, freq, fwhm, amp):
    numerator   = fwhm[:,None] * (0.5 / np.pi)
    denominator = (freq_series - freq[:,None])**2 + 0.25 * fwhm[:,None]**2
    lor         = numerator / denominator
    return (amp / np.linalg.norm(lor, axis=1)) @ lor

    
def lorz3a(freq_series, freq, fwhm, amp):
    numerator   = fwhm * (0.5 / np.pi)
    denominator = (freq_series - freq)**2 + 0.25 * fwhm**2
    lor         = numerator / denominator
    main_peak   = amp / np.linalg.norm(lor) * lor
    return main_peak

但这并没有改变整体趋势。

总结

Numpy向量化主要有助于减少每次调用的开销。一旦数组足够大,我们就不会从中获得太多好处,因为与计算本身相比,剩余的解释器开销很小。同时,较大的阵列导致存储器效率降低。通常,在L2或L3缓存大小附近有一个最佳点。lorz3实现比其他实现更好地达到了这一点。
对于较小的系列大小和较大的其他阵列大小,我们可以预期lorz2的性能更好。例如,这个数据集使我的lorz2a比我的lorz3a快:

series = np.linspace(0,100,1000)
freq   = np.random.uniform(5,50,2000)
fwhm   = np.random.uniform(0.01,0.05,2000)
amps   = np.random.uniform(5,500,2000)

Numpy的简单、急切的评估方案将调优的责任放在了用户身上。像NumExpr这样的其他库试图避免这种情况。

相关问题