Numpy unusually slow on Mac M1

vlf7wbxs asked on 2023-02-04 in Mac

I ran a simple speed test on my machine:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

%timeit A.dot(B)

The result:

30.3 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compared with the results others typically report (under 10 ms on average), this seems unusually slow, and I would like to know what might be causing it.
My system is macOS Big Sur on an M1 chip, with Python 3.8.13 and numpy 1.22.4, installed via

pip install "numpy==1.22.4"

The output of np.show_config() is:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42
    not found = AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_KNL,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
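
One detail worth noting: the SIMD listing above shows only x86 extensions (SSE/SSSE3/SSE4), which would suggest an x86_64 NumPy wheel running under Rosetta rather than a native arm64 build. A minimal way to check (a sketch added for illustration, not part of the original test) is to ask the interpreter which architecture it is running on:

# Sketch: check whether this Python process runs natively on arm64
# or as an x86_64 process under Rosetta 2.
import platform
import numpy as np

print(platform.machine())   # 'arm64' for a native build, 'x86_64' under Rosetta
print(np.__version__)
np.show_config()            # prints the BLAS/SIMD details quoted above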

Edit:
I ran another test with the following snippet (taken from [1]):

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

My result is:

mean of 10 runs: 6.17438s

whereas the reference results from [1] (run on an M1 Max chip) are:

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosetta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

Judging from these results, my timings are slower than every numpy configuration listed in the reference.
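
One knob that can swing numbers like these when comparing machines (added here as a sketch, not part of the original test) is the OpenBLAS thread count; OpenBLAS reads it from an environment variable that must be set before NumPy is imported:

# Sketch: rerun the SVD loop above with the OpenBLAS thread count pinned,
# so the result is less sensitive to thread scheduling differences.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # must be set before importing numpy

import time
import numpy as np

a = np.random.uniform(size=(300, 300))
start = time.time()
for _ in range(100):
    a += 1
    np.linalg.svd(a)
print(f"{time.time() - start:.5f}s for 100 SVD calls")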

6yjfywim (Answer 1)

I have noticed a similar slowdown on M1, but I think the real cause, at least on my machine, is not a fundamentally broken Numpy installation but rather an issue with the benchmark itself.

In [25]: from scipy import linalg

In [26]: a = np.random.randn(1000,100)

In [27]: %timeit a.T @ a
226 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [28]: x = a.T @ a

In [29]: %timeit linalg.eigh(x)
1.69 ms ± 88.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit linalg.eigh(a.T @ a)
428 ms ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Computing x = a.T @ a followed by eigh(x) takes about 2 ms in total, while eigh(a.T @ a) takes over 400 ms. I think something goes wrong with %timeit in the latter case; perhaps the computation gets routed to the "efficiency cores" for some reason.
My preliminary answer is that your first %timeit benchmark is not reliable.
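
One way to cross-check this suspicion (a sketch of mine, not from the session above) is to bypass %timeit and measure the same expression with the standard-library timeit module, fixing the loop count explicitly:

# Sketch: time eigh(a.T @ a) with timeit.repeat instead of IPython's %timeit,
# using an explicit number of loops, to see whether the ~400 ms figure reproduces.
import timeit
import numpy as np
from scipy import linalg

a = np.random.randn(1000, 100)

per_call = min(timeit.repeat(lambda: linalg.eigh(a.T @ a),
                             number=10, repeat=5)) / 10
print(f"best time per call: {per_call * 1e3:.2f} ms")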

41zrol4v (Answer 2)

If you suspect the timing itself, try measuring with the time module instead:

import time

start = time.time()

# your numpy test here

took = time.time() - start
print("Test took " + str(took) + " seconds.")

For more on numpy on Apple silicon, read the first answer at the link below. For best performance it is recommended to use Apple's Accelerate framework (vecLib). If you install through conda, also check @AndrejHribernik's comment: Why Python native on M1 Max is greatly slower than Python on old Intel i5?
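
To see which BLAS library a given NumPy build actually loads at runtime, one option is the third-party threadpoolctl package; this is a sketch assuming it has been installed with pip install threadpoolctl (it reports OpenBLAS and MKL backends, and may not detect Accelerate, which exposes no thread-control API):

# Sketch: list the BLAS/LAPACK libraries loaded by the current NumPy build.
import numpy as np
from threadpoolctl import threadpool_info

_ = np.ones((2, 2)) @ np.ones((2, 2))   # force the BLAS backend to load
for lib in threadpool_info():
    print(lib.get("internal_api"), lib.get("filepath"), lib.get("num_threads"))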
