这是上一个问题的延续，数据是floatMatrix，其维数为（750000，4）。函数testFunction[x,y,w,z]是一个4个变量的多项式函数，它返回一个4D向量，并应用于所有750000个向量。
对于Mathematica，我使用了一个列表Compile和Parallelization->False。

Compile[{{f, _Real, 1}}, 
 {
   {0.011904761904761973` f[[2]]f[[1]]^3 + 
    0.002976190476190474` f[[1]]f[[2]]^3 - 0.020833333333333325` f[[3]] + 
    0.002976190476190474` f[[3]]^3 + 
    f[[2]]^2 (0.0029761904761904778` f[[3]] +...
   {0.002976190476190483` f[[1]]^3 + 0.011904761904761906` f[[2]]^3 - 
    0.0875` f[[3]] + 0.0029761904761904765` f[[3]]^3 + 
    f[[1]]^2 (0.005952380952380952` f[[2]] +...
 },CompilationTarget -> "C", RuntimeAttributes -> {Listable}, 
  Parallelization -> True];

time = RepeatedTiming[testFunction[floatMatrix]];
Print["In Mathematica-C it takes an average of ", time[[1]], " secs."]

对于Python，我使用NumPy。我有两个实现：
实现1（内部转置）：

def testFunction(data):
    f1, f2, f3, f4 = data.T
    
    results = np.zeros((data.shape[0], 4))  # Initialize a results array

    results[:, 0] = (0.011904761904761973*f2*f1**3 + 0.002976190476190474*f1*f2**3 - 
                    0.020833333333333325*f3 + 0.002976190476190474*f3**3 + f2**2* 
                    (0.0029761904761904778*f3 +...
    results[:, 1] = (0.002976190476190483*f1**3 + 0.011904761904761906*f2**3 - 0.0875*f3
                   + 0.0029761904761904765*f3**3 + f1**2*(0.005952380952380952*f2 + 
                     0.002976190476190469*f3 + 0.0029761904761904726*f4) +...
    return results

duration=0

for i in range(10):
    start_time = time.time()
    testFunction(floatMatrix)
    end_time = time.time()
    duration = duration + end_time - start_time

duration=duration*0.1

print(f"With numpy it takes an average of': {duration} seconds")

实现2（外转）：

def testFunction(f1, f2, f3, f4):
    results = np.zeros((f1.shape[0], 4))  # Initialize a results array

    results[:, 0] = ...
    results[:, 1] = ...
    results[:, 2] = ...
    results[:, 3] = ...

    return results

# Transpose the data outside the function
f1, f2, f3, f4 = floatMatrix.T

duration=0

for i in range(10):
    start_time = time.time()
    testFunction(f1, f2, f3, f4)
    end_time = time.time()
    duration = duration + end_time - start_time

duration=duration*0.1

print(f"With numpy vectorized operations it takes an average of': {duration} seconds")

这是时代：

Mathematica：0.119938秒
Numpy（内部转置）：0.206754秒
Numpy（外部换位）：0.20789377秒

我本来以为NumPy会更快。我可以对Python中的代码进行任何更改，使其比Mathematica编译的函数更快吗？

有几种方法可以让这个NumPy代码更快。一种是打开Numba，这是一种使用编译器来加速NumPy繁重代码的优化器。

import numba as nb

@nb.njit()
def test_function():
    ...

这样做，我得到了6倍的加速。
接下来，我试着打开快速数学。这是一种允许Numba改变某些浮点运算速度的模式。（例如，通常不允许将（a+B）+c重写为a+（B+c），但在fastmath下，这是允许的转换。

import numba as nb

@nb.njit(fastmath=True):
def test_function()
    ...

这提供了3%的加速。
接下来，我注意到，相对于Mathematica执行计算的顺序，执行计算的顺序具有较差的内存局部性。Mathematica一次计算一个向量，但NumPy一次计算表达式的一个分量，这需要它多次将f1列加载到内存中。
为了解决这个问题，我引入了一个显式循环。（详细信息请参见代码部分。）这使我的速度又提高了44%。
结合这些优化，我得到了9.5倍于原始Python实现的加速。（我没有对Mathematica代码进行基准测试，因为我没有副本。

代码

原始（加上修复语法错误的修复程序）

floatMatrix = np.random.rand(1200000, 4)
def test_function_transpose_inside(data):
    f1, f2, f3, f4 = data.T
    
    results = np.zeros((data.shape[0], 4))  # Initialize a results array

    results[:, 0] = (0.011904761904761973*f2*f1**3 + 0.002976190476190474*f1*f2**3 - 
                    0.020833333333333325*f3 + 0.002976190476190474*f3**3 + f2**2* 
                    (0.0029761904761904778*f3))
    results[:, 1] = (0.002976190476190483*f1**3 + 0.011904761904761906*f2**3 - 0.0875*f3
                   + 0.0029761904761904765*f3**3 + f1**2*(0.005952380952380952*f2 + 
                     0.002976190476190469*f3 + 0.0029761904761904726*f4))
    return results

关于Numba

import numba as nb

@nb.njit()
def test_function_numba(data):
    f1, f2, f3, f4 = data.T
    
    results = np.zeros((data.shape[0], 4))  # Initialize a results array

    results[:, 0] = (0.011904761904761973*f2*f1**3 + 0.002976190476190474*f1*f2**3 - 
                    0.020833333333333325*f3 + 0.002976190476190474*f3**3 + f2**2* 
                    (0.0029761904761904778*f3))
    results[:, 1] = (0.002976190476190483*f1**3 + 0.011904761904761906*f2**3 - 0.0875*f3
                   + 0.0029761904761904765*f3**3 + f1**2*(0.005952380952380952*f2 + 
                     0.002976190476190469*f3 + 0.0029761904761904726*f4))
    return results

Numba和Fastmath

import numba as nb

@nb.njit(fastmath=True)
def test_function_numba_fastmath(data):
    f1, f2, f3, f4 = data.T
    
    results = np.zeros((data.shape[0], 4))  # Initialize a results array

    results[:, 0] = (0.011904761904761973*f2*f1**3 + 0.002976190476190474*f1*f2**3 - 
                    0.020833333333333325*f3 + 0.002976190476190474*f3**3 + f2**2* 
                    (0.0029761904761904778*f3))
    results[:, 1] = (0.002976190476190483*f1**3 + 0.011904761904761906*f2**3 - 0.0875*f3
                   + 0.0029761904761904765*f3**3 + f1**2*(0.005952380952380952*f2 + 
                     0.002976190476190469*f3 + 0.0029761904761904726*f4))
    return results

Numba，Fastmath和一个循环

import numba as nb

@nb.njit(fastmath=True)
def test_function_numba_loop(data):
    results = np.zeros((data.shape[0], 4))  # Initialize a results array
    for i in range(data.shape[0]):
        f1, f2, f3, f4 = data[i]
        results[i, 0] = (0.011904761904761973*f2*f1**3 + 0.002976190476190474*f1*f2**3 - 
                        0.020833333333333325*f3 + 0.002976190476190474*f3**3 + f2**2* 
                        (0.0029761904761904778*f3))
        results[i, 1] = (0.002976190476190483*f1**3 + 0.011904761904761906*f2**3 - 0.0875*f3
                       + 0.0029761904761904765*f3**3 + f1**2*(0.005952380952380952*f2 + 
                         0.002976190476190469*f3 + 0.0029761904761904726*f4))
    return results

基准测试结果

%timeit test_function_transpose_inside(floatMatrix)
215 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_function_numba(floatMatrix)
33.6 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_function_numba_fastmath(floatMatrix)
32.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_function_numba_loop(floatMatrix)
22.5 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

进一步优化

如果results的每个元素都至少被写入一次，那么可以用np.empty((data.shape[0], 4))替换np.zeros((data.shape[0], 4))。在我的测试中，这得到了额外的10%的加速。

Numpy比Mathematica慢？

1条答案

代码

基准测试结果

进一步优化

相关问题

热门标签

最新问答