Cython执行速度与MSVC和GCC版本

omvjsjqw  于 2023-08-06  发布在  其他
关注(0)|答案(1)|浏览(148)

简介

我有一个相当简单的Cython模块--为了进行特定的测试,我进一步简化了它。
这个模块只有一个类,它只有一个有趣的方法(名为run):这个方法接受一个Fortran排序的2D NumPy数组和两个1D NumPy数组作为输入,并在这些数组上做一些非常非常简单的事情(代码见下文)。
为了进行基准测试,我在Windows 10 64位,Python 3.9.10 64位,Cython 0.29.32上编译了MSVC,GCC 8,GCC 11,GCC 12和GCC 13的完全相同的模块。我从优秀的布雷希特Sanders GitHub页面(https://github.com/brechtsanders)获得的所有GCC编译器。

主要问题

这篇很长的文章的首要问题是:我只是好奇,是否有人能解释为什么GCC 12和GCC 13比GCC 11慢得多(GCC 11是所有中最快的)。看起来GCC的每个版本的性能都在下降,而不是变得更好...

基准测试

在基准测试中,我只是改变了2D和1D数组的数组维度(mn)以及2D和1D数组中非零条目的数量。我对每个编译器版本、每个mn以及非零项集合重复run方法20次。
我正在使用的优化设置:

MVSC

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']

字符串

GCC

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS


我观察到的是以下情况:

  • MSVC在执行基准测试方面是最慢的(为什么在Windows上会这样?))
  • 进展GCC 8-> GCC 11是有希望的,因为GCC 11比GCC 8更快
  • GCC 12和GCC 13都比GCC 11慢得多,GCC 13是最差的(是GCC 11的两倍慢,比GCC 12差得多)。
    结果表:
  • 运行时间以毫秒(ms)为单位 *

x1c 0d1x的数据

图表(注:对数Y轴!!):

  • 运行时间以毫秒(ms)为单位 *


代码

Cython文件:

###############################################################################

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

from libc.float cimport FLT_MAX

DTYPE_float = np.float32
ctypedef np.float32_t DTYPE_float_t

cdef float SMALL = 1e-10
cdef int MAXSIZE = 1000000

cdef extern from "math.h" nogil:
    cdef float fabsf(float x)

###############################################################################

@cython.final
cdef class CythonLoops:

    cdef int m, n
    cdef int [:] col_starts
    cdef int [:] row_indices
    cdef double [:] x

    def __cinit__(self):

        self.m = 0
        self.n = 0

        self.col_starts  = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.row_indices = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.x = cvarray(shape=(MAXSIZE,), itemsize=sizeof(double), format='d')

    @cython.boundscheck(False) # turn off bounds-checking for entire function
    @cython.wraparound(False)  # turn off negative index wrapping for entire function
    @cython.nonecheck(False)
    @cython.initializedcheck(False)
    @cython.cdivision(True)
    cpdef run(self, DTYPE_float_t[::1, :] matrix,
                    DTYPE_float_t[:]      ub_values,
                    DTYPE_float_t[:]      priority):

        cdef Py_ssize_t i, j, m, n
        cdef int nza, collen
        cdef double too_large, ok, obj
        cdef float ub, element

        cdef int [:] col_starts = self.col_starts
        cdef int [:] row_indices = self.row_indices
        cdef double [:] x = self.x

        m = matrix.shape[0]
        n = matrix.shape[1]

        self.m = m
        self.n = n

        nza = 0
        collen = 0
        for i in range(n):
            for j in range(m+1):
                if j == 0:
                    element = priority[i]
                else:
                    element = matrix[j-1, i]

                if fabsf(element) < SMALL:
                    continue

                if j == 0:
                    obj = <double>element
                    # Do action 1 with external library
                else:
                    collen = nza + 1
                    col_starts[collen] = i+1
                    row_indices[collen] = j
                    x[collen] = <double>element
                    nza += 1

            ub = ub_values[i]

            if ub > FLT_MAX:
                too_large = 0.0
                # Do action 2 with external library
            elif ub > SMALL:
                ok = <double>ub
                # Do action 3 with external library

        # Use x, row_indices and col_starts in the external library

设置文件:

我使用以下代码编译它:

python setup.py build_ext --inplace --compiler=mingw32 gcc13


其中最后一个参数是我想测试的编译器

#!/usr/bin/env python

from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
from Cython.Distutils import build_ext

import numpy as np
import os
import shutil
import sys
import getpass

MODULE = 'loop_cython_%s'

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
MSVC_EXTRA_LINK_ARGS = MSVC_EXTRA_COMPILE_ARGS

def remove_builds(kind):

    for folder in ['build', 'bin']:
        if os.path.isdir(folder):
            if folder == 'bin':
                continue
            shutil.rmtree(folder, ignore_errors=True)

    if os.path.isfile(MODULE + '_%s.c'%kind):
        os.remove(MODULE + '_%s.c'%kind)

def setenv(extra_args, doset=True, path=None, kind='gcc8'):

    flags = ''
    if doset:
        flags = '  '.join(extra_args)

    for key in ['CFLAGS', 'FFLAGS', 'CPPFLAGS']:
        os.environ[key] = flags

    user = getpass.getuser()

    if doset:
        path = os.environ['PATH']
        if kind == 'gcc8':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_8.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc11':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc12':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_12.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc13':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_13.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'msvc':
            os.environ['PATH'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\J0514162\WinPython39\WPy64-39100\python-3.9.10.amd64;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22000.0\x64'
            os.environ['LIB'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\lib\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64'
            os.environ["DISTUTILS_USE_SDK"] = '1'
            os.environ["MSSdk"] = '1'

    else:
        os.environ['PATH'] = path

    return path

class CustomBuildExt(build_ext):
    def build_extensions(self):
        # Override the compiler executables. Importantly, this
        # removes the "default" compiler flags that would
        # otherwise get passed on to to the compiler, i.e.,
        # distutils.sysconfig.get_var("CFLAGS").
        self.compiler.set_executable("compiler_so", "gcc -mdll -O -Wall -DMS_WIN64")
        self.compiler.set_executable("compiler_cxx", "g++ -O -Wall -DMS_WIN64")
        self.compiler.set_executable("linker_so", "gcc -shared -static")
        self.compiler.dll_libraries = []
        build_ext.build_extensions(self)

if __name__ == '__main__':

    os.system('cls')

    kind = None
    for arg in sys.argv:
        if arg.strip() in ['gcc8', 'gcc11', 'gcc12', 'gcc13', 'msvc']:
            kind = arg
            sys.argv.remove(arg)
            break

    base_file = os.path.join(os.getcwd(), MODULE[0:-3])

    source = base_file + '.pyx'
    target = base_file + '_%s.pyx'%kind
    shutil.copyfile(source, target)

    if kind == 'msvc':
        extra_compile_args = MSVC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = MSVC_EXTRA_LINK_ARGS[:] + ['/MANIFEST']
    else:
        extra_compile_args = GCC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = GCC_EXTRA_LINK_ARGS[:]

    path = setenv(extra_compile_args, kind=kind)
    remove_builds(kind)

    define_macros = [('WIN32', 1)]

    nname = MODULE%kind

    include_dirs = [np.get_include()]
    if kind == 'msvc':
        include_dirs += [r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt',
                         r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include',
                         r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared']

    extensions = [
        Extension(nname, [nname + '.pyx'],
                  extra_compile_args=extra_compile_args,
                  extra_link_args=extra_link_args,
                  include_dirs=include_dirs,
                  define_macros=define_macros)]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions,
                                             compiler_directives={'embedsignature'  : False,
                                                                  'boundscheck'     : False,
                                                                  'wraparound'      : False,
                                                                  'initializedcheck': False,
                                                                  'cdivision'       : True,
                                                                  'language_level'  : '3str',
                                                                  'nonecheck'       : False},
                                             force=True,
                                             cache=False,
                                             quiet=False)}

    if kind != 'msvc':
        setup_kwargs['cmdclass'] = {'build_ext': CustomBuildExt}

    setup(**setup_kwargs)

    setenv([], False, path)
    remove_builds(kind)

测试代码:

import os
import numpy
import time

import loop_cython_msvc as msvc
import loop_cython_gcc8 as gcc8
import loop_cython_gcc11 as gcc11
import loop_cython_gcc12 as gcc12
import loop_cython_gcc13 as gcc13

              # M     N     NNZ(matrix) NNZ(priority) NNZ(ub)
DIMENSIONS = [(1661 , 2608 , 3560      , 375         , 2488 ),
              (2828 , 3512 , 4333      , 413         , 2973 ),
              (780  , 985  , 646       , 23          , 984  ),
              (799  , 1558 , 1883      , 301         , 1116 ),
              (399  , 540  , 388       , 44          , 517  ),
              (10545, 10486, 14799     , 1053        , 10041),
              (3369 , 3684 , 3684      , 256         , 3242 ),
              (2052 , 5513 , 4772      , 1269        , 3319 ),
              (224  , 628  , 1345      , 396         , 594  ),
              (553  , 1475 , 1315      , 231         , 705  )]

def RunTest():

    print('M      N      NNZ     MSVC     GCC 8    GCC 11   GCC 12   GCC 13')

    for m, n, nnz_mat, nnz_priority, nnz_ub in DIMENSIONS:
        print('%-6d %-6d %-8d'%(m, n, nnz_mat), end='')

        for solver, label in zip([msvc, gcc8, gcc11, gcc12, gcc13], ['MSVC', 'GCC 8', 'GCC 11', 'GCC 12', 'GCC 13']):

            numpy.random.seed(123456)
            size = m*n
            idxes = numpy.arange(size)
            matrix = numpy.zeros((size, ), dtype=numpy.float32)
            idx_mat = numpy.random.choice(idxes, nnz_mat)
            matrix[idx_mat] = numpy.random.uniform(0, 1000, size=(nnz_mat, ))

            matrix = numpy.asfortranarray(matrix.reshape((m, n)))

            idxes = numpy.arange(m)
            priority = numpy.zeros((m, ), dtype=numpy.float32)
            idx_pri = numpy.random.choice(idxes, nnz_priority)
            priority[idx_pri] = numpy.random.uniform(0, 1000, size=(nnz_priority, ))

            idxes = numpy.arange(n)
            ub_values = numpy.inf*numpy.ones((n, ), dtype=numpy.float32)
            idx_ub = numpy.random.choice(idxes, nnz_ub)
            ub_values[idx_ub] = numpy.random.uniform(0, 1000, size=(nnz_ub, ))

            solver = solver.CythonLoops()
            time_tot = []

            for i in range(20):

                start = time.perf_counter()
                solver.run(matrix, ub_values, priority)
                elapsed = time.perf_counter() - start
                time_tot.append(elapsed*1e3)

            print('%-8.4g'%numpy.mean(time_tot), end=' ')

        print()

if __name__ == '__main__':

    os.system('cls')
    RunTest()

编辑

在@PeterCordes评论之后,我将优化标志更改为:

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL', '/QIntel-jcc-erratum']
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables', '-Wa,-mbranches-within-32B-boundaries']

MSVC看起来比以前稍微快一些(在5%和10%之间),但GCC 12和GCC 13比以前慢(在3%和20%之间)。下面是最大2D矩阵的结果图:

注意:“当前”表示使用@PeterCordes建议的最新优化标志,“上一个”为原始标志集。


4c8rllxm

4c8rllxm1#

这是一个部分答案,提供了由Cython生成的C代码,一旦它被简化了一点,变得更短,更容易阅读,并且易于编译,而没有任何Cython,Python或NumPy依赖关系(转换预计不会严重影响时间)。它还显示了生成的汇编代码和一些注解。
下面是C代码:

#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <float.h>

/* Remove external dependencies */
typedef int64_t Py_ssize_t;
typedef void memoryview_obj;
typedef void* VTable;

typedef struct {
  struct memoryview_obj *memview;
  char *data;
  Py_ssize_t shape[8];
  Py_ssize_t strides[8];
  Py_ssize_t suboffsets[8];
} memviewslice;

struct Self
{
    /* PyObject_HEAD */
    struct VTable* tab;
    int m;
    int n;
    memviewslice col_starts;
    memviewslice row_indices;
    memviewslice x;
};

static float test_SMALL = 1e-10;

extern void INC_MEMVIEW(memviewslice* slice, int count);
extern void XDEC_MEMVIEW(memviewslice* slice, int count);

void run(struct Self* self, memviewslice matrix, memviewslice ub_values, memviewslice priority)
{
    Py_ssize_t i;
    Py_ssize_t j;
    Py_ssize_t m;
    Py_ssize_t n;
    int nza;
    int collen;
    double too_large;
    double ok;
    double obj;
    float ub;
    float element;
    memviewslice col_starts = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice row_indices = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice x = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_2 = { 0, 0, { 0 }, { 0 }, { 0 } };
    Py_ssize_t t_3;
    Py_ssize_t t_4;
    Py_ssize_t t_6;
    Py_ssize_t t_7;
    int t_9;
    Py_ssize_t t_10;
    Py_ssize_t t_11;

    t_1 = self->col_starts;
    INC_MEMVIEW(&t_1, 1);
    col_starts = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_1 = self->row_indices;
    INC_MEMVIEW(&t_1, 1);
    row_indices = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_2 = self->x;
    INC_MEMVIEW(&t_2, 1);
    x = t_2;
    t_2.memview = NULL;
    t_2.data = NULL;

    m = matrix.shape[0];
    n = matrix.shape[1];
    self->m = m;
    self->n = n;
    nza = 0;
    collen = 0;

    t_3 = n;
    t_4 = t_3;
    for (i = 0; i < t_4; i++)
    {
        t_6 = m + 1;
        t_7 = t_6;

        for (j = 0; j < t_7; j++)
        {
            t_9 = (j == 0) != 0;

            if (t_9)
            {
                t_10 = i;
                element = (*((float*) ( /* dim=0 */ (priority.data + t_10 * priority.strides[0]) )));

                goto L7;
            }

            {
                t_10 = j - 1;
                t_11 = i;
                element = (*((float*) ( /* dim=1 */ (( /* dim=0 */ ((char *) (((float*) matrix.data) + t_10)) ) + t_11 * matrix.strides[1]) )));
            }

            L7:;

            t_9 = (fabsf(element) < test_SMALL) != 0;
            if (t_9)
            {
                goto L5_continue;
            }

            t_9 = ((j == 0) != 0);
            if (t_9)
            {

                obj = (double)element;

                goto L9;
            }

            {
                collen = nza + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (col_starts.data + t_11 * col_starts.strides[0]) )) = i + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (row_indices.data + t_11 * row_indices.strides[0]) )) = j;

                t_11 = collen;
                *((double *) ( /* dim=0 */ (x.data + t_11 * x.strides[0]) )) = ((double)element);

                nza = nza + 1;
            }

            L9:;
            L5_continue:;
        }

        t_11 = i;
        ub = (*((float*) ( /* dim=0 */ (ub_values.data + t_11 * ub_values.strides[0]) )));

        t_9 = (ub > FLT_MAX) != 0;
        if (t_9)
        {
            too_large = 0.0;
            goto L10;
        }

        t_9 = (ub > test_SMALL) != 0;
        if (t_9)
        {
            ok = (double)ub;
        }

        L10:;
    }

    XDEC_MEMVIEW(&col_starts, 1);
    XDEC_MEMVIEW(&row_indices, 1);
    XDEC_MEMVIEW(&x, 1);
}

字符串
下面是使用GCC 11.4(带有编译器选项-Ofast -funroll-loops -ftree-vectorize -march=skylake -fno-asynchronous-unwind-tables)在汇编中编译的主循环:

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm4, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm5, xmm4, xmm1
        vcomiss xmm9, xmm5
        ja      .L25
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm6, xmm4, xmm4
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm6
.L25:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L26
        inc     rax
        vmovss  xmm7, DWORD PTR [rcx-4+rax*4]
        vandps  xmm2, xmm7, xmm1
        vcomiss xmm9, xmm2
        ja      .L26
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm3, xmm7, xmm7
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm3
.L26:
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm8, xmm6, xmm1
        vcomiss xmm9, xmm8
        jbe     .L36
        jmp     .L5
.L7:
        vmovss  xmm10, DWORD PTR [rcx-4+rax*4]
        vandps  xmm11, xmm10, xmm1
        vcomiss xmm9, xmm11
        ja      .L24
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm12, xmm10, xmm10
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm12
.L24:
        lea     r14, [rax+1]
        vmovss  xmm13, DWORD PTR [rcx-4+r14*4]
        vandps  xmm14, xmm13, xmm1
        vcomiss xmm9, xmm14
        ja      .L27
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm15, xmm13, xmm13
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm15
.L27:
        lea     r14, [rax+2]
        vmovss  xmm0, DWORD PTR [rcx-4+r14*4]
        vandps  xmm4, xmm0, xmm1
        vcomiss xmm9, xmm4
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm5, xmm0, xmm0
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm5
.L28:
        add     rax, 3
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm9, xmm7
        ja      .L5
.L36:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L7
        lea     rax, [rsi+1]
        add     rcx, QWORD PTR [rsp+8]
        cmp     r13, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:


GCC 12.2

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm15, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm4, xmm15, xmm1
        vcomiss xmm0, xmm4
        ja      .L27
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm5, xmm15, xmm15
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm5
.L27:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L28
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm0, xmm7
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L28:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm1
        vcomiss xmm0, xmm3
        jbe     .L39
        jmp     .L5
.L41:
        vmovss  xmm8, DWORD PTR [rcx-4+rax*4]
        vandps  xmm9, xmm8, xmm1
        vcomiss xmm0, xmm9
        ja      .L26
        inc     edx
        movsx   r15, edx
        mov     r14, r9
        mov     rdi, r11
        imul    r14, r15
        imul    rdi, r15
        imul    r15, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm10, xmm8, xmm8
        mov     DWORD PTR [r10+rdi], eax
        vmovsd  QWORD PTR [rbx+r15], xmm10
.L26:
        lea     r14, [rax+1]
        vmovss  xmm11, DWORD PTR [rcx-4+r14*4]
        vandps  xmm12, xmm11, xmm1
        vcomiss xmm0, xmm12
        ja      .L29
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm13, xmm11, xmm11
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm13
.L29:
        lea     r14, [rax+2]
        vmovss  xmm14, DWORD PTR [rcx-4+r14*4]
        vandps  xmm15, xmm14, xmm1
        vcomiss xmm0, xmm15
        ja      .L30
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm4, xmm14, xmm14
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm4
.L30:
        add     rax, 3
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm1
        vcomiss xmm0, xmm6
        ja      .L5
.L39:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm7, xmm5, xmm5
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm7
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L41
        mov     rdi, QWORD PTR [rsp+8]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     rsi, r13
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:


以下是GCC 13.1的情况(GCC 13.2Godbolt上不可用):

.L4:
        mov     QWORD PTR [rsp+16], r14
        xor     eax, eax
.L20:
        mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6
        inc     rax
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm4, xmm15, xmm15
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r14], xmm4
        cmp     r8, 1
        je      .L6
        cmp     r8, 2
        je      .L21
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        inc     edx
        mov     r14, r10
        vcvtss2sd       xmm7, xmm5, xmm5
        movsx   r8, edx
        imul    r14, r8
        mov     DWORD PTR [rdi+r14], esi
        mov     rdi, rbx
        imul    rdi, r8
        imul    r8, r13
        mov     DWORD PTR [r11+rdi], eax
        vmovsd  QWORD PTR [r12+r8], xmm7
.L21:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm2
        vcomiss xmm1, xmm3
        ja      .L20
        inc     edx
        mov     r14, r10
        mov     rdi, QWORD PTR [rsp+24]
        movsx   r8, edx
        imul    r14, r8
.L32:
        mov     r15, rbx
        mov     DWORD PTR [rdi+r14], esi
        vcvtss2sd       xmm8, xmm5, xmm5
        imul    r15, r8
        imul    r8, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r8], xmm8
.L6:
        lea     r8, [rax+1]
        mov     rax, r8
        cmp     r9, r8
        je      .L7
        vmovss  xmm9, DWORD PTR [rcx-4+r8*4]
        vandps  xmm10, xmm9, xmm2
        vcomiss xmm1, xmm10
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm11, xmm9, xmm9
        movsx   r14, edx
        mov     DWORD PTR [rsp+4], edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        inc     rax
        vmovss  xmm12, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm11
        vandps  xmm13, xmm12, xmm2
        vcomiss xmm1, xmm13
        ja      .L20
        inc     edx
        mov     r15, r10
        vcvtss2sd       xmm14, xmm12, xmm12
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+2]
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm14
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r15, r10
        vcvtss2sd       xmm4, xmm15, xmm15
        add     edx, 2
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+3]
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm4
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r14, r10
        add     edx, 3
        movsx   r8, edx
        imul    r14, r8
        jmp     .L32
.L7:
        mov     rdi, QWORD PTR [rsp+8]
        mov     r14, QWORD PTR [rsp+16]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     r14, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

讨论

首先,我们可以看到主循环中充满了与goto混合的条件分支。这样的代码通常不是由人类编写的。编译器可以预期它们是罕见的,因此优化它们的效率低于没有goto的Assert(因为Assert更容易计算)。问题是j == 0条件在这里并不真正有用:在基于j的循环之前,它只能被执行一次。GCC似乎没有看到这一点(特别是由于复杂的控制流,由Cython goto指令放大)。编译器不知道fabsf(element) < SMALL是否经常为真或经常为假,甚至在运行时很难预测。如果循环很容易预测(前两种情况),那么一个好的编译器可以写出比GCC更有效的代码(可能使用SIMD指令)。
实际上,由于复杂的控制流,我甚至不确定-funroll-loops在这里是否有用。这使得热循环显著更大(即,高速缓存中的更多空间),并且处理器可能花费更多时间来预测采用/未采用分支的概率。这取决于目标处理器。拥有更小的代码也有利于分析和优化代码:更简单的代码更容易优化。
为了优化代码,编译器对循环的成本和要采用的分支的概率做出假设。例如,可以假设内部循环比外部循环昂贵得多。启发式被用来优化代码,并且它们的参数被仔细地调整,以便最大化基准测试套件的性能(并避免代码中的回归,如GCC的Linux内核)。因此,一个副作用是,您的代码可能会更慢,而大多数代码使用较新版本的GCC或大多数编译器可以更快。这也是给定编译器的性能从一个版本到另一个版本不是非常稳定的一个原因。例如,如果一个矩阵有很多零,那么fabsf(element) < SMALL通常为真,导致外循环比预期的更昂贵。问题是外部循环在GCC 13.2中的优化可能不如GCC 12.2,导致GCC 13.2的执行时间较慢。例如,GCC 13.2使用以下命令启动外部循环:

mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6


请注意,现代处理器(如Skylake)可以在每个周期并行执行多条指令。由于r8寄存器上的依赖关系链,此指令序列只能按顺序执行。与此同时,GCC 12.2在外部循环的开始生成以下内容:

lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5


在这种情况下,依赖链(在rdi/edi上)显然要小得多。在实践中,汇编代码相当大,所以我不认为这是唯一的问题。需要跟踪所有GCC版本中输入矩阵的实际路径(非常繁琐),或者只是运行代码并使用VTune等工具对其进行分析(本例中推荐的解决方案)。
我们可以看到有很多imul指令。这样的指令通常是由于在编译时不知道的数组的步幅。GCC不会为步长为1的情况生成代码。如果可能的话,你真的应该在内存视图中使用::1提到这个信息(正如ead在评论中提到的),因为imul指令在这样的热循环中往往有点昂贵(更不用说它们还可以增加代码的大小,由于部分展开,代码已经很大了)。

相关问题