CMake: Why don't I see a speed improvement when using std::execution with GCC?

Posted by ocebsuys on 2023-08-05 in Other

I have the following code to measure the speedup from the std::execution library on Windows 10:

#include <stddef.h>
#include <stdio.h>

#include <algorithm>
#include <chrono>
#include <execution>
#include <random>
#include <ratio>
#include <vector>

using std::milli;
using std::random_device;
using std::sort;
using std::vector;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;

const size_t testSize = 1'000'000;
const int iterationCount = 5;

void print_results(                                 //
    const char* const tag,                          //
    const vector<double>& sorted,                   //
    high_resolution_clock::time_point startTime,    //
    high_resolution_clock::time_point endTime
    //
)
{
    printf("%s: Lowest: %g Highest: %g Time: %f ms\n", tag, sorted.front(), sorted.back(),
           duration_cast<duration<double, milli>>(endTime - startTime).count());
}

int main()
{
    random_device rd;

    printf("Testing with %llu doubles...\n", testSize);
    vector<double> doubles(testSize);
    for (auto& d : doubles)
    {
        d = static_cast<double>(rd());
    }

    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        const auto startTime = high_resolution_clock::now();
        sort(sorted.begin(), sorted.end());
        const auto endTime = high_resolution_clock::now();
        print_results("Serial STL", sorted, startTime, endTime);
    }

    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        const auto startTime = high_resolution_clock::now();
        std::sort(std::execution::par, sorted.begin(), sorted.end());
        const auto endTime = high_resolution_clock::now();
        print_results("Parallel STL", sorted, startTime, endTime);
    }
    return 0;
}

I built this code with CMake, using both Ninja (GCC) and Visual Studio (MSVC) as generators.
Here is the CMakeLists.txt:

cmake_minimum_required(VERSION 3.14.0)
project(EXEC VERSION 0.0.1)

set(CMAKE_C_STANDARD 17)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(
    executionTests
    targets/executionTests.cpp
)

if(CMAKE_CXX_COMPILER_ID MATCHES "GNU")
    target_compile_options(
        executionTests
        PRIVATE
        -O3
    )
elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
    target_compile_options(
        executionTests
        PRIVATE
        /O2
    )
endif()


And the configure/build script:

# Set-Location build ; cmake .. -DCMAKE_BUILD_TYPE=Debug -G Ninja ; Set-Location ..
Set-Location build ; cmake .. -DCMAKE_BUILD_TYPE=Debug -G "Visual Studio 17 2022" ; Set-Location ..

cmake --build build --target executionTests -j 8 -v


Running the executable built with the Ninja generator (GCC 13.1.0 compiler) gives the following results:

Testing with 1000000 doubles...
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 75.064000 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 78.308300 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.079100 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.511300 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 76.836500 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.417900 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.452600 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 78.962000 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 80.188500 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 79.135000 ms


But the executable built with the "Visual Studio 17 2022" generator gives these results:

Testing with 1000000 doubles...
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 256.872900 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 264.764000 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 262.767800 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 264.283300 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 259.603600 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 86.583400 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 81.407500 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 81.962600 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 88.384000 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 84.420800 ms


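At this point I expected to see a speed difference between std::execution::par and the plain sort when compiling with GCC, but I only see a difference with the MSVC compiler. Why? By the way, if I change std::execution::par to std::execution::seq, nothing changes.
Here are the verbose compile and link commands from the Ninja generator: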

[1/2] L:\UCRT_GCC-13-1-0_x64\mingw64\bin\c++.exe   -g -O3 -std=gnu++20 -MD -MT CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -MF CMakeFiles\executionTests.dir\targets\executionTests.cpp.obj.d -o CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -c ${WorkspaceFolder}/targets/executionTests.cpp
[2/2] cmd.exe /C "cd . && L:\UCRT_GCC-13-1-0_x64\mingw64\bin\c++.exe -g  CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -o ..\${OutputDir}\executionTests.exe -Wl,--out-implib,..\${OutputDir}\libexecutionTests.dll.a -Wl,--major-image-version,0,--minor-image-version,0  -lkernel32 -luser32 -lgdi32 -lwinspool -lshell32 -lole32 -loleaut32 -luuid -lcomdlg32 -ladvapi32 && cd ."


Here are the verbose compile and link commands from the "Visual Studio 17 2022" generator:

ClCompile:
     C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\CL.exe /c /Zi /nologo /W3 /WX- /diagnostics:column /O2 /Ob0 /D _MBCS /D WIN32 /D _WINDOWS /D "CMAKE_INTDIR=\"Debug\"" /Gm- /EHsc /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inli
     ne /GR /std:c++20 /Fo"executionTests.dir\Debug\\" /Fd"executionTests.dir\Debug\vc143.pdb" /external:W3 /Gd /TP /errorReport:queue ${WorkspaceFolder}\targets\executionTests.cpp
     executionTests.cpp
   Link:
     C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\link.exe /ERRORREPORT:QUEUE /OUT:"${OutputDir}\Debug\executionTests.exe" /INCREMENTAL /ILK:"executionTests
     .dir\Debug\executionTests.ilk" /NOLOGO kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"${OutputDir}/Debug/executionTests.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"${OutputDir}/Debug/executionTests.lib" /MACHINE:X64  /machine:x64 
      executionTests.dir\Debug\executionTests.obj
     executionTests.vcxproj -> ${OutputDir}\Debug\executionTests.exe


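I don't know what I'm missing.
Is the STL implementation in GCC 13.1.0 (and possibly earlier) already this fast with the -O3 flag alone, so that std::execution isn't needed?
Or have I simply not set the flags needed to see std::execution improve performance, meaning that with std::execution the sort should take less than 75-80 ms?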

Answer #1 (eufgjt7s)

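The problem seems to be solved.
Ted Lyngmo pointed out a very important fact:
...when you include <execution>, it checks whether it can find the tbb headers. If they are available, it includes them and uses tbb as the backend. If it can't find a usable backend, it falls back to std::execution::seq...
It came as a surprise to me that even though I don't use the TBB headers explicitly in my code, I still need to make them available and link against TBB.
So I had to fix CMakeLists.txt accordingly, adding the TBB include directory and linking the TBB library.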

cmake_minimum_required(VERSION 3.14.0)
project(EXEC VERSION 0.0.1)

set(CMAKE_C_STANDARD 17)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# ! Set the TBB library path
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU")
    set(TBB_ROOT_LIB "L:/oneTBB/mingw64/mingw64/bin" CACHE PATH "Path to TBB") # ? Change this path according to your oneTBB location
    set(TBB_ROOT_INC "L:/oneTBB/mingw64/mingw64/include" CACHE PATH "Path to TBB") # ? Change this path according to your oneTBB location
elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
    set(TBB_ROOT_LIB "L:/oneapi-tbb-2021.9.0-win/oneapi-tbb-2021.9.0/lib/intel64/vc14" CACHE PATH "Path to TBB") # ? Change this path according to your oneTBB location
    set(TBB_ROOT_INC "L:/oneapi-tbb-2021.9.0-win/oneapi-tbb-2021.9.0/include" CACHE PATH "Path to TBB") # ? Change this path according to your oneTBB location
endif()

add_executable(
    executionTests
    targets/executionTests.cpp
)

# ! When you include `<execution>` GCC checks if it can find the tbb headers.
target_include_directories(
    executionTests
    PUBLIC ${TBB_ROOT_INC}
)

# ! GCC also needs to link the TBB library to use its features, so specify the path
target_link_directories(
    executionTests
    PRIVATE "${TBB_ROOT_LIB}"
)

# ! Libraries for MSVC and GCC are different
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU")
    # * I found only one version of the TBB DLL for mingw64,
    # * not sure if it's the Debug or Release version,
    # * but the results are very close to MSVC Release
    set(TBB_LIB "-llibtbb12")
elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
    if(${CMAKE_BUILD_TYPE} STREQUAL "Debug")
        set(TBB_LIB "tbb_debug.lib")
    else()
        set(TBB_LIB "tbb.lib")
    endif()
endif()

# ! Finally link library
target_link_libraries(
    executionTests
    PRIVATE ${TBB_LIB}
)

if(CMAKE_CXX_COMPILER_ID MATCHES "GNU")
    target_compile_options(
        executionTests
        PRIVATE
        -O3
    )
elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
    # ? I never mentioned why I have these lines:
    # ? /O2 (improve performance by reducing execution time and optimizing code size)
    # ? and /RTC1 (perform runtime checks) flags can't be combined
    # ? so this is a workaround to remove /RTC1 from resulted command line
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
    target_compile_options(
        executionTests
        PRIVATE
        /O2
    )
endif()

Note that with MSVC the configuration type must also be passed at build time, not only at configure time, so the build command becomes:

cmake --build build --config Debug --target executionTests -j 8 -v
# cmake --build build --config Release --target executionTests -j 8 -v


Useful code changes made:
C++ headers instead of C headers (<stddef.h>, <stdio.h>), plus a few other helpful changes:

#include <cstddef>
#include <cstdio>

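#include <ranges> // ! range concepts; std::ranges::generate itself is declared in <algorithm>

using std::chrono::steady_clock; // ! replaces high_resolution_clock (see the comment in main)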

// * I came up with this function template to make std::mt19937 easy to use
// * and to generate doubles in a functor-like way
template <typename _Type, _Type _left, _Type _right>
_Type generateRandomNumber()
{
    /*
    ! Your program looks portable as-is so it's just a matter of making
    ! the implementation use a backend for the execution
    ! policies. I would however not use `random_device` as a source since
    ! that will produce different results every time. Use
    ! a deterministic source so that you get the same content of the containers every time.
    ! `std::mt19937 rd;` would be better for that.
    */
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<_Type> dist(_left, _right);    //* uniform random number in [_left, _right)
    return dist(gen);
}

int main()
{  
    // ! %llu is the wrong printf conversion specifier for size_t. Use %zu.
    printf("Testing with %zu doubles...\n", testSize);
    vector<double> doubles(testSize);
    // * my idea how we can generate random doubles
    std::ranges::generate(doubles, generateRandomNumber<double, -100.0, 100.0>);

    // time how long it takes to sort them:
    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        /*
        ! Never use high_resolution_clock.
        ! The specification for that clock was messed up and
        ! even the author of that clock says to not use it.
        ! Use std::chrono::steady_clock instead.
        */
        const auto startTime = steady_clock::now();
        sort(sorted.begin(), sorted.end());
        const auto endTime = steady_clock::now();
        print_results("Serial STL", sorted, startTime, endTime);
    }

    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        const auto startTime = steady_clock::now();
        // same sort call as above, but with the parallel policy std::execution::par:
        std::sort(std::execution::par, sorted.begin(), sorted.end());
        const auto endTime = steady_clock::now();
        // in our output, note that these are the parallel results:
        print_results("Parallel STL", sorted, startTime, endTime);
    }
}


Now the results differ considerably between GCC and MSVC.

Debug:

GCC:
Testing with 1000000 doubles...
Serial STL: Lowest: -99.9999 Highest: 100 Time: 86.243000 ms
Serial STL: Lowest: -99.9999 Highest: 100 Time: 83.652300 ms
Serial STL: Lowest: -99.9999 Highest: 100 Time: 85.125400 ms
Serial STL: Lowest: -99.9999 Highest: 100 Time: 86.877800 ms
Serial STL: Lowest: -99.9999 Highest: 100 Time: 85.984300 ms
Parallel STL: Lowest: -99.9999 Highest: 100 Time: 35.638500 ms
Parallel STL: Lowest: -99.9999 Highest: 100 Time: 30.529100 ms
Parallel STL: Lowest: -99.9999 Highest: 100 Time: 35.590700 ms
Parallel STL: Lowest: -99.9999 Highest: 100 Time: 29.676400 ms
Parallel STL: Lowest: -99.9999 Highest: 100 Time: 33.012500 ms

MSVC:
Testing with 1000000 doubles...
Serial STL: Lowest: -99.9999 Highest: 99.9999 Time: 277.651300 ms
Serial STL: Lowest: -99.9999 Highest: 99.9999 Time: 281.134900 ms
Serial STL: Lowest: -99.9999 Highest: 99.9999 Time: 278.242000 ms
Serial STL: Lowest: -99.9999 Highest: 99.9999 Time: 280.372500 ms
Serial STL: Lowest: -99.9999 Highest: 99.9999 Time: 275.779000 ms
Parallel STL: Lowest: -99.9999 Highest: 99.9999 Time: 98.904400 ms
Parallel STL: Lowest: -99.9999 Highest: 99.9999 Time: 94.853300 ms
Parallel STL: Lowest: -99.9999 Highest: 99.9999 Time: 100.861400 ms
Parallel STL: Lowest: -99.9999 Highest: 99.9999 Time: 92.364000 ms
Parallel STL: Lowest: -99.9999 Highest: 99.9999 Time: 102.100400 ms

Release:

GCC:
Testing with 1000000 doubles...
Serial STL: Lowest: -99.9998 Highest: 100 Time: 77.569600 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 83.123500 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 81.983300 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 82.967000 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 82.845600 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 34.475000 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 34.092200 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 30.292100 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 33.041200 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 30.095900 ms

MSVC:
Testing with 1000000 doubles...
Serial STL: Lowest: -99.9998 Highest: 100 Time: 95.452600 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 98.047800 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 97.359000 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 96.975000 ms
Serial STL: Lowest: -99.9998 Highest: 100 Time: 98.612100 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 35.154200 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 35.384300 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 35.499900 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 34.143500 ms
Parallel STL: Lowest: -99.9998 Highest: 100 Time: 33.570500 ms


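Something still looks off with GCC; I probably need to link a separate debug build of the TBB library.
Still, the overall problem is now solved. Many thanks to Ted Lyngmo.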
