I have the following code to test the speedup from the std::execution library on Windows 10:
#include <stddef.h>
#include <stdio.h>

#include <algorithm>
#include <chrono>
#include <execution>
#include <random>
#include <ratio>
#include <vector>

using std::milli;
using std::random_device;
using std::sort;
using std::vector;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;

const size_t testSize = 1'000'000;
const int iterationCount = 5;

void print_results(  //
    const char* const tag,  //
    const vector<double>& sorted,  //
    high_resolution_clock::time_point startTime,  //
    high_resolution_clock::time_point endTime
    //
)
{
    printf("%s: Lowest: %g Highest: %g Time: %f ms\n", tag, sorted.front(), sorted.back(),
           duration_cast<duration<double, milli>>(endTime - startTime).count());
}

int main()
{
    random_device rd;
    printf("Testing with %llu doubles...\n", testSize);
    vector<double> doubles(testSize);
    for (auto& d : doubles)
    {
        d = static_cast<double>(rd());
    }
    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        const auto startTime = high_resolution_clock::now();
        sort(sorted.begin(), sorted.end());
        const auto endTime = high_resolution_clock::now();
        print_results("Serial STL", sorted, startTime, endTime);
    }
    for (size_t i = 0; i < iterationCount; ++i)
    {
        vector<double> sorted(doubles);
        const auto startTime = high_resolution_clock::now();
        std::sort(std::execution::par, sorted.begin(), sorted.end());
        const auto endTime = high_resolution_clock::now();
        print_results("Parallel STL", sorted, startTime, endTime);
    }
    return 0;
}
I compiled this code using CMake, with Ninja and MSVC as the generators.
Here is the CMakeLists.txt:
cmake_minimum_required(VERSION 3.14.0)
project(EXEC VERSION 0.0.1)

set(CMAKE_C_STANDARD 17)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(
    executionTests
    targets/executionTests.cpp
)

if(CMAKE_CXX_COMPILER_ID MATCHES "GNU")
    target_compile_options(
        executionTests
        PRIVATE
        -O3
    )
elseif(CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
    STRING(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")
    target_compile_options(
        executionTests
        PRIVATE
        /O2
    )
endif()
And the configure/build script:
# Set-Location build ; cmake .. -DCMAKE_BUILD_TYPE=Debug -G Ninja ; Set-Location ..
Set-Location build ; cmake .. -DCMAKE_BUILD_TYPE=Debug -G "Visual Studio 17 2022" ; Set-Location ..
cmake --build build --target executionTests -j 8 -v
Running the executable built with the Ninja generator (GCC 13.1.0 compiler) gives the following results:
Testing with 1000000 doubles...
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 75.064000 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 78.308300 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.079100 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.511300 ms
Serial STL: Lowest: 9059 Highest: 4.29496e+09 Time: 76.836500 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.417900 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 77.452600 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 78.962000 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 80.188500 ms
Parallel STL: Lowest: 9059 Highest: 4.29496e+09 Time: 79.135000 ms
But! The executable built with "Visual Studio 17 2022" gives these results:
Testing with 1000000 doubles...
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 256.872900 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 264.764000 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 262.767800 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 264.283300 ms
Serial STL: Lowest: 5059 Highest: 4.29497e+09 Time: 259.603600 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 86.583400 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 81.407500 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 81.962600 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 88.384000 ms
Parallel STL: Lowest: 5059 Highest: 4.29497e+09 Time: 84.420800 ms
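At this point I expected to see a speed difference between std::execution::par and plain sort when compiling with GCC, but I only see a difference with the MSVC compiler. Why? By the way, if I change std::execution::par to std::execution::seq, nothing changes.
Here are the verbose compile and link commands from the Ninja build generator: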
[1/2] L:\UCRT_GCC-13-1-0_x64\mingw64\bin\c++.exe -g -O3 -std=gnu++20 -MD -MT CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -MF CMakeFiles\executionTests.dir\targets\executionTests.cpp.obj.d -o CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -c ${WorkspaceFolder}/targets/executionTests.cpp
[2/2] cmd.exe /C "cd . && L:\UCRT_GCC-13-1-0_x64\mingw64\bin\c++.exe -g CMakeFiles/executionTests.dir/targets/executionTests.cpp.obj -o ..\${OutputDir}\executionTests.exe -Wl,--out-implib,..\${OutputDir}\libexecutionTests.dll.a -Wl,--major-image-version,0,--minor-image-version,0 -lkernel32 -luser32 -lgdi32 -lwinspool -lshell32 -lole32 -loleaut32 -luuid -lcomdlg32 -ladvapi32 && cd ."
Here are the verbose compile and link commands from the "Visual Studio 17 2022" build generator:
ClCompile:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\CL.exe /c /Zi /nologo /W3 /WX- /diagnostics:column /O2 /Ob0 /D _MBCS /D WIN32 /D _WINDOWS /D "CMAKE_INTDIR=\"Debug\"" /Gm- /EHsc /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /GR /std:c++20 /Fo"executionTests.dir\Debug\\" /Fd"executionTests.dir\Debug\vc143.pdb" /external:W3 /Gd /TP /errorReport:queue ${WorkspaceFolder}\targets\executionTests.cpp
executionTests.cpp
Link:
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\link.exe /ERRORREPORT:QUEUE /OUT:"${OutputDir}\Debug\executionTests.exe" /INCREMENTAL /ILK:"executionTests.dir\Debug\executionTests.ilk" /NOLOGO kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"${OutputDir}/Debug/executionTests.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"${OutputDir}/Debug/executionTests.lib" /MACHINE:X64 /machine:x64
executionTests.dir\Debug\executionTests.obj
executionTests.vcxproj -> ${OutputDir}\Debug\executionTests.exe
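I don't know what I'm missing.
Is the STL implementation in GCC 13.1.0 (and possibly earlier) able to reach this speed with the -O3 flag alone, without needing std::execution?
Or have I simply not set the flags needed to see std::execution improve performance, meaning a time below 75-80 ms with std::execution?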
1 Answer
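The problem seems to be solved.
Ted Lyngmo mentioned a very important fact:
...when you include <execution>, it checks whether it can find the tbb headers. If they are available, it will include them and use tbb as the backend. If it can't find a usable backend, it will fall back to std::execution::seq
...It was a surprise to me that even if I don't explicitly use TBB headers in my code, I still need to include and link TBB...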
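The header lookup described in that comment can be probed from user code. The following is only a minimal sketch, not part of the original post: the names TBB_HEADERS_VISIBLE, tbb_headers_visible and report_tbb_headers are hypothetical, and the sketch only checks whether <tbb/tbb.h> is on the include path. The standard library may apply further conditions (for example TBB version checks) before it picks the TBB backend, so a "yes" here is a hint, not a guarantee.

```cpp
#include <cstdio>

// Hypothetical probe (not from the original post): records whether the
// compiler can find <tbb/tbb.h> on its include path. __has_include is a
// standard C++17 preprocessor feature.
#if defined(__has_include)
#  if __has_include(<tbb/tbb.h>)
#    define TBB_HEADERS_VISIBLE 1
#  endif
#endif
#ifndef TBB_HEADERS_VISIBLE
#  define TBB_HEADERS_VISIBLE 0
#endif

// True when the TBB headers were found at compile time.
constexpr bool tbb_headers_visible()
{
    return TBB_HEADERS_VISIBLE != 0;
}

// Prints a one-line report; call it at the start of the benchmark.
void report_tbb_headers()
{
    std::printf("TBB headers visible to the compiler: %s\n",
                tbb_headers_visible() ? "yes" : "no");
}
```

Calling report_tbb_headers() at the top of main() in the benchmark would show "no" for a GCC build without TBB on the include path, which is consistent with the parallel and serial timings being the same.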
So I had to fix CMakeLists.txt accordingly to include and link the TBB headers and library:
[CMakeLists.txt changes not preserved in the source]
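It is worth mentioning that MSVC needs to see the configuration type in the build command, not only in the configure command, so the following also had to be added:
[build command change not preserved in the source]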
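One way to confirm which configuration a binary was really built with is to print it from the program itself. This is a rough sketch under stated assumptions, not the author's code: build_flavor is a hypothetical helper, it relies on NDEBUG being part of CMake's default Release flags, and on MSVC defining _DEBUG when a debug runtime such as /MDd is selected (the /MDd and /Ob0 switches in the verbose output above point to a Debug configuration).

```cpp
#include <cstdio>

// Hypothetical helper (not from the original post): describes the build
// flavor using two common macros.
//   _DEBUG : defined by MSVC when a debug runtime (/MDd, /MTd) is selected.
//   NDEBUG : assumed to come from CMake's default Release flags.
const char* build_flavor()
{
#if defined(_DEBUG)
    return "debug runtime (_DEBUG defined)";
#elif defined(NDEBUG)
    return "release-like (NDEBUG defined)";
#else
    return "unspecified (neither _DEBUG nor NDEBUG defined)";
#endif
}

// Prints the detected flavor so it appears next to the timings.
void report_build_flavor()
{
    std::printf("Build flavor: %s\n", build_flavor());
}
```

Printing this before the timing loops makes it harder to compare a Debug MSVC binary against an optimized GCC one by accident.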
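Useful code changes made:
C++ headers instead of C headers (<stddef.h>, <stdio.h>), plus other useful changes:
[updated source code not preserved in the source]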
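Now the results for GCC and MSVC differ considerably.
Debug:
[output not preserved in the source]
Release:
[output not preserved in the source]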
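Something is off with GCC again :) Perhaps I need to link a different, debug build of the TBB library.
But the overall problem is now solved. Many thanks to Ted Lyngmo.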