Why does Onnxruntime run 2-3x slower in C++ than in Python?

4xrmg8kj posted on 2023-01-29 in Python

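I have code that runs 3 inference sessions one after the other. The problem is that it only reaches top performance on my Mac and in the Windows VM (VMWare) running on that Mac, where it takes between 58 and 68 seconds to run my test set.
When I ask someone else on Windows (with similar hardware: an Intel i7 with 6 to 8 cores) to test it, it takes 150 s. If the same person runs the inference with an equivalent Python script, it is 2-3x faster than that, on par with my original Mac machine.
I have no idea what else to try. Here is the relevant part of the code: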

#include "onnxruntime-osx-universal2-1.13.1/include/onnxruntime_cxx_api.h"
// ...
Ort::Env OrtEnv;
Ort::Session objectNet{OrtEnv, objectModelBuffer.constData(), (size_t) objectModelBuffer.size(), Ort::SessionOptions{}}; // x3, one for each model

std::vector<uint16_t> inputTensorValues;
normalize(img, {aiPanoWidth, aiPanoHeight}, inputTensorValues); // convert the cv::Mat img into std::vector<uint16_t>

std::array<int64_t, 4> input_shape_{ 1, 3, aiPanoHeight, aiPanoWidth };

auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor_ = Ort::Value::CreateTensor(allocator_info, inputTensorValues.data(), sizeof(uint16_t) * inputTensorValues.size(), input_shape_.data(), input_shape_.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16);

const char* input_names[] = { "images" };
const char* output_names[] = { "output" };
std::vector<Ort::Value> ort_outputs = objectNet.Run(Ort::RunOptions{ nullptr }, input_names, &input_tensor_, 1, output_names, 1);

//... after this I read the output, but the step above is already 2-3x slower on C++ than Python

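More details:

The code above runs in a background worker thread (required, since the GUI runs in the main thread).
I am using float16 to reduce the memory footprint of the AI models.
I am using the vanilla onnxruntime DLLs provided by Microsoft (v1.13.1).
I compiled my code with both MinGW GCC and VC++ 2022. The results are similar, with a slight advantage for VC++, and I believe that advantage comes from other parts of my code running faster, not necessarily the inference.
I do not want to run it on the GPU.
I am compiling with /arch:AVX /openmp -O2 and linking with -lonnxruntime.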
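One way to confirm that the `Run` call itself, rather than the surrounding pre- and post-processing, accounts for the gap is to time it in isolation. The following is a minimal sketch using only the standard library; `elapsedMs` is a hypothetical helper name, not something from the original code or from ONNX Runtime.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: invoke `fn` once and return the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Fn>
double elapsedMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping only the `objectNet.Run(...)` call in a lambda passed to `elapsedMs`, and summing the results over the test set, gives an inference-only figure that can be compared with the time spent in `normalize` and in reading the outputs.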

Source: https://stackoverflow.com/questions/75241204/why-onnxruntime-runs-2-3x-slower-in-c-than-python

1 Answer

0vvn1miw1#

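After almost a week of sleepless profiling, I was able to improve performance on Windows PCs significantly (by up to 2x) by tweaking the session's threading options.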

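#include <algorithm> // std::min
#include <thread>    // std::thread::hardware_concurrency
// ...
Ort::SessionOptions s;
s.SetInterOpNumThreads(1); // run graph nodes one at a time
s.SetIntraOpNumThreads(std::min(6, (int) std::thread::hardware_concurrency())); // at most 6 threads inside each operator
s.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

Ort::Session objectNet{OrtEnv, objectModelBuffer.constData(), (size_t) objectModelBuffer.size(), s};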

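I think what was happening is that OnnxRuntime was allocating too many threads, and the inter-thread communication and synchronization overhead became significant.
Since hard-coding values is not good practice, I take the number of CPUs from the threading library and let Onnxruntime use at most that many threads, capped at 6. I am afraid to go above 6 and get bad results again. On my Mac (6-core i7), performance is the same as before. In my Windows VM, it is 22% faster than before. On my friend's Windows PC (8-core i7), it is 2x faster than before.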
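The capping logic described above can be pulled into a small helper. This is a minimal sketch, not part of the ONNX Runtime API: `pickIntraOpThreads` and its `cap` parameter are hypothetical names. It also covers one case the inline expression does not: `std::thread::hardware_concurrency()` is allowed to return 0 when the core count cannot be determined, and passing 0 through would, as far as I understand ONNX Runtime's option semantics, request the default thread count again. Falling back to a single thread in that case is an assumption, chosen to be conservative.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper (not an ONNX Runtime function): choose an intra-op
// thread count from the detected core count, capped at `cap`.
// `detected` is the value returned by std::thread::hardware_concurrency(),
// which may be 0 if the implementation cannot determine it.
inline int pickIntraOpThreads(unsigned detected, int cap = 6) {
    assert(cap >= 1);
    if (detected == 0) {
        // Core count unknown: assume one thread instead of forwarding 0.
        return 1;
    }
    return static_cast<int>(std::min<unsigned>(detected, static_cast<unsigned>(cap)));
}

// Convenience overload that queries the current machine.
inline int pickIntraOpThreads() {
    return pickIntraOpThreads(std::thread::hardware_concurrency());
}
```

The result can then be passed as `s.SetIntraOpNumThreads(pickIntraOpThreads());` in place of the inline `std::min` expression.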
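I really wish OnnxRuntime did a better job of making use of the available resources.
One more thing I should mention: reverting the models from FP16 to FP32 also helped a little, especially on the Windows PC. On my Mac and in my Windows VM, the difference was negligible.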

Answered 2023-01-29
