Paddle raises "Cublas error, CUBLAS_STATUS_EXECUTION_FAILED" when running exe.run

zzwlnbp8 posted on 2023-02-04 in Other

paddlepaddle-gpu 1.8.3.post107 (the same error also occurs with paddlepaddle-gpu 1.5.0, 1.5.1, and 1.8.0)
CUDA 10.1, cuDNN 7.6
Python 3.7
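
As a first sanity check of this environment, a minimal sketch (assuming the fluid 1.x install-check API) that verifies the paddlepaddle-gpu build can actually use the GPU:

import paddle.fluid as fluid

# run_check builds and runs a tiny program and reports whether the
# CUDA/cuDNN setup is usable for this paddlepaddle-gpu install.
fluid.install_check.run_check()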

Reproduction: https://github.com/PaddlePaddle/Research/tree/master/KG/ACL2021_GRAN
Problem: the following error is raised when execution reaches exe.run:
Traceback (most recent call last):
File "./src/run.py", line 472, in
main(args)
File "./src/run.py", line 365, in main
outputs = train_exe.run(fetch_list=fetch_list)
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
return_numpy=return_numpy)
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
return_merged=return_merged)
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 void paddle::operators::math::Blas<paddle::platform::CUDADeviceContext>::MatMul(paddle::framework::Tensor const&, paddle::operators::math::MatDescriptor const&, paddle::framework::Tensor const&, paddle::operators::math::MatDescriptor const&, float, paddle::framework::Tensor*, float) const
3 paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
4 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&) #1 }>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
5 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
6 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
8 paddle::framework::details::ComputationOpHandle::RunImpl()
9 paddle::framework::details::ThreadedSSAGraphExecutor::RunOpSync(paddle::framework::details::OpHandleBase*)
10 paddle::framework::details::ThreadedSSAGraphExecutor::RunTracedOps(std::vector<paddle::framework::details::OpHandleBase*, std::allocator<paddle::framework::details::OpHandleBase*> > const&)
11 paddle::framework::details::ThreadedSSAGraphExecutor::RunImpl(std::vector<std::string, std::allocator<std::string> > const&, bool)
12 paddle::framework::details::ThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
13 paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)
14 paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, bool)

Python Call Stacks (More useful to users):

File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
attrs=kwargs.get("attrs", None))
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/data/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 6416, in matmul
attrs=attrs)
File "/data/GRAN/src/model/gran_model.py", line 127, in _build_model
x=input_mask, y=input_mask, transpose_y=True)
File "/data/GRAN/src/model/gran_model.py", line 72, in init
self._build_model(input_ids, input_mask, edge_labels)
File "./src/run.py", line 137, in create_model
use_fp16=args.use_fp16)
File "./src/run.py", line 266, in main
pyreader_name='train_reader', config=config)
File "./src/run.py", line 472, in
main(args)

Error Message Summary:

ExternalError: Cublas error, CUBLAS_STATUS_EXECUTION_FAILED at (/paddle/paddle/fluid/operators/math/blas_impl.cu.h:61)
[operator < matmul > error]
terminate called without an active exception
W0304 21:29:48.526862 209578 init.cc:216] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0304 21:29:48.526906 209578 init.cc:218] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0304 21:29:48.526917 209578 init.cc:221] The detail failure signal is:

W0304 21:29:48.526929 209578 init.cc:224] *** Aborted at 1646400588 (unix time) try "date -d @1646400588" if you are using GNU date ***
W0304 21:29:48.531188 209578 init.cc:224] PC: @ 0x0 (unknown)
W0304 21:29:48.531322 209578 init.cc:224] *** SIGABRT (@0x3fb000331ca) received by PID 209354 (TID 0x7f0bcb241700) from PID 209354; stack trace: ***
W0304 21:29:48.534565 209578 init.cc:224] @ 0x7f0bf1f44390 (unknown)
W0304 21:29:48.535984 209578 init.cc:224] @ 0x7f0bf1b9e438 gsignal
W0304 21:29:48.537391 209578 init.cc:224] @ 0x7f0bf1ba003a abort
W0304 21:29:48.538328 209578 init.cc:224] @ 0x7f0b27f1c872 __gnu_cxx::__verbose_terminate_handler()
W0304 21:29:48.539134 209578 init.cc:224] @ 0x7f0b27f1af6f __cxxabiv1::__terminate()
W0304 21:29:48.540019 209578 init.cc:224] @ 0x7f0b27f1afb1 std::terminate()
W0304 21:29:48.540805 209578 init.cc:224] @ 0x7f0b27f1ac82 __gxx_personality_v0
W0304 21:29:48.541549 209578 init.cc:224] @ 0x7f0b4ba8cbc6 _Unwind_ForcedUnwind_Phase2
W0304 21:29:48.542297 209578 init.cc:224] @ 0x7f0b4ba8ceac _Unwind_ForcedUnwind
W0304 21:29:48.543699 209578 init.cc:224] @ 0x7f0bf1f43070 __GI___pthread_unwind
W0304 21:29:48.545024 209578 init.cc:224] @ 0x7f0bf1f3b845 __pthread_exit
W0304 21:29:48.545274 209578 init.cc:224] @ 0x561a3f47db09 PyThread_exit_thread
W0304 21:29:48.545343 209578 init.cc:224] @ 0x561a3f303e3e PyEval_RestoreThread.cold.742
W0304 21:29:48.546340 209578 init.cc:224] @ 0x7f0a724d1b19 pybind11::gil_scoped_release::~gil_scoped_release()
W0304 21:29:48.546478 209578 init.cc:224] @ 0x7f0a725b9eb6 ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind10BindReaderEPNS_6moduleEEUlRNS2_9operators6reader22LoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE1_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingENS_10call_guardIINS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES11
W0304 21:29:48.547439 209578 init.cc:224] @ 0x7f0a724ef329 pybind11::cpp_function::dispatcher()
W0304 21:29:48.547737 209578 init.cc:224] @ 0x561a3f3e7ac4 _PyMethodDef_RawFastCallKeywords
W0304 21:29:48.547976 209578 init.cc:224] @ 0x561a3f41d861 _PyObject_FastCallKeywords
W0304 21:29:48.548120 209578 init.cc:224] @ 0x561a3f41e2d1 call_function
W0304 21:29:48.548375 209578 init.cc:224] @ 0x561a3f465602 _PyEval_EvalFrameDefault
W0304 21:29:48.548611 209578 init.cc:224] @ 0x561a3f3b759c _PyEval_EvalCodeWithName
W0304 21:29:48.548846 209578 init.cc:224] @ 0x561a3f3d6206 _PyFunction_FastCallDict
W0304 21:29:48.549104 209578 init.cc:224] @ 0x561a3f462a6d _PyEval_EvalFrameDefault
W0304 21:29:48.549325 209578 init.cc:224] @ 0x561a3f3d6d17 _PyFunction_FastCallKeywords
W0304 21:29:48.549470 209578 init.cc:224] @ 0x561a3f41e0c5 call_function
W0304 21:29:48.549722 209578 init.cc:224] @ 0x561a3f461381 _PyEval_EvalFrameDefault
W0304 21:29:48.549947 209578 init.cc:224] @ 0x561a3f3d6d17 _PyFunction_FastCallKeywords
W0304 21:29:48.550091 209578 init.cc:224] @ 0x561a3f41e0c5 call_function
W0304 21:29:48.550343 209578 init.cc:224] @ 0x561a3f461381 _PyEval_EvalFrameDefault
W0304 21:29:48.550572 209578 init.cc:224] @ 0x561a3f3b80a6 _PyObject_FastCallDict
W0304 21:29:48.550659 209578 init.cc:224] @ 0x561a3f3cd041 method_call
W0304 21:29:48.550915 209578 init.cc:224] @ 0x561a3f3b87b6 PyObject_Call

az31mfrm 1#

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to demonstrate your request. You may also check the API docs, FAQ, GitHub Issues, and the AI community to find an answer. Have a nice day!

smtd7mpg 2#

For questions about a specific repo, please ask in that repo: https://github.com/PaddlePaddle/Research
Based on the repo's installation notes, we recommend the configuration below; your setup differs from it quite a bit.

This project should work fine with the following environments:

Python 2.7.15 for data preprocessing
Python 3.6.5 for training & evaluation with:
PaddlePaddle 1.5.0
numpy 1.16.3
GPU with CUDA 9.0, CuDNN v7, and NCCL 2.3.7
All the experiments are conducted on a single 16G V100 GPU.

The problem looks similar to PaddlePaddle/PGL#259, please take a look at that as well.
It is most likely an environment issue; using Docker is recommended.
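
To isolate whether the environment (CUDA/cuBLAS) is the culprit, here is a minimal sketch that runs a single matmul through the fluid executor on the GPU (fluid 1.8-style API; illustrative only, not taken from the GRAN code):

import numpy as np
import paddle.fluid as fluid

# Build a tiny program containing only one matmul op.
main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name="x", shape=[4, 8], dtype="float32")
    y = fluid.data(name="y", shape=[8, 4], dtype="float32")
    out = fluid.layers.matmul(x, y)

# Run it on GPU 0. If this also raises CUBLAS_STATUS_EXECUTION_FAILED,
# the problem is in the CUDA/cuBLAS environment rather than in the model code.
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(startup_prog)
result, = exe.run(
    main_prog,
    feed={
        "x": np.random.rand(4, 8).astype("float32"),
        "y": np.random.rand(8, 4).astype("float32"),
    },
    fetch_list=[out],
)
print("matmul on GPU OK, output shape:", result.shape)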
