Describe the Bug
Training PaddleOCR text recognition on the GPU under Jetpack fails with: warp-ctc [version 2] Error in get_workspace_size: execution failed.
Training runs normally on the CPU (Global.use_gpu=false), but fails on the GPU (Global.use_gpu=true).
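To narrow this down outside of PaddleOCR, the CTC loss can be called directly on each device. The following is only a minimal sketch, not part of the original report: the helper name `try_ctc_loss` and the tensor sizes are made up for illustration, and it assumes `paddle.nn.functional.ctc_loss` is what ends up in the failing `warpctc` call (as the traceback below suggests).

```python
def try_ctc_loss(device):
    """Run one CTC loss forward pass on `device`; return (ok, detail)."""
    try:
        import paddle
        import paddle.nn.functional as F

        paddle.set_device(device)
        # Illustrative sizes only: time steps, batch, classes (incl. blank), label length.
        T, B, C, L = 40, 4, 100, 10
        logits = paddle.randn([T, B, C], dtype="float32")
        labels = paddle.randint(1, C, [B, L], dtype="int32")  # 0 is reserved for blank
        input_lengths = paddle.full([B], T, dtype="int64")
        label_lengths = paddle.full([B], L, dtype="int64")
        loss = F.ctc_loss(logits, labels, input_lengths, label_lengths,
                          blank=0, reduction="mean")
        return True, "loss=" + str(loss.numpy())
    except Exception as exc:  # includes a missing paddle install or a warp-ctc failure
        return False, "{}: {}".format(type(exc).__name__, exc)


if __name__ == "__main__":
    for device in ("cpu", "gpu:0"):
        print(device, try_ctc_loss(device))
```

If the `cpu` call succeeds and the `gpu:0` call reports the same warp-ctc error, the problem is probably in the Paddle build for Jetson rather than in the PaddleOCR config.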
- System environment
- NVIDIA Jetson AGX Xavier [16GB]
* Jetpack 4.6 [L4T 32.6.1]
* NV Power Mode: MAXN - Type: 0
* jetson_stats.service: active
- Libraries:
* CUDA: 10.2.300
* cuDNN: 8.2.1.32
* TensorRT: 8.0.1.6
* Visionworks: 1.6.0.501
* OpenCV: 4.1.1 compiled CUDA: NO
* VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
* Vulkan: 1.2.70
- Version information:
- Paddle: paddlepaddle-gpu 2.3.2
PaddlePaddle 2.3.2, compiled with
with_avx: OFF
with_gpu: ON
with_mkl: OFF
with_mkldnn: OFF
with_python: ON
- PaddleOCR:2.6.1.0
- Command
python tools/train.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model=./pretrain_models/ch_PP-OCRv3_rec_train/best_accuracy
- Full error output
[2022/11/17 17:20:58] ppocr INFO: Backbone :
[2022/11/17 17:20:58] ppocr INFO: last_conv_stride : [1, 2]
[2022/11/17 17:20:58] ppocr INFO: last_pool_type : avg
[2022/11/17 17:20:58] ppocr INFO: name : MobileNetV1Enhance
[2022/11/17 17:20:58] ppocr INFO: scale : 0.5
[2022/11/17 17:20:58] ppocr INFO: Head :
[2022/11/17 17:20:58] ppocr INFO: head_list :
[2022/11/17 17:20:58] ppocr INFO: CTCHead :
[2022/11/17 17:20:58] ppocr INFO: Head :
[2022/11/17 17:20:58] ppocr INFO: fc_decay : 1e-05
[2022/11/17 17:20:58] ppocr INFO: Neck :
[2022/11/17 17:20:58] ppocr INFO: depth : 2
[2022/11/17 17:20:58] ppocr INFO: dims : 64
[2022/11/17 17:20:58] ppocr INFO: hidden_dims : 120
[2022/11/17 17:20:58] ppocr INFO: name : svtr
[2022/11/17 17:20:58] ppocr INFO: use_guide : True
[2022/11/17 17:20:58] ppocr INFO: SARHead :
[2022/11/17 17:20:58] ppocr INFO: enc_dim : 512
[2022/11/17 17:20:58] ppocr INFO: max_text_length : 25
[2022/11/17 17:20:58] ppocr INFO: name : MultiHead
[2022/11/17 17:20:58] ppocr INFO: Transform : None
[2022/11/17 17:20:58] ppocr INFO: algorithm : SVTR
[2022/11/17 17:20:58] ppocr INFO: model_type : rec
[2022/11/17 17:20:58] ppocr INFO: Eval :
[2022/11/17 17:20:58] ppocr INFO: dataset :
[2022/11/17 17:20:58] ppocr INFO: data_dir : ./train_data/rec/
[2022/11/17 17:20:58] ppocr INFO: label_file_list : ['./train_data/rec/val.txt']
[2022/11/17 17:20:58] ppocr INFO: name : SimpleDataSet
[2022/11/17 17:20:58] ppocr INFO: transforms :
[2022/11/17 17:20:58] ppocr INFO: DecodeImage :
[2022/11/17 17:20:58] ppocr INFO: channel_first : False
[2022/11/17 17:20:58] ppocr INFO: img_mode : BGR
[2022/11/17 17:20:58] ppocr INFO: MultiLabelEncode : None
[2022/11/17 17:20:58] ppocr INFO: RecResizeImg :
[2022/11/17 17:20:58] ppocr INFO: image_shape : [3, 48, 320]
[2022/11/17 17:20:58] ppocr INFO: KeepKeys :
[2022/11/17 17:20:58] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_sar', 'length', 'valid_ratio']
[2022/11/17 17:20:58] ppocr INFO: loader :
[2022/11/17 17:20:58] ppocr INFO: batch_size_per_card : 4
[2022/11/17 17:20:58] ppocr INFO: drop_last : False
[2022/11/17 17:20:58] ppocr INFO: num_workers : 2
[2022/11/17 17:20:58] ppocr INFO: shuffle : False
[2022/11/17 17:20:58] ppocr INFO: Global :
[2022/11/17 17:20:58] ppocr INFO: cal_metric_during_train : True
[2022/11/17 17:20:58] ppocr INFO: character_dict_path : ppocr/utils/ppocr_keys_v1.txt
[2022/11/17 17:20:58] ppocr INFO: checkpoints : None
[2022/11/17 17:20:58] ppocr INFO: debug : False
[2022/11/17 17:20:58] ppocr INFO: distributed : False
[2022/11/17 17:20:58] ppocr INFO: epoch_num : 500
[2022/11/17 17:20:58] ppocr INFO: eval_batch_step : [0, 2000]
[2022/11/17 17:20:58] ppocr INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2022/11/17 17:20:58] ppocr INFO: infer_mode : False
[2022/11/17 17:20:58] ppocr INFO: log_smooth_window : 20
[2022/11/17 17:20:58] ppocr INFO: max_text_length : 25
[2022/11/17 17:20:58] ppocr INFO: pretrained_model : None
[2022/11/17 17:20:58] ppocr INFO: print_batch_step : 10
[2022/11/17 17:20:58] ppocr INFO: save_epoch_step : 10
[2022/11/17 17:20:58] ppocr INFO: save_inference_dir : None
[2022/11/17 17:20:58] ppocr INFO: save_model_dir : ./output/rec_ppocr_v3
[2022/11/17 17:20:58] ppocr INFO: save_res_path : ./output/rec/predicts_ppocrv3.txt
[2022/11/17 17:20:58] ppocr INFO: use_dynamic_loss_scaling : True
[2022/11/17 17:20:58] ppocr INFO: use_gpu : True
[2022/11/17 17:20:58] ppocr INFO: use_space_char : True
[2022/11/17 17:20:58] ppocr INFO: use_visualdl : False
[2022/11/17 17:20:58] ppocr INFO: Loss :
[2022/11/17 17:20:58] ppocr INFO: loss_config_list :
[2022/11/17 17:20:58] ppocr INFO: CTCLoss : None
[2022/11/17 17:20:58] ppocr INFO: SARLoss : None
[2022/11/17 17:20:58] ppocr INFO: name : MultiLoss
[2022/11/17 17:20:58] ppocr INFO: Metric :
[2022/11/17 17:20:58] ppocr INFO: ignore_space : False
[2022/11/17 17:20:58] ppocr INFO: main_indicator : acc
[2022/11/17 17:20:58] ppocr INFO: name : RecMetric
[2022/11/17 17:20:58] ppocr INFO: Optimizer :
[2022/11/17 17:20:58] ppocr INFO: beta1 : 0.9
[2022/11/17 17:20:58] ppocr INFO: beta2 : 0.999
[2022/11/17 17:20:58] ppocr INFO: lr :
[2022/11/17 17:20:58] ppocr INFO: learning_rate : 0.001
[2022/11/17 17:20:58] ppocr INFO: name : Cosine
[2022/11/17 17:20:58] ppocr INFO: warmup_epoch : 5
[2022/11/17 17:20:58] ppocr INFO: name : Adam
[2022/11/17 17:20:58] ppocr INFO: regularizer :
[2022/11/17 17:20:58] ppocr INFO: factor : 3e-05
[2022/11/17 17:20:58] ppocr INFO: name : L2
[2022/11/17 17:20:58] ppocr INFO: PostProcess :
[2022/11/17 17:20:58] ppocr INFO: name : CTCLabelDecode
[2022/11/17 17:20:58] ppocr INFO: Train :
[2022/11/17 17:20:58] ppocr INFO: dataset :
[2022/11/17 17:20:58] ppocr INFO: data_dir : ./train_data/rec/
[2022/11/17 17:20:58] ppocr INFO: ext_op_transform_idx : 1
[2022/11/17 17:20:58] ppocr INFO: label_file_list : ['./train_data/rec/train.txt']
[2022/11/17 17:20:58] ppocr INFO: name : SimpleDataSet
[2022/11/17 17:20:58] ppocr INFO: transforms :
[2022/11/17 17:20:58] ppocr INFO: DecodeImage :
[2022/11/17 17:20:58] ppocr INFO: channel_first : False
[2022/11/17 17:20:58] ppocr INFO: img_mode : BGR
[2022/11/17 17:20:58] ppocr INFO: RecConAug :
[2022/11/17 17:20:58] ppocr INFO: ext_data_num : 2
[2022/11/17 17:20:58] ppocr INFO: image_shape : [48, 320, 3]
[2022/11/17 17:20:58] ppocr INFO: prob : 0.5
[2022/11/17 17:20:58] ppocr INFO: RecAug : None
[2022/11/17 17:20:58] ppocr INFO: MultiLabelEncode : None
[2022/11/17 17:20:58] ppocr INFO: RecResizeImg :
[2022/11/17 17:20:58] ppocr INFO: image_shape : [3, 48, 320]
[2022/11/17 17:20:58] ppocr INFO: KeepKeys :
[2022/11/17 17:20:58] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_sar', 'length', 'valid_ratio']
[2022/11/17 17:20:58] ppocr INFO: loader :
[2022/11/17 17:20:58] ppocr INFO: batch_size_per_card : 4
[2022/11/17 17:20:58] ppocr INFO: drop_last : True
[2022/11/17 17:20:58] ppocr INFO: num_workers : 2
[2022/11/17 17:20:58] ppocr INFO: shuffle : True
[2022/11/17 17:20:58] ppocr INFO: profiler_options : None
[2022/11/17 17:20:58] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/11/17 17:20:58] ppocr INFO: Initialize indexs of datasets:['./train_data/rec/train.txt']
[2022/11/17 17:20:58] ppocr INFO: Initialize indexs of datasets:['./train_data/rec/val.txt']
W1117 17:20:58.723132 1747 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.2, Driver API Version: 10.2, Runtime API Version: 10.2
W1117 17:20:58.728123 1747 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022/11/17 17:21:01] ppocr INFO: train dataloader has 87 iters
[2022/11/17 17:21:01] ppocr INFO: valid dataloader has 30 iters
[2022/11/17 17:21:01] ppocr INFO: train from scratch
[2022/11/17 17:21:01] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
Traceback (most recent call last):
File "tools/train.py", line 202, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 177, in main
eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list)
File "/opt/pp/projects/PaddleOCR/tools/program.py", line 302, in train
loss = loss_class(preds, batch)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/opt/pp/projects/PaddleOCR/ppocr/losses/rec_multi_loss.py", line 48, in forward
batch[:2] + batch[3:])['loss'] * self.weight_1
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/opt/pp/projects/PaddleOCR/ppocr/losses/rec_ctc_loss.py", line 38, in forward
loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/nn/layer/loss.py", line 1134, in forward
norm_by_times=norm_by_times)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/nn/functional/loss.py", line 1130, in ctc_loss
input_lengths, label_lengths)
File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/layers/loss.py", line 613, in warpctc
norm_by_times, )
RuntimeError:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
1 void paddle::imperative::Tracer::TraceOpImpl<paddle::imperative::VarBase>(std::string const&, paddle::imperative::details::NameVarMapTrait<paddle::imperative::VarBase>::Type const&, paddle::imperative::details::NameVarMapTrait<paddle::imperative::VarBase>::Type const&, paddle::framework::AttributeMap&, phi::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&, paddle::framework::AttributeMap*, bool)
2 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
3 phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*), &(void phi::WarpctcKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*))>::Compute(phi::KernelContext*)
4 void phi::WarpctcKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*)
5 phi::WarpCTCFunctor<phi::GPUContext, float>::operator()(phi::GPUContext const&, float const*, float*, int const*, int const*, int const*, unsigned long, unsigned long, unsigned long, float*)
6 phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
7 phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
----------------------
Error Message Summary:
----------------------
PreconditionNotMetError: warp-ctc [version 2] Error in get_workspace_size: execution failed
[Hint: Expected CTC_STATUS_SUCCESS == status, but received CTC_STATUS_SUCCESS:0 != status:3.] (at /home/paddle/data/xly/workspace/23282/Paddle/paddle/phi/kernels/impl/warpctc_kernel_impl.h:199)
[operator < warpctc > error]
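For reading the hint line above: the numeric `status:3` can be mapped back to a warp-ctc status name. This is a small sketch, assuming the `ctcStatus_t` values are as I recall them from warp-ctc's `ctc.h` (worth checking against the header bundled with your Paddle build); the helper name `decode_ctc_status` is made up for illustration.

```python
import re

# ctcStatus_t values, assumed from warp-ctc's ctc.h.
CTC_STATUS = {
    0: "CTC_STATUS_SUCCESS",
    1: "CTC_STATUS_MEMOPS_FAILED",
    2: "CTC_STATUS_INVALID_VALUE",
    3: "CTC_STATUS_EXECUTION_FAILED",
    4: "CTC_STATUS_UNKNOWN_ERROR",
}


def decode_ctc_status(message):
    """Return (code, name) for the first lowercase 'status:<n>' in message, else None."""
    match = re.search(r"\bstatus:(\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CTC_STATUS.get(code, "unrecognized status")


hint = ("[Hint: Expected CTC_STATUS_SUCCESS == status, "
        "but received CTC_STATUS_SUCCESS:0 != status:3.]")
print(decode_ctc_status(hint))  # (3, 'CTC_STATUS_EXECUTION_FAILED')
```

Status 3 matches the "execution failed" text in the summary, which points to the GPU-side call failing at run time, not to an invalid argument (2) or a memory operation (1).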
Additional Supplementary Information
The paddlepaddle-gpu build I used is this one: paddlepaddle-gpu
5 Answers
cgvd09ve1#
Hi! We have received your issue and will assign a technician to answer it as soon as possible, so please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error message. In the meantime, you can also look for an answer in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!
6xfqseft2#
You can file an issue under PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR/issues
oxf4rvwz3#
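You can file an issue under PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR/issues
The PaddleOCR team sent me here to file it: "It looks like the CTC loss kernel does not work properly on the Jetson GPU." Could you please take a look at this problem?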
PaddlePaddle/PaddleOCR#8357 (comment)
41zrol4v4#
OK, this has been passed on to the API owner.
20jt8wwn5#
Hello, you can refer to this blog post and try downgrading the Paddle version: https://blog.csdn.net/qq_36038453/article/details/125844765