Describe the Bug

Training a PaddleOCR text recognition model on the GPU under JetPack fails with: warp-ctc [version 2] Error in get_workspace_size: execution failed.

Training runs normally on the CPU (Global.use_gpu=false), but fails on the GPU (Global.use_gpu=true).
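To check whether the failure comes from the CTC loss operator itself rather than from the PaddleOCR training pipeline, the CPU/GPU comparison above can be reduced to a single loss call. The following is a minimal sketch, not the PaddleOCR code path: the helper names `make_ctc_inputs` and `ctc_loss_on`, the tensor sizes, and the random inputs are all illustrative assumptions. It relies only on `paddle.set_device`, `paddle.to_tensor`, and `paddle.nn.functional.ctc_loss`, the last of which appears in the traceback of this report.

```python
import random


def make_ctc_inputs(time_steps=40, batch=4, num_classes=10, label_len=5, seed=0):
    """Build random CTC inputs as plain nested lists (class 0 is the blank)."""
    rng = random.Random(seed)
    # Unnormalized scores with layout [time_steps, batch, num_classes + 1].
    logits = [[[rng.gauss(0.0, 1.0) for _ in range(num_classes + 1)]
               for _ in range(batch)]
              for _ in range(time_steps)]
    # Labels avoid index 0, which is reserved for the blank symbol.
    labels = [[rng.randint(1, num_classes) for _ in range(label_len)]
              for _ in range(batch)]
    input_lengths = [time_steps] * batch
    label_lengths = [label_len] * batch
    return logits, labels, input_lengths, label_lengths


def ctc_loss_on(device):
    """Run one CTC loss call on `device` ("cpu" or "gpu") and return the mean loss."""
    # Imported lazily so make_ctc_inputs can be used without Paddle installed.
    import paddle
    import paddle.nn.functional as F

    paddle.set_device(device)
    logits, labels, input_lengths, label_lengths = make_ctc_inputs()
    loss = F.ctc_loss(
        paddle.to_tensor(logits, dtype="float32"),
        paddle.to_tensor(labels, dtype="int32"),
        paddle.to_tensor(input_lengths, dtype="int64"),
        paddle.to_tensor(label_lengths, dtype="int64"),
        blank=0,
        reduction="none",
    )
    return loss.mean().item()
```

Calling `ctc_loss_on("cpu")` and then `ctc_loss_on("gpu")` on the Jetson should show whether the GPU call alone raises the same warp-ctc error. If it does, the problem is likely in the Paddle build's warpctc GPU kernel rather than in the training configuration.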

System environment

- NVIDIA Jetson AGX Xavier [16GB]
   * Jetpack 4.6 [L4T 32.6.1]
   * NV Power Mode: MAXN - Type: 0
   * jetson_stats.service: active
- Libraries:
   * CUDA: 10.2.300
   * cuDNN: 8.2.1.32
   * TensorRT: 8.0.1.6
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
   * VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70

Version information:

Paddle: paddlepaddle-gpu 2.3.2

PaddlePaddle 2.3.2, compiled with
    with_avx: OFF
    with_gpu: ON
    with_mkl: OFF
    with_mkldnn: OFF
    with_python: ON

PaddleOCR: 2.6.1.0

Command

python tools/train.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml  -o Global.pretrained_model=./pretrain_models/ch_PP-OCRv3_rec_train/best_accuracy

Full error output

[2022/11/17 17:20:58] ppocr INFO:     Backbone : 
[2022/11/17 17:20:58] ppocr INFO:         last_conv_stride : [1, 2]
[2022/11/17 17:20:58] ppocr INFO:         last_pool_type : avg
[2022/11/17 17:20:58] ppocr INFO:         name : MobileNetV1Enhance
[2022/11/17 17:20:58] ppocr INFO:         scale : 0.5
[2022/11/17 17:20:58] ppocr INFO:     Head : 
[2022/11/17 17:20:58] ppocr INFO:         head_list : 
[2022/11/17 17:20:58] ppocr INFO:             CTCHead : 
[2022/11/17 17:20:58] ppocr INFO:                 Head : 
[2022/11/17 17:20:58] ppocr INFO:                     fc_decay : 1e-05
[2022/11/17 17:20:58] ppocr INFO:                 Neck : 
[2022/11/17 17:20:58] ppocr INFO:                     depth : 2
[2022/11/17 17:20:58] ppocr INFO:                     dims : 64
[2022/11/17 17:20:58] ppocr INFO:                     hidden_dims : 120
[2022/11/17 17:20:58] ppocr INFO:                     name : svtr
[2022/11/17 17:20:58] ppocr INFO:                     use_guide : True
[2022/11/17 17:20:58] ppocr INFO:             SARHead : 
[2022/11/17 17:20:58] ppocr INFO:                 enc_dim : 512
[2022/11/17 17:20:58] ppocr INFO:                 max_text_length : 25
[2022/11/17 17:20:58] ppocr INFO:         name : MultiHead
[2022/11/17 17:20:58] ppocr INFO:     Transform : None
[2022/11/17 17:20:58] ppocr INFO:     algorithm : SVTR
[2022/11/17 17:20:58] ppocr INFO:     model_type : rec
[2022/11/17 17:20:58] ppocr INFO: Eval : 
[2022/11/17 17:20:58] ppocr INFO:     dataset : 
[2022/11/17 17:20:58] ppocr INFO:         data_dir : ./train_data/rec/
[2022/11/17 17:20:58] ppocr INFO:         label_file_list : ['./train_data/rec/val.txt']
[2022/11/17 17:20:58] ppocr INFO:         name : SimpleDataSet
[2022/11/17 17:20:58] ppocr INFO:         transforms : 
[2022/11/17 17:20:58] ppocr INFO:             DecodeImage : 
[2022/11/17 17:20:58] ppocr INFO:                 channel_first : False
[2022/11/17 17:20:58] ppocr INFO:                 img_mode : BGR
[2022/11/17 17:20:58] ppocr INFO:             MultiLabelEncode : None
[2022/11/17 17:20:58] ppocr INFO:             RecResizeImg : 
[2022/11/17 17:20:58] ppocr INFO:                 image_shape : [3, 48, 320]
[2022/11/17 17:20:58] ppocr INFO:             KeepKeys : 
[2022/11/17 17:20:58] ppocr INFO:                 keep_keys : ['image', 'label_ctc', 'label_sar', 'length', 'valid_ratio']
[2022/11/17 17:20:58] ppocr INFO:     loader : 
[2022/11/17 17:20:58] ppocr INFO:         batch_size_per_card : 4
[2022/11/17 17:20:58] ppocr INFO:         drop_last : False
[2022/11/17 17:20:58] ppocr INFO:         num_workers : 2
[2022/11/17 17:20:58] ppocr INFO:         shuffle : False
[2022/11/17 17:20:58] ppocr INFO: Global : 
[2022/11/17 17:20:58] ppocr INFO:     cal_metric_during_train : True
[2022/11/17 17:20:58] ppocr INFO:     character_dict_path : ppocr/utils/ppocr_keys_v1.txt
[2022/11/17 17:20:58] ppocr INFO:     checkpoints : None
[2022/11/17 17:20:58] ppocr INFO:     debug : False
[2022/11/17 17:20:58] ppocr INFO:     distributed : False
[2022/11/17 17:20:58] ppocr INFO:     epoch_num : 500
[2022/11/17 17:20:58] ppocr INFO:     eval_batch_step : [0, 2000]
[2022/11/17 17:20:58] ppocr INFO:     infer_img : doc/imgs_words/ch/word_1.jpg
[2022/11/17 17:20:58] ppocr INFO:     infer_mode : False
[2022/11/17 17:20:58] ppocr INFO:     log_smooth_window : 20
[2022/11/17 17:20:58] ppocr INFO:     max_text_length : 25
[2022/11/17 17:20:58] ppocr INFO:     pretrained_model : None
[2022/11/17 17:20:58] ppocr INFO:     print_batch_step : 10
[2022/11/17 17:20:58] ppocr INFO:     save_epoch_step : 10
[2022/11/17 17:20:58] ppocr INFO:     save_inference_dir : None
[2022/11/17 17:20:58] ppocr INFO:     save_model_dir : ./output/rec_ppocr_v3
[2022/11/17 17:20:58] ppocr INFO:     save_res_path : ./output/rec/predicts_ppocrv3.txt
[2022/11/17 17:20:58] ppocr INFO:     use_dynamic_loss_scaling : True
[2022/11/17 17:20:58] ppocr INFO:     use_gpu : True
[2022/11/17 17:20:58] ppocr INFO:     use_space_char : True
[2022/11/17 17:20:58] ppocr INFO:     use_visualdl : False
[2022/11/17 17:20:58] ppocr INFO: Loss : 
[2022/11/17 17:20:58] ppocr INFO:     loss_config_list : 
[2022/11/17 17:20:58] ppocr INFO:         CTCLoss : None
[2022/11/17 17:20:58] ppocr INFO:         SARLoss : None
[2022/11/17 17:20:58] ppocr INFO:     name : MultiLoss
[2022/11/17 17:20:58] ppocr INFO: Metric : 
[2022/11/17 17:20:58] ppocr INFO:     ignore_space : False
[2022/11/17 17:20:58] ppocr INFO:     main_indicator : acc
[2022/11/17 17:20:58] ppocr INFO:     name : RecMetric
[2022/11/17 17:20:58] ppocr INFO: Optimizer : 
[2022/11/17 17:20:58] ppocr INFO:     beta1 : 0.9
[2022/11/17 17:20:58] ppocr INFO:     beta2 : 0.999
[2022/11/17 17:20:58] ppocr INFO:     lr : 
[2022/11/17 17:20:58] ppocr INFO:         learning_rate : 0.001
[2022/11/17 17:20:58] ppocr INFO:         name : Cosine
[2022/11/17 17:20:58] ppocr INFO:         warmup_epoch : 5
[2022/11/17 17:20:58] ppocr INFO:     name : Adam
[2022/11/17 17:20:58] ppocr INFO:     regularizer : 
[2022/11/17 17:20:58] ppocr INFO:         factor : 3e-05
[2022/11/17 17:20:58] ppocr INFO:         name : L2
[2022/11/17 17:20:58] ppocr INFO: PostProcess : 
[2022/11/17 17:20:58] ppocr INFO:     name : CTCLabelDecode
[2022/11/17 17:20:58] ppocr INFO: Train : 
[2022/11/17 17:20:58] ppocr INFO:     dataset : 
[2022/11/17 17:20:58] ppocr INFO:         data_dir : ./train_data/rec/
[2022/11/17 17:20:58] ppocr INFO:         ext_op_transform_idx : 1
[2022/11/17 17:20:58] ppocr INFO:         label_file_list : ['./train_data/rec/train.txt']
[2022/11/17 17:20:58] ppocr INFO:         name : SimpleDataSet
[2022/11/17 17:20:58] ppocr INFO:         transforms : 
[2022/11/17 17:20:58] ppocr INFO:             DecodeImage : 
[2022/11/17 17:20:58] ppocr INFO:                 channel_first : False
[2022/11/17 17:20:58] ppocr INFO:                 img_mode : BGR
[2022/11/17 17:20:58] ppocr INFO:             RecConAug : 
[2022/11/17 17:20:58] ppocr INFO:                 ext_data_num : 2
[2022/11/17 17:20:58] ppocr INFO:                 image_shape : [48, 320, 3]
[2022/11/17 17:20:58] ppocr INFO:                 prob : 0.5
[2022/11/17 17:20:58] ppocr INFO:             RecAug : None
[2022/11/17 17:20:58] ppocr INFO:             MultiLabelEncode : None
[2022/11/17 17:20:58] ppocr INFO:             RecResizeImg : 
[2022/11/17 17:20:58] ppocr INFO:                 image_shape : [3, 48, 320]
[2022/11/17 17:20:58] ppocr INFO:             KeepKeys : 
[2022/11/17 17:20:58] ppocr INFO:                 keep_keys : ['image', 'label_ctc', 'label_sar', 'length', 'valid_ratio']
[2022/11/17 17:20:58] ppocr INFO:     loader : 
[2022/11/17 17:20:58] ppocr INFO:         batch_size_per_card : 4
[2022/11/17 17:20:58] ppocr INFO:         drop_last : True
[2022/11/17 17:20:58] ppocr INFO:         num_workers : 2
[2022/11/17 17:20:58] ppocr INFO:         shuffle : True
[2022/11/17 17:20:58] ppocr INFO: profiler_options : None
[2022/11/17 17:20:58] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/11/17 17:20:58] ppocr INFO: Initialize indexs of datasets:['./train_data/rec/train.txt']
[2022/11/17 17:20:58] ppocr INFO: Initialize indexs of datasets:['./train_data/rec/val.txt']
W1117 17:20:58.723132  1747 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.2, Driver API Version: 10.2, Runtime API Version: 10.2
W1117 17:20:58.728123  1747 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022/11/17 17:21:01] ppocr INFO: train dataloader has 87 iters
[2022/11/17 17:21:01] ppocr INFO: valid dataloader has 30 iters
[2022/11/17 17:21:01] ppocr INFO: train from scratch
[2022/11/17 17:21:01] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
Traceback (most recent call last):
  File "tools/train.py", line 202, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 177, in main
    eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list)
  File "/opt/pp/projects/PaddleOCR/tools/program.py", line 302, in train
    loss = loss_class(preds, batch)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/pp/projects/PaddleOCR/ppocr/losses/rec_multi_loss.py", line 48, in forward
    batch[:2] + batch[3:])['loss'] * self.weight_1
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/pp/projects/PaddleOCR/ppocr/losses/rec_ctc_loss.py", line 38, in forward
    loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/nn/layer/loss.py", line 1134, in forward
    norm_by_times=norm_by_times)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/nn/functional/loss.py", line 1130, in ctc_loss
    input_lengths, label_lengths)
  File "/opt/pp/mambaforge/envs/paddleocr/lib/python3.6/site-packages/paddle/fluid/layers/loss.py", line 613, in warpctc
    norm_by_times, )
RuntimeError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
1   void paddle::imperative::Tracer::TraceOpImpl<paddle::imperative::VarBase>(std::string const&, paddle::imperative::details::NameVarMapTrait<paddle::imperative::VarBase>::Type const&, paddle::imperative::details::NameVarMapTrait<paddle::imperative::VarBase>::Type const&, paddle::framework::AttributeMap&, phi::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&, paddle::framework::AttributeMap*, bool)
2   paddle::imperative::PreparedOp::Run(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
3   phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*), &(void phi::WarpctcKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*))>::Compute(phi::KernelContext*)
4   void phi::WarpctcKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor const&>, paddle::optional<phi::DenseTensor const&>, int, bool, phi::DenseTensor*, phi::DenseTensor*)
5   phi::WarpCTCFunctor<phi::GPUContext, float>::operator()(phi::GPUContext const&, float const*, float*, int const*, int const*, int const*, unsigned long, unsigned long, unsigned long, float*)
6   phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
7   phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
PreconditionNotMetError: warp-ctc [version 2] Error in get_workspace_size: execution failed
  [Hint: Expected CTC_STATUS_SUCCESS == status, but received CTC_STATUS_SUCCESS:0 != status:3.] (at /home/paddle/data/xly/workspace/23282/Paddle/paddle/phi/kernels/impl/warpctc_kernel_impl.h:199)
  [operator < warpctc > error]
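The hint in the summary reports only a numeric status (`status:3`). As a reading aid, the sketch below extracts that number from the error text and maps it to a status name. The mapping reflects the `ctcStatus_t` enum as declared in warp-ctc's `ctc.h` to the best of my knowledge, so it is worth checking against the header bundled with your Paddle build. The function name `decode_warpctc_status` is hypothetical.

```python
import re

# ctcStatus_t values, assumed to match warp-ctc's ctc.h (verify against your build).
CTC_STATUS_NAMES = {
    0: "CTC_STATUS_SUCCESS",
    1: "CTC_STATUS_MEMOPS_FAILED",
    2: "CTC_STATUS_INVALID_VALUE",
    3: "CTC_STATUS_EXECUTION_FAILED",
    4: "CTC_STATUS_UNKNOWN_ERROR",
}

# Matches the "... != status:3." fragment of Paddle's enforce hint.
STATUS_PATTERN = re.compile(r"!= status:(\d+)")


def decode_warpctc_status(error_text):
    """Return (code, name) for the warp-ctc status in error_text, or None if absent."""
    match = STATUS_PATTERN.search(error_text)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CTC_STATUS_NAMES.get(code, "unknown status code")


hint = ("[Hint: Expected CTC_STATUS_SUCCESS == status, "
        "but received CTC_STATUS_SUCCESS:0 != status:3.]")
print(decode_warpctc_status(hint))  # (3, 'CTC_STATUS_EXECUTION_FAILED')
```

Status 3 matches the "execution failed" text in the message. It indicates that the call failed while running on the device, but does not by itself identify the root cause.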

Additional Supplementary Information

The paddlepaddle-gpu build in use is this one: paddlepaddle-gpu

5 answers

cgvd09ve1#

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please make sure that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check the API documentation, FAQ, GitHub Issues, and AI community for an answer. Have a nice day!

2022-11-19

6xfqseft2#

You can open an issue in the PaddleOCR repository: https://github.com/PaddlePaddle/PaddleOCR/issues

oxf4rvwz3#

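The PaddleOCR team asked me to file this here. Their comment was: "It looks like the CTC loss kernel does not work properly on the Jetson GPU." Could you please take a look at this problem?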

PaddlePaddle/PaddleOCR#8357 (comment)

41zrol4v4#

OK, this has been passed on to the API owner.

20jt8wwn5#

Hello, you could refer to this blog post and try downgrading the Paddle version: https://blog.csdn.net/qq_36038453/article/details/125844765