Paddle: gloo connection timeout after evaluating and exporting the model during distributed CPU training

5us2dqdw posted 5 months ago in Other
Problem Description

When running distributed training on CPU, a gloo connection timeout error is raised right after the model is evaluated and exported.

Runtime Environment
  • OS: CentOS 7.8
  • Paddle: 2.6.1
  • PaddleOCR: 2.7.3
Reproduction Code

I have reproduced this several times. The error always appears right after model evaluation; the reproduction rate is 100%.

Complete Error Message

eval model:: 100%|█████████▉| 1056/1061 [40:19<00:11, 2.30s/it]
eval model:: 100%|█████████▉| 1057/1061 [40:21<00:09, 2.29s/it]
eval model:: 100%|█████████▉| 1058/1061 [40:23<00:06, 2.31s/it]
eval model:: 100%|█████████▉| 1059/1061 [40:26<00:04, 2.29s/it]
eval model:: 100%|█████████▉| 1060/1061 [40:28<00:02, 2.28s/it]
eval model:: 100%|██████████| 1061/1061 [40:30<00:00, 2.29s/it]
eval model:: 100%|██████████| 1061/1061 [40:30<00:00, 2.29s/it]
[2024/07/19 09:07:20] ppocr INFO: cur metric, precision: 0.0, recall: 0.0, hmean: 0, fps: 0.43905768252441396
[2024/07/19 09:07:20] ppocr INFO: save best model is to ./output/db_mv3/best_accuracy
[2024/07/19 09:07:20] ppocr INFO: best metric, hmean: 0, is_float16: False, precision: 0.0, recall: 0.0, fps: 0.43905768252441396, best_epoch: 71, start_epoch: 58
terminate called after throwing an instance of 'gloo::TimeoutException'
what(): [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [172.16.32.105]:36265

C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1721351251 (unix time) try "date -d @1721351251" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3e8000025ff) received by PID 9727 (TID 0x7fbd951ca180) from PID 9727 ***]

LAUNCH INFO 2024-07-19 09:07:31,757 Pod failed
LAUNCH ERROR 2024-07-19 09:07:31,758 Container failed !!!
Container rank 0 status failed cmd ['/usr/local/python-3.10/bin/python3', '-u', 'tools/train.py', '-c', 'configs/det/det_mv3_db.yml', '-o', 'Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained', '-o', 'Global.checkpoints=./output/db_mv3/best_accuracy'] code -6 log log/workerlog.0
env {'XDG_SESSION_ID': '177', 'HOSTNAME': 'paddle1', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'HISTSIZE': '1000', 'LD_PRELOAD': '/usr/local/lib/libjemalloc.so', 'CPU_NUM': '2', 'USER': 'paddleocr', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.axv=01;35:.anx=01;35:.ogv=01;35:.ogx=01;35:.aac=01;36:.au=01;36:.flac=01;36:.mid=01;36:.midi=01;36:.mka=01;36:.mp3=01;36:.mpc=01;36:.ogg=01;36:.ra=01;36:.wav=01;36:.axa=01;36:.oga=01;36:.spx=01;36:*.xspf=01;36:', 'LD_LIBRARY_PATH': '/usr/local/python-3.10/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/openssl/lib:', 'PATH': '/usr/local/jdk1.8.0_202/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/paddleocr/.local/bin:/home/paddleocr/bin', 'MAIL': '/var/spool/mail/paddleocr', 'PWD': '/home/paddleocr/train/PaddleOCR', 'LANG': 'en_US.UTF-8', 'HISTCONTROL': 'ignoredups', 'HOME': '/home/paddleocr', 'SHLVL': '2', 'LOGNAME': 'paddleocr', 'LESSOPEN': 
'||/usr/bin/lesspipe.sh %s', '_': '/usr/local/python-3.10/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/usr/local/python-3.10/lib/python3.10/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/usr/local/python-3.10/lib/python3.10/site-packages/cv2/qt/fonts', 'POD_NAME': 'oqqaos', 'PADDLE_GLOBAL_SIZE': '14', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '14', 'PADDLE_CURRENT_ENDPOINT': '172.16.32.107:6070', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '14', 'PADDLE_RANK_IN_NODE': '0', 'PADDLE_TRAINER_ENDPOINTS': '172.16.32.107:6070,172.16.32.105:6070,172.16.32.109:6070,172.16.32.110:6070,172.16.32.111:6070,172.16.32.112:6070,172.16.32.113:6070,172.16.32.114:6070,172.16.32.116:6070,172.16.32.118:6070,172.16.32.119:6070,172.16.32.120:6070,172.16.32.121:6070,172.16.32.122:6070', 'PADDLE_DISTRI_BACKEND': 'gloo', 'PADDLE_LOG_DIR': '/home/paddleocr/train/PaddleOCR/log'}
LAUNCH INFO 2024-07-19 09:07:31,758 ------------------------- ERROR LOG DETAIL -------------------------
s/it]
eval model:: 98%|█████████▊| 1041/1061 [39:44<00:45, 2.29s/it]
eval model:: 98%|█████████▊| 1042/1061 [39:47<00:43, 2.28s/it]
eval model:: 98%|█████████▊| 1043/1061 [39:49<00:41, 2.30s/it]
eval model:: 98%|█████████▊| 1044/1061 [39:51<00:39, 2.29s/it]
eval model:: 98%|█████████▊| 1045/1061 [39:54<00:36, 2.28s/it]
eval model:: 99%|█████████▊| 1046/1061 [39:56<00:34, 2.28s/it]
eval model:: 99%|█████████▊| 1047/1061 [39:58<00:31, 2.28s/it]
eval model:: 99%|█████████▉| 1048/1061 [40:00<00:29, 2.28s/it]
eval model:: 99%|█████████▉| 1049/1061 [40:03<00:27, 2.29s/it]
eval model:: 99%|█████████▉| 1050/1061 [40:05<00:25, 2.29s/it]
eval model:: 99%|█████████▉| 1051/1061 [40:07<00:22, 2.29s/it]
eval model:: 99%|█████████▉| 1052/1061 [40:10<00:20, 2.29s/it]
eval model:: 99%|█████████▉| 1053/1061 [40:12<00:18, 2.29s/it]
eval model:: 99%|█████████▉| 1054/1061 [40:14<00:16, 2.29s/it]
eval model:: 99%|█████████▉| 1055/1061 [40:16<00:13, 2.29s/it]
eval model:: 100%|█████████▉| 1056/1061 [40:19<00:11, 2.30s/it]
eval model:: 100%|█████████▉| 1057/1061 [40:21<00:09, 2.29s/it]
eval model:: 100%|█████████▉| 1058/1061 [40:23<00:06, 2.31s/it]
eval model:: 100%|█████████▉| 1059/1061 [40:26<00:04, 2.29s/it]
eval model:: 100%|█████████▉| 1060/1061 [40:28<00:02, 2.28s/it]
eval model:: 100%|██████████| 1061/1061 [40:30<00:00, 2.29s/it]
eval model:: 100%|██████████| 1061/1061 [40:30<00:00, 2.29s/it]
[2024/07/19 09:07:20] ppocr INFO: cur metric, precision: 0.0, recall: 0.0, hmean: 0, fps: 0.43905768252441396
[2024/07/19 09:07:20] ppocr INFO: save best model is to ./output/db_mv3/best_accuracy
[2024/07/19 09:07:20] ppocr INFO: best metric, hmean: 0, is_float16: False, precision: 0.0, recall: 0.0, fps: 0.43905768252441396, best_epoch: 71, start_epoch: 58
terminate called after throwing an instance of 'gloo::TimeoutException'
what(): [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [172.16.32.105]:36265

C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1721351251 (unix time) try "date -d @1721351251" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3e8000025ff) received by PID 9727 (TID 0x7fbd951ca180) from PID 9727 ***]

LAUNCH INFO 2024-07-19 09:07:31,760 Exit code -6
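
As a reading aid for the TIMEOUT line in the log above, here is a minimal Python sketch that pulls out `self_rank`, `pair_rank`, and `peer_str` and looks up the matching entry in `PADDLE_TRAINER_ENDPOINTS`. The strings are copied from the log; the helper names `parse_gloo_timeout` and `endpoint_for_rank` are hypothetical. Treating a gloo rank as an index into the endpoint list is an assumption that happens to agree with this log, not something the log states.

```python
import re

# Copied verbatim from the error log above.
TIMEOUT_LINE = (
    "what(): [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] "
    "TIMEOUT self_rank = 0 pair_rank = 1 peer_str = [172.16.32.105]:36265"
)
# Copied from the PADDLE_TRAINER_ENDPOINTS entry in the env dump above.
TRAINER_ENDPOINTS = (
    "172.16.32.107:6070,172.16.32.105:6070,172.16.32.109:6070,"
    "172.16.32.110:6070,172.16.32.111:6070,172.16.32.112:6070,"
    "172.16.32.113:6070,172.16.32.114:6070,172.16.32.116:6070,"
    "172.16.32.118:6070,172.16.32.119:6070,172.16.32.120:6070,"
    "172.16.32.121:6070,172.16.32.122:6070"
)


def parse_gloo_timeout(line):
    """Return the rank and peer fields of a gloo TIMEOUT line, or None."""
    match = re.search(
        r"TIMEOUT self_rank = (\S+) pair_rank = (\S+) peer_str = (\S+)", line
    )
    if match is None:
        return None
    self_rank, pair_rank, peer_str = match.groups()
    return {"self_rank": self_rank, "pair_rank": pair_rank, "peer_str": peer_str}


def endpoint_for_rank(endpoints, rank):
    """ASSUMPTION: rank N is the N-th entry of the comma-separated list."""
    return endpoints.split(",")[rank]


info = parse_gloo_timeout(TIMEOUT_LINE)
peer_endpoint = endpoint_for_rank(TRAINER_ENDPOINTS, int(info["pair_rank"]))
# peer_str has the form "[ip]:port"; keep only the IP for the comparison.
peer_ip = re.match(r"\[(.+)\]:\d+", info["peer_str"]).group(1)

print("rank", info["self_rank"], "timed out waiting for rank", info["pair_rank"])
print("endpoint list entry:", peer_endpoint, "| peer_str IP:", peer_ip)
```

In this log the IP in `peer_str` and the endpoint at index 1 agree (172.16.32.105), so the worker log of that node is probably the next place to look. The port in `peer_str` (36265) differs from the launch port 6070.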

xv8emn3q (#1)

Try upgrading the Paddle framework to 3.0 beta and see whether that helps.

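tsm1rwdh (#2)

Re: "Try upgrading the Paddle framework to 3.0 beta and see whether that helps."

After upgrading I tried several more times and it still fails. This time the exception is thrown from a different source file, but it is still a timeout. The error is as follows: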
[2024/07/20 13:53:08] ppocr INFO: save best model is to ./output/db_mv3/best_accuracy
[2024/07/20 13:53:08] ppocr INFO: best metric, hmean: 0, is_float16: False, precision: 0.0, recall: 0.0, fps: 0.4382822336757883, best_epoch: 92, start_epoch: 86
terminate called after throwing an instance of 'gloo::TimeoutException'
what(): [/paddle/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:83] TIMEOUT self_rank = none pair_rank = -1 peer_str = none

C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1721454824 (unix time) try "date -d @1721454824" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3e800000222) received by PID 546 (TID 0x7fe2efc45180) from PID 546 ***]
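
To compare the two failures, the following Python sketch decodes the `Aborted at ... (unix time)` values from both logs and measures how long after the last `ppocr INFO` line each abort happened. The timestamps are copied from the logs. The time zone is an assumption (UTC+8), since the logs do not state one, and the helper name `abort_delay_seconds` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: the ppocr log timestamps are local time in UTC+8.
# The logs themselves do not state a time zone.
LOCAL_TZ = timezone(timedelta(hours=8))


def abort_delay_seconds(last_log_time, abort_unix_time, tz=LOCAL_TZ):
    """Seconds between the last ppocr log line and the SIGABRT timestamp."""
    logged = datetime.strptime(last_log_time, "%Y/%m/%d %H:%M:%S").replace(tzinfo=tz)
    aborted = datetime.fromtimestamp(abort_unix_time, tz=tz)
    return (aborted - logged).total_seconds()


# (last ppocr INFO timestamp, "Aborted at" unix time), copied from the logs.
runs = {
    "Paddle 2.6.1": ("2024/07/19 09:07:20", 1721351251),
    "after upgrade": ("2024/07/20 13:53:08", 1721454824),
}

delays = {name: abort_delay_seconds(*values) for name, values in runs.items()}
for name, delay in delays.items():
    print(f"{name}: aborted {delay:.0f} s after the last ppocr log line")
```

Under the UTC+8 assumption, the first abort falls at 09:07:31, which matches the `LAUNCH INFO 2024-07-19 09:07:31` lines in the first log. In both runs the abort follows within a minute of the "save best model" message on this rank, after an evaluation pass that took about 40 minutes.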
