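Please ask your question
When training a multi-class classifier with PaddleNLP, I use paddle.distributed.launch to run multiple processes across CPU cores and speed up training. The command and arguments are: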
python3 -m paddle.distributed.launch --nproc_per_node=16 train.py \
--do_train \
--do_export \
--train_path "./data/labeled_ch_content.csv" \
--label_path "./data/label.txt" \
--model_name_or_path ernie-3.0-tiny-medium-v2-zh \
--output_dir checkpoint \
--device cpu \
--num_train_epochs 100 \
--early_stopping True \
--early_stopping_patience 5 \
--learning_rate 3e-5 \
--max_length 1024 \
--per_device_eval_batch_size 8 \
--per_device_train_batch_size 8 \
--metric_for_best_model accuracy \
--load_best_model_at_end \
--logging_steps 5 \
--evaluation_strategy epoch \
--save_strategy epoch \
--save_total_limit 1
Because training runs in CPU mode, it takes a long time. After about two days, every run fails at some point with the following error:
TrainProcess: 12%|█▏ | 949/7900 [35:46:52<1359:09:27, 703.92s/it]
TrainProcess: 12%|█▏ | 950/7900 [35:49:36<1046:35:31, 542.12s/it][2024-05-28 23:08:09,609] [ INFO] - loss: 0.00859011, learning_rate: 2.932e-05, global_step: 950, interval_runtime: 2574.1901, interval_samples_per_second: 0.2486, interval_steps_per_second: 0.0019, progress_or_epoch: 12.0253
ng=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'SSH_CONNECTION': '172.16.254.238 43857 192.168.11.8 22', 'TOOLKIT_DOCKER_SUBNET': '10.2.3', 'DATA_PATH': '/data/data', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'XDG_SESSION_CLASS': 'user', 'TERM': 'screen', 'LOG_PATH': '/data/logs', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'USER': 'user', 'TMUX_PANE': '%0', 'DOCKER_PATH': '/data/docker', 'SHLVL': '2', 'XDG_SESSION_ID': '576', 'XDG_RUNTIME_DIR': '/run/user/1000', 'SSH_CLIENT': '172.16.254.238 43857 22', 'ZIPINFO': '-O GBK', 'UNZIP': '-O GBK', 'ZLOG_PROFILE_ERROR': '/var/log/xtx/zlog_error.log', 'ZLOG_PROFILE_DEBUG': '/var/log/xtx/zlog_debug.log', 'PATH': '/home/user/.cargo/bin:/home/user/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games', 'LIC_PATH': '/data/licence', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'SSH_TTY': '/dev/pts/0', 'BASE_PATH': '/data', 'OLDPWD': '/home/user/works/PaddleNLP/applications/text_classification/multi_class/data', '_': '/usr/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/home/user/.local/lib/python3.8/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/home/user/.local/lib/python3.8/site-packages/cv2/qt/fonts', 'LD_LIBRARY_PATH': '/home/user/.local/lib/python3.8/site-packages/cv2/../../lib64:', 'POD_NAME': 'sapftm', 'PADDLE_MASTER': '127.0.1.1:50611', 'PADDLE_GLOBAL_SIZE': 
'16', 'PADDLE_LOCAL_SIZE': '16', 'PADDLE_GLOBAL_RANK': '13', 'PADDLE_LOCAL_RANK': '13', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.1.1:50625', 'PADDLE_TRAINER_ID': '13', 'PADDLE_TRAINERS_NUM': '16', 'PADDLE_RANK_IN_NODE': '13', 'PADDLE_TRAINER_ENDPOINTS': '127.0.1.1:50612,127.0.1.1:50613,127.0.1.1:50614,127.0.1.1:50615,127.0.1.1:50616,127.0.1.1:50617,127.0.1.1:50618,127.0.1.1:50619,127.0.1.1:50620,127.0.1.1:50621,127.0.1.1:50622,127.0.1.1:50623,127.0.1.1:50624,127.0.1.1:50625,127.0.1.1:50626,127.0.1.1:50627', 'PADDLE_DISTRI_BACKEND': 'gloo', 'PADDLE_LOG_DIR': '/home/user/works/PaddleNLP/applications/text_classification/multi_class/log'}
LAUNCH INFO 2024-05-29 01:07:31,010 ------------------------- ERROR LOG DETAIL -------------------------
TrainProcess: 13%|█▎ | 992/7900 [37:47:27<330:09:25, 172.06s/it]] - Pre device batch size = 8
[2024-05-28 05:10:27,907] [ INFO] - Total Batch size = 128
[2024-05-28 05:42:26,468] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 05:42:26)
[2024-05-28 08:13:44,402] [ INFO] - ***** Running Evaluation *****
[2024-05-28 08:13:44,402] [ INFO] - Num examples = 10000
[2024-05-28 08:13:44,403] [ INFO] - Total prediction steps = 79
[2024-05-28 08:13:44,403] [ INFO] - Pre device batch size = 8
[2024-05-28 08:13:44,403] [ INFO] - Total Batch size = 128
[2024-05-28 08:45:38,960] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 08:45:38)
[2024-05-28 11:26:10,457] [ INFO] - ***** Running Evaluation *****
[2024-05-28 11:26:10,457] [ INFO] - Num examples = 10000
[2024-05-28 11:26:10,457] [ INFO] - Total prediction steps = 79
[2024-05-28 11:26:10,457] [ INFO] - Pre device batch size = 8
[2024-05-28 11:26:10,458] [ INFO] - Total Batch size = 128
[2024-05-28 11:58:17,243] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 11:58:17)
[2024-05-28 14:48:44,482] [ INFO] - ***** Running Evaluation *****
[2024-05-28 14:48:44,483] [ INFO] - Num examples = 10000
[2024-05-28 14:48:44,483] [ INFO] - Total prediction steps = 79
[2024-05-28 14:48:44,483] [ INFO] - Pre device batch size = 8
[2024-05-28 14:48:44,483] [ INFO] - Total Batch size = 128
[2024-05-28 15:20:58,564] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 15:20:58)
[2024-05-28 18:29:31,297] [ INFO] - ***** Running Evaluation *****
[2024-05-28 18:29:31,298] [ INFO] - Num examples = 10000
[2024-05-28 18:29:31,298] [ INFO] - Total prediction steps = 79
[2024-05-28 18:29:31,298] [ INFO] - Pre device batch size = 8
[2024-05-28 18:29:31,298] [ INFO] - Total Batch size = 128
[2024-05-28 19:01:31,771] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 19:01:31)
[2024-05-28 22:30:47,420] [ INFO] - ***** Running Evaluation *****
[2024-05-28 22:30:47,421] [ INFO] - Num examples = 10000
[2024-05-28 22:30:47,421] [ INFO] - Total prediction steps = 79
[2024-05-28 22:30:47,421] [ INFO] - Pre device batch size = 8
[2024-05-28 22:30:47,421] [ INFO] - Total Batch size = 128
[2024-05-28 23:02:40,192] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 23:02:40)
terminate called after throwing an instance of 'gloo::TimeoutException'
what(): [/paddle/third_party/gloo/gloo/transport/tcp/pair.cc:587] TIMEOUT self_rank = 13 pair_rank = 14 peer_str = [127.0.1.1]:14217
LAUNCH INFO 2024-05-29 01:07:44,368 Exit code -15
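This is reproducible: every run fails with gloo::TimeoutException, and then the program exits.
What causes this? Do I need to change the gloo timeout? If so, where should it be changed?
The versions of the main packages are: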
onnx 1.16.0
onnxruntime 1.16.3
opencv-python 4.6.0.66
opt-einsum 3.3.0
orjson 3.10.3
packaging 24.0
paddle2onnx 1.2.1
paddlefsl 1.1.0
paddlenlp 2.8.0
paddlepaddle 2.6.1
paddleslim 2.6.0
pandas 2.0.3
Pillow 7.0.0
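4 answers
kqhtkvqz1#
Judging from the error message, this is a communication timeout. You can try setting the environment variable FLAGS_stop_check_timeout:
export FLAGS_stop_check_timeout=xxx (the value is in seconds)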
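As a rough sketch of the suggestion above, the variable can also be set from a small Python wrapper so that it is guaranteed to be in the environment of the launcher and the worker processes it spawns. The helper names `build_launch_env` and `build_launch_command` are hypothetical and only for illustration; the flag name and the launch arguments come from this thread, and the sketch assumes the flag is read from the process environment when the job starts.

```python
import os
import shlex


def build_launch_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with FLAGS_stop_check_timeout set.

    Hypothetical helper: the value is in seconds, as stated in the answer.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FLAGS_stop_check_timeout"] = str(int(timeout_seconds))
    return env


def build_launch_command(nproc_per_node, script, script_args):
    """Build the paddle.distributed.launch command line used in the question."""
    return [
        "python3", "-m", "paddle.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    env = build_launch_env(3600)
    cmd = build_launch_command(16, "train.py", ["--do_train", "--device", "cpu"])
    # Print the command instead of running it; to actually launch, pass both
    # to subprocess.run(cmd, env=env, check=True).
    print("FLAGS_stop_check_timeout=" + env["FLAGS_stop_check_timeout"], shlex.join(cmd))
```

Building the environment explicitly makes it easy to confirm that the value really reaches the launched job, which is worth checking before concluding that the flag has no effect.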
1hdlvixo2#
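Judging from the error message, this is a communication timeout. You can try setting the environment variable FLAGS_stop_check_timeout: export FLAGS_stop_check_timeout=xxx (the value is in seconds)
OK, I will test with this parameter.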
ttp71kqs3#
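Judging from the error message, this is a communication timeout. You can try setting the environment variable FLAGS_stop_check_timeout: export FLAGS_stop_check_timeout=xxx (the value is in seconds)
This afternoon I added this parameter, set the value to 3600, and resumed training with resume_from_checkpoint enabled, but the error still occurs.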
wtlkbnrh4#
Besides the FLAGS_stop_check_timeout parameter, could there be other causes?
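One way to start narrowing this down is to measure how long the job stays silent between log lines and compare that with the configured timeout. The sketch below is only a diagnostic aid and does not identify the cause. The sample lines are copied from the log in the question (the third one is abridged), and the function names are hypothetical.

```python
import re
from datetime import datetime

# Sample lines taken from the log in the question (third line abridged).
LOG_LINES = [
    "[2024-05-28 22:30:47,420] [ INFO] - ***** Running Evaluation *****",
    "[2024-05-28 23:02:40,192] [ INFO] - [timelog] checkpoint saving time: 0.00s (2024-05-28 23:02:40)",
    "[2024-05-28 23:08:09,609] [ INFO] - loss: 0.00859011, learning_rate: 2.932e-05, global_step: 950",
]

# Matches the leading "[YYYY-MM-DD HH:MM:SS,mmm]" stamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def parse_timestamp(line):
    """Return the line's timestamp (milliseconds dropped), or None if absent."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def gaps_in_seconds(lines):
    """Return the gaps, in seconds, between consecutive timestamped lines."""
    stamps = [t for t in map(parse_timestamp, lines) if t is not None]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


def gaps_over(lines, timeout_seconds):
    """Return only the gaps that are longer than the given timeout."""
    return [gap for gap in gaps_in_seconds(lines) if gap > timeout_seconds]


if __name__ == "__main__":
    print(gaps_in_seconds(LOG_LINES))
    print(gaps_over(LOG_LINES, 900))
```

Running this over the per-rank log files, rather than the merged console output, would show whether a particular rank goes quiet for longer than the timeout before the exception is raised.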