Paddle multi-node parameter-server distributed training fails

acruukt9 posted on 2022-12-31 in Other


A check of multi-node communication reports that it is working normally:
server:

worker:

Running the job then fails:
$ python -m paddle.distributed.launch --master=169.254.60.61:60437 --nnodes=2 train.py
LAUNCH WARNING 2022-10-25 15:17:05,740 Host ip reset to 169.254.60.61
LAUNCH INFO 2022-10-25 15:17:05,740 ----------- Configuration ----------------------
LAUNCH INFO 2022-10-25 15:17:05,740 devices: None
LAUNCH INFO 2022-10-25 15:17:05,740 elastic_level: -1
LAUNCH INFO 2022-10-25 15:17:05,740 elastic_timeout: 30
LAUNCH INFO 2022-10-25 15:17:05,740 gloo_port: 6767
LAUNCH INFO 2022-10-25 15:17:05,740 host: 169.254.60.61
LAUNCH INFO 2022-10-25 15:17:05,741 job_id: default
LAUNCH INFO 2022-10-25 15:17:05,741 legacy: False
LAUNCH INFO 2022-10-25 15:17:05,741 log_dir: log
LAUNCH INFO 2022-10-25 15:17:05,741 log_level: INFO
LAUNCH INFO 2022-10-25 15:17:05,741 master: 169.254.60.61:60437
LAUNCH INFO 2022-10-25 15:17:05,741 max_restart: 3
LAUNCH INFO 2022-10-25 15:17:05,741 nnodes: 2
LAUNCH INFO 2022-10-25 15:17:05,741 nproc_per_node: None
LAUNCH INFO 2022-10-25 15:17:05,741 rank: -1
LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective
LAUNCH INFO 2022-10-25 15:17:05,741 server_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 servers:
LAUNCH INFO 2022-10-25 15:17:05,741 trainer_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 trainers:
LAUNCH INFO 2022-10-25 15:17:05,741 training_script: train.py
LAUNCH INFO 2022-10-25 15:17:05,741 training_script_args: []
LAUNCH INFO 2022-10-25 15:17:05,741 with_gloo: 0
LAUNCH INFO 2022-10-25 15:17:05,741 --------------------------------------------------
LAUNCH INFO 2022-10-25 15:17:05,745 Job: default, mode collective, replicas 2[2:2], elastic False
LAUNCH INFO 2022-10-25 15:17:05,745 Waiting peer start...
LAUNCH INFO 2022-10-25 15:17:09,023 Run Pod: fpwuxo, replicas 1, status ready
LAUNCH INFO 2022-10-25 15:17:09,035 Watching Pod: fpwuxo, replicas 1, status running
/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/executor.py:400: UserWarning: do not use standalone executor in fleet by default
warnings.warn("do not use standalone executor in fleet by default")
/home/ubuntu/.local/lib/python3.8/site-packages/paddle/distributed/fleet/base/fleet_base.py:125: UserWarning: init_worker() function doesn't work when use non_distributed fleet.
warnings.warn(
device worker program id: 139891026931088
I1025 15:17:09.854562 159406 multi_trainer.cc:164] MultiTrainer::InitOtherEnv Communicator is null!
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
what(): In user code:

File "train.py", line 10, in <module>
  model.net(is_train=True)
File "/home/ubuntu/桌面/wide_and_deep_dataset/model.py", line 177, in net
  pred = wide_deep_model.forward(sparse_inputs, dense_input)
File "/home/ubuntu/桌面/wide_and_deep_dataset/model.py", line 58, in forward
  emb = paddle.static.nn.sparse_embedding(s_input, size = [1024, self.sparse_feature_dim], param_attr=paddle.ParamAttr(name="embedding"))
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/contrib/layers/nn.py", line 1188, in sparse_embedding
  helper.append_op(
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/layer_helper.py", line 44, in append_op
  return self.main_program.current_block().append_op(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/framework.py", line 3615, in append_op
  op = Operator(
File "/home/ubuntu/.local/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2635, in __init__
  for frame in traceback.extract_stack():

NotFoundError: Input id (227854) is not in current rows table. (at /paddle/paddle/phi/core/selected_rows_impl.h:84)
  [operator < lookup_table > error]

C++ Traceback (most recent call last):

0 paddle::framework::HogwildWorker::TrainFiles()

Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: *** Aborted at 1666682230 (unix time) try "date -d @1666682230" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3e800026eae) received by PID 159406 (TID 0x7f3ae8ec4700) from PID 159406 ***]

LAUNCH INFO 2022-10-25 15:17:31,067 Pod failed
LAUNCH ERROR 2022-10-25 15:17:31,067 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python3.8', '-u', 'train.py'] code -6 log log/default.fpwuxo.0.log
env {'SHELL': '/bin/bash', 'SESSION_MANAGER': 'local/ubuntu-Precision-5820-Tower-X-Series:@/tmp/.ICE-unix/1785,unix/ubuntu-Precision-5820-Tower-X-Series:/tmp/.ICE-unix/1785', 'QT_ACCESSIBILITY': '1', 'XDG_CONFIG_DIRS': '/etc/xdg/xdg-ubuntu:/etc/xdg', 'XDG_MENU_PREFIX': 'gnome-', 'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', '_CE_M': '', 'TERMINAL_EMULATOR': 'JetBrains-JediTerm', 'LANGUAGE': 'zh_CN:en_US:en', 'LC_ADDRESS': 'zh_CN.UTF-8', 'GNOME_SHELL_SESSION_MODE': 'ubuntu', 'LC_NAME': 'zh_CN.UTF-8', 'SSH_AUTH_SOCK': '/run/user/1000/keyring/ssh', 'TERM_SESSION_ID': 'c2edaa9a-8094-4523-8d8f-2bb4e93050cf', 'XMODIFIERS': '@im=ibus', 'DESKTOP_SESSION': 'ubuntu', 'LC_MONETARY': 'zh_CN.UTF-8', 'SSH_AGENT_PID': '1750', 'GTK_MODULES': 'gail:atk-bridge', 'PWD': '/home/ubuntu/桌面/wide_and_deep_dataset', 'XDG_SESSION_DESKTOP': 'ubuntu 'LOGNAME': 'ubuntu', 'XDG_SESSION_TYPE': 'x11', 'CONDA_PREFIX': '/home/ubuntu/anaconda3', 'GPG_AGENT_INFO': '/run/user/1000/gnupg/S.gpg-agent:0:1', 'XAUTHORITY': '/run/user/1000/gdm/Xauthority', 'DESKTOP_STARTUP_ID': 'gnome-shell/PyCharm Professional Edition/1799-63-ubuntu-Precision-5820-Tower-X-Series_TIME884648996', 'GJS_DEBUG_TOPICS': 'JS ERROR;JS LOG', 'WINDOWPATH': '2', 'HOME': '/home/ubuntu', 'USERNAME': 'ubuntu', 'IM_CONFIG_PHASE': '1', 'LANG': 'zh_CN.UTF-8', 'LC_PAPER': 'zh_CN.UTF-8', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:', 'XDG_CURRENT_DESKTOP': 'ubuntu:GNOME', 'VIRTUAL_ENV': '/home/ubuntu/venv', 'CONDA_PROMPT_MODIFIER': '(base) ', 'INVOCATION_ID': 'e657d89aa279488b9eef4a9e2db9b1b7', 'MANAGERPID': '1563', 'GJS_DEBUG_OUTPUT': 'stderr', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'XDG_SESSION_CLASS': 'user', 'TERM': 'xterm-256color', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', '*CE_CONDA': '', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'USER': 'ubuntu', 'CONDA_SHLVL': '1', 'DISPLAY': ':1', 'SHLVL': '1', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'QT_IM_MODULE': 'ibus', 'LC_MEASUREMENT': 'zh_CN.UTF-8', 'PAPERSIZE': 'a4', 'POD_IP': '169.254.60.61', 
'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'LD_LIBRARY_PATH': '/usr/local/cuda/lib64:/usr/local/lib:/home/ubuntu/nccl_2.8.4-1+cuda11.2_x86_64/include/:/~/nccl_2.8.4-1+cuda11.2_x86_64/lib', 'XDG_RUNTIME_DIR': '/run/user/1000', 'PS1': '(venv) (base) [\e]0;\u@\h: \w\a]${debian_chroot:+($debian_chroot)}[\033[01;32m]\u@\h[\033[00m]:[\033[01;34m]\w[\033[00m]$ ', 'CONDA_DEFAULT_ENV': 'base', 'LC_TIME': 'zh_CN.UTF-8', 'JOURNAL_STREAM': '8:54528', 'XDG_DATA_DIRS': '/usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop', 'PATH': '/home/ubuntu/venv/bin:/home/ubuntu/anaconda3/bin:/home/ubuntu/anaconda3/condabin:/usr/local/cuda/bin:/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ubuntu/anaconda3/bin', 'GDMSESSION': 'ubuntu', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'GIO_LAUNCHED_DESKTOP_FILE_PID': '151672', 'GIO_LAUNCHED_DESKTOP_FILE': '/usr/share/applications/jetbrains-pycharm.desktop', 'LC_NUMERIC': 'zh_CN.UTF-8', '*': '/usr/bin/python3.8', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'PADDLE_MASTER': '169.254.60.61:46249', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_TRAINER_ENDPOINTS': '169.254.60.61:35173,127.0.1.1:48943', 'PADDLE_CURRENT_ENDPOINT': '169.254.60.61:35173', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_cpus': ''}
LAUNCH INFO 2022-10-25 15:17:31,561 Exit code -6

I am running the CPU build of Paddle 2.3.2.
What is going wrong here?
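
One reading of the log above, offered as a hypothesis rather than a confirmed diagnosis: the launcher reports `run_mode: collective` with `server_num` and `trainer_num` both `None`, and the warning says `init_worker()` does nothing with a non-distributed fleet, so the job may never have started in parameter-server mode. If so, `sparse_embedding` would be looking ids up in a local 1024-row table, which could explain why id 227854 is not found. The sketch below is plain Python with no Paddle dependency. It reads the launcher's configuration dump and flags settings that do not look like a parameter-server job. `parse_launch_config` and `check_ps_settings` are hypothetical helper names, and the expected `run_mode` value `ps` is an assumption to verify against the launcher documentation for your Paddle version.

```python
import re

# Matches launcher configuration lines such as
# "LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective"
CONFIG_LINE = re.compile(
    r"^LAUNCH INFO \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+ (\w+):\s*(.*)$"
)


def parse_launch_config(log_text):
    """Collect key/value pairs from the launcher's configuration dump."""
    config = {}
    for line in log_text.splitlines():
        match = CONFIG_LINE.match(line.strip())
        if match:
            config[match.group(1)] = match.group(2).strip()
    return config


def _is_unset(value):
    """The launcher prints unset options as 'None' or as an empty string."""
    return value in (None, "", "None")


def check_ps_settings(config):
    """Return warnings if the dump does not look like a parameter-server job."""
    warnings = []
    # Assumption: parameter-server jobs report run_mode "ps".
    if config.get("run_mode") != "ps":
        warnings.append(
            "run_mode is %r, not 'ps'" % config.get("run_mode")
        )
    # Either a count or an explicit endpoint list would indicate the role is set.
    for count_key, list_key in (("server_num", "servers"),
                                ("trainer_num", "trainers")):
        if _is_unset(config.get(count_key)) and _is_unset(config.get(list_key)):
            warnings.append(
                "neither %s nor %s is set" % (count_key, list_key)
            )
    return warnings


# Lines copied from the log in the question.
SAMPLE_LOG = """
LAUNCH INFO 2022-10-25 15:17:05,741 nnodes: 2
LAUNCH INFO 2022-10-25 15:17:05,741 run_mode: collective
LAUNCH INFO 2022-10-25 15:17:05,741 server_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 servers:
LAUNCH INFO 2022-10-25 15:17:05,741 trainer_num: None
LAUNCH INFO 2022-10-25 15:17:05,741 trainers:
"""

for warning in check_ps_settings(parse_launch_config(SAMPLE_LOG)):
    print("WARNING:", warning)
```

If the warnings fire, the next thing to check is how the job is launched and whether the training script actually selects a parameter-server role before building the model. The environment dump also shows the second trainer endpoint as `127.0.1.1:48943`, a loopback address, which may be worth checking separately.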

z9zf31ra #1

Hello, we have received your issue and will arrange for technical staff to respond as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error messages. You can also consult the official API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day!

mepcadol #2

Has anyone replied?
