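Background:
Environment:
Machine: physical machine with 8 P40 GPUs
docker: paddlecloud image: iregistry.baidu-int.com/paddlecloud/base-images:paddlecloud-ubuntu20.04-gcc8.2-cuda11.8-cudnn8.9-openmpi4.1.5-codelab1.6.1.5-bccl2.15.5.4-hadoop2.2.4.2-afsshell1.9.3.4095
paddlepaddle-gpu: 2.6.1
paddle.utils.run_check() passes for both single-GPU and multi-GPU
Training mode: fleet + static graph
Single-GPU training works without problems
Problem:
Training on two or more GPUs fails with an error
Error log: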
env {'SHELL': '/bin/bash', 'NV_LIBCUBLAS_VERSION': '11.11.3.6-1', 'NVIDIA_VISIBLE_DEVICES': 'all', 'NV_NVML_DEV_VERSION': '11.8.86-1', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.16.2-1+cuda11.8', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.16.2-1', 'HOSTNAME': 'XXXXXX', 'LANGUAGE': 'en_US.UTF-8', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-11-8=11.11.3.6-1', 'NV_NVTX_VERSION': '11.8.86-1', 'NV_CUDA_CUDART_DEV_VERSION': '11.8.89-1', 'NV_LIBCUSPARSE_VERSION': '11.7.5.86-1', 'NV_LIBNPP_VERSION': '11.8.0.86-1', 'NCCL_VERSION': '2.16.2-1', 'PWD': '/root/work/baidu/map-navi-rec/travel-recommend', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.9.0.131-1+cuda11.8', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'WITH_AVX': 'ON', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-11-8=11.8.87-1', 
'NV_LIBNPP_PACKAGE': 'libnpp-11-8=11.8.0.86-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'TZ': 'Asia/Shanghai', 'NV_LIBCUBLAS_DEV_VERSION': '11.11.3.6-1', 'BASH': '/bin/sh', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-11-8', 'NV_CUDA_CUDART_VERSION': '11.8.89-1', 'HOME': '/root', 'LANG': 'en_US.UTF-8', 'CUDA_VERSION': '11.8.0', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-11-8=11.11.3.6-1', 'NVIDIA_TOOLS': '/home/opt/cuda_tools', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-11-8=11.8.0-1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-11-8=11.8.0.86-1', 'GOROOT': '/usr/local/go', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-11-8', 'NV_LIBNPP_DEV_VERSION': '11.8.0.86-1', 'OPENMPI_HOME': '/usr/local/openmpi-4.1.5', 'WITH_GPU': 'ON', 'TERM': 'xterm', 'NV_LIBCUSPARSE_DEV_VERSION': '11.7.5.86-1', 'HADOOP_HOME': '/root/paddlejob/hadoop-client/hadoop', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'NV_CUDNN_VERSION': '8.9.0.131', 'SHLVL': '2', 'HOME_WORK_DIR': '/root/paddlejob', 'NV_CUDA_LIB_VERSION': '11.8.0-1', 'NVARCH': 'x86_64', 'CUDNN_VERSION': '8.6.0', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.9.0.131-1+cuda11.8', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-11-8', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.16.2-1+cuda11.8', 'LD_LIBRARY_PATH': '/usr/local/lib:/usr/local/openmpi-4.1.5/lib:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/:/usr/lib64:/usr/local/cuda-11.8/targets/x86_64-linux/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '11.8.0-1', 'NV_NVPROF_VERSION': '11.8.87-1', 'LC_ALL': 'en_US.UTF-8', 'PATH': 
'/root/work/tools/ripgrep-13.0.0-x86_64-unknown-linux-musl/:/root/work/tools/ripgrep-13.0.0-x86_64-unknown-linux-musl/:/usr/local/bin:/usr/local/openmpi-4.1.5/bin:/home/cmake-3.16.0-Linux-x86_64/bin:/home/opt/cuda_tools:/bin:/bin:/root/paddlejob/hadoop-client/hadoop/bin:/usr/local/gcc-8.2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_LIBNCCL_PACKAGE_VERSION': '2.16.2-1', 'DEBIAN_FRONTEND': 'noninteractive', 'OLDPWD': '/root/work', 'GOPATH': '/root/gopath', '_': '/usr/local/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'bapotg', 'PADDLE_MASTER': '10.255.75.25:42010', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '10.255.75.25:42011', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'PADDLE_TRAINER_ENDPOINTS': '10.255.75.25:42011,10.255.75.25:42012', 'FLAGS_selected_gpus': '0', 'PADDLE_LOG_DIR': '/root/work/baidu/map-navi-rec/travel-recommend/log'}
LAUNCH INFO 2024-07-19 17:18:34,227 ------------------------- ERROR LOG DETAIL -------------------------
exe.run(
ValueError: In user code:
File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 360, in <module>
main(args)
File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 349, in main
train(conf, args.dataset_dir, args.dataset_type, args.out_dir, args.log_dir, resume=args.resume)
File "/root/work/baidu/map-navi-rec/travel-recommend/paddle_infp/train_v3.py", line 122, in train
optimizer.minimize(avg_cost)
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/fleet.py", line 1551, in minimize
return self._minimize_impl(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/fleet.py", line 1786, in _minimize_impl
optimize_ops, params_grads = meta_optimizer.minimize(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/meta_optimizer_base.py", line 103, in minimize
optimize_ops, params_grads = self.minimize_impl(
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 185, in minimize_impl
self._transpile_main_program(loss)
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 290, in _transpile_main_program
self._allreduce_fusion_program()
File "/usr/local/lib/python3.8/dist-packages/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py", line 503, in _allreduce_fusion_program
block._insert_op_without_sync(
File "/usr/local/lib/python3.8/dist-packages/paddle/base/framework.py", line 4507, in _insert_op_without_sync
op = Operator(block=self, desc=op_desc, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/paddle/base/framework.py", line 3016, in __init__
for frame in traceback.extract_stack():
InvalidArgumentError: The start row index must be less than the end row index.But received the start index = 0, the end index = 0.
[Hint: Expected begin_idx < end_idx, but received begin_idx:0 >= end_idx:0.] (at /paddle/paddle/phi/core/dense_tensor_impl.cc:309)
[operator < coalesce_tensor > error]
C++ Traceback (most recent call last):
0 paddle::framework::ScopePool::Clear()
1 paddle::framework::ScopePool::DeleteScope(paddle::framework::Scope*)
2 paddle::framework::Scope::~Scope()
3 paddle::framework::Scope::DropKids()
4 paddle::framework::Scope::~Scope()
5 paddle::framework::Variable::PlaceholderImpl<phi::SelectedRows>::~PlaceholderImpl()
Error Message Summary:
FatalError: Segmentation fault
is detected by the operating system.
[TimeInfo: *** Aborted at 1721380713 (unix time) try "date -d @1721380713" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0xa) received by PID 12438 (TID 0x7f966677c740) from PID 10 ***]
LAUNCH INFO 2024-07-19 17:18:36,845 Exit code -11
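To make the tail of the log easier to read, here is a minimal sketch that decodes three facts from it: the launcher variables for this rank, the `Exit code -11`, and the abort timestamp. All values are copied from the log above. The helper names `describe_exit_code` and `abort_time_in_shanghai` are hypothetical and are not part of Paddle. The sketch assumes the launcher reports a signal-terminated worker the way Python's `subprocess` does, as a negative signal number, and that UTC+8 is the right offset because the env dump shows `TZ` set to `Asia/Shanghai`.

```python
import signal
from datetime import datetime, timedelta, timezone

# Values copied verbatim from the log above (rank 0 of a 2-GPU, single-node job).
launcher_env = {
    "PADDLE_TRAINERS_NUM": "2",
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_NNODES": "1",
    "PADDLE_TRAINER_ENDPOINTS": "10.255.75.25:42011,10.255.75.25:42012",
    "FLAGS_selected_gpus": "0",
}
exit_code = -11               # from "LAUNCH INFO ... Exit code -11"
abort_unix_time = 1721380713  # from "[TimeInfo: *** Aborted at 1721380713 ...]"


def describe_exit_code(code: int) -> str:
    """Name the signal behind a negative exit code (assumed subprocess convention)."""
    if code < 0:
        return signal.Signals(-code).name
    return f"exit status {code}"


def abort_time_in_shanghai(unix_time: int) -> str:
    """Format a unix timestamp in UTC+8, the offset used by Asia/Shanghai."""
    utc_plus_8 = timezone(timedelta(hours=8))
    moment = datetime.fromtimestamp(unix_time, tz=utc_plus_8)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


endpoints = launcher_env["PADDLE_TRAINER_ENDPOINTS"].split(",")
print(f"trainer {launcher_env['PADDLE_TRAINER_ID']} of "
      f"{launcher_env['PADDLE_TRAINERS_NUM']}, endpoints: {endpoints}")
print("exit code means:", describe_exit_code(exit_code))
print("aborted at:", abort_time_in_shanghai(abort_unix_time))
```

The decoded signal is SIGSEGV, which matches the `SignalInfo` line, and the abort time falls one second before the `ERROR LOG DETAIL` line. Read this way, the segmentation fault happens during scope teardown after the `coalesce_tensor` `InvalidArgumentError` has already been raised, so the Python traceback is probably the more useful lead.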
4 answers
ccgok5k5 1#
What is the model, specifically? Is there a small code sample that reproduces the problem?
66bbxpm5 2#
Model: 3-layer MLP
Training code:
Launch script:
a6b3iqyw 3#
For multi-GPU training, we recommend the new auto-parallel approach, which is simpler to write. For details, see: https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/auto_parallel_cn.html
vh0rcniy 4#
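I will look into it, but please help with this problem first. Most of our code cannot be upgraded to 3.0 for now, and I do not know whether upgrading to 3.0 would fix the problem anyway.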