Title: [Paper Reproduction] Error in single-machine multi-GPU training
Version and environment information:

1) PaddlePaddle version: 2.1.2
2) CPU: Intel Xeon32
3) GPU: Tesla V1004; CUDA version: 10.1.243, cuDNN version: None.None.None, Nvidia driver version: 418.67
4) System environment: CentOS 6.10, python 3.7.0

Training information

1) Single machine, multiple GPUs
2) GPU memory: 32480MiB
3) Operator information

Reproduction information:

1. Single-machine single-GPU training works normally.
2. Single-machine multi-GPU training fails:

python -m paddle.distributed.launch main_multi_gpu.py

Problem description: Please describe your problem in detail and paste the error message, logs, and a reproducible code snippet.

merging config from configs/swinv2_tiny_patch4_window7_224.yaml
----- Imagenet2012 image train list len = 40000
----- Imagenet2012 image val list len = 10000
1203 10:24:33 AM 
AMP: False
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: None
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RE_COUNT: 1
  RE_MODE: pixel
  RE_PROB: 0.25
BASE: ['']
DATA:
  BATCH_SIZE: 64
  BATCH_SIZE_EVAL: 8
  CROP_PCT: 0.9
  DATASET: imagenet2012
  DATA_PATH: ILSVRC2012mini
  IMAGE_SIZE: 224
  NUM_WORKERS: 8
EVAL: False
LOCAL_RANK: 0
MODEL:
  ATTENTION_DROPOUT: 0.0
  DROPOUT: 0.0
  DROP_PATH: 0.2
  NAME: swin_tiny_patch4_window7_224
  NUM_CLASSES: 1000
  PRETRAINED: None
  RESUME: None
  TRANS:
    APE: False
    EMBED_DIM: 96
    EXTRA_NORM: False
    IN_CHANNELS: 3
    MLP_RATIO: 4.0
    NUM_HEADS: [3, 6, 12, 24]
    PATCH_NORM: True
    PATCH_SIZE: 4
    QKV_BIAS: True
    QK_SCALE: None
    STAGE_DEPTHS: [2, 2, 6, 2]
    WINDOW_SIZE: 7
  TYPE: swin
NGPUS: 1
REPORT_FREQ: 50
SAVE: /root/paddlejob/workspace/output//train-20211203-10-24-26
SAVE_FREQ: 5
SEED: 42
TAG: default
TRAIN:
  ACCUM_ITER: 1
  AUTO_AUGMENT: True
  BASE_LR: 0.0005
  COLOR_JITTER: 0.4
  CUTMIX_ALPHA: 1.0
  CUTMIX_MINMAX: None
  END_LR: 5e-06
  GRAD_CLIP: 5.0
  LAST_EPOCH: 0
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    MILESTONES: 30, 60, 90
    NAME: warmupcosine
  MIXUP_ALPHA: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  NUM_EPOCHS: 300
  OPTIMIZER:
    BETAS: (0.9, 0.999)
    EPS: 1e-08
    MOMENTUM: 0.9
    NAME: AdamW
  RANDOM_ERASE_COUNT: 1
  RANDOM_ERASE_MODE: pixel
  RANDOM_ERASE_PROB: 0.25
  RANDOM_ERASE_SPLIT: False
  SMOOTHING: 0.1
  WARMUP_EPOCHS: 20
  WARMUP_START_LR: 5e-07
  WEIGHT_DECAY: 0.05
VALIDATE_FREQ: 10
1203 10:24:33 AM ----- world_size = 1, local_rank = 0
1203 10:24:33 AM ----- world_size = 1, local_rank = 0
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/parallel.py:120: UserWarning: Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything.
  "Currently not a parallel execution environment, `paddle.distributed.init_parallel_env` will not do anything."
Traceback (most recent call last):
  File "main_multi_gpu.py", line 574, in <module>
    main()
  File "main_multi_gpu.py", line 570, in main
    dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 501, in spawn
    while not context.join():
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 312, in join
    self._throw_exception(error_index)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception
    raise Exception(msg)
Exception: 

----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper
    result = func(*args)
  File "/root/paddlejob/workspace/code/main_multi_gpu.py", line 318, in main_worker
    model = build_model(config)
  File "/mnt/code_20211203102210/swin_transformer.py", line 772, in build_swin
    extra_norm=config.MODEL.TRANS.EXTRA_NORM)
  File "/mnt/code_20211203102210/swin_transformer.py", line 674, in __init__
    embed_dim=embed_dim)
  File "/mnt/code_20211203102210/swin_transformer.py", line 65, in __init__
    stride=patch_size)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 646, in __init__
    data_format=data_format)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 135, in __init__
    default_initializer=_get_default_param_initializer())
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 412, in create_parameter
    default_initializer)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 374, in create_parameter
  **attr._to_kwargs(with_initializer=True))
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2895, in create_parameter
    initializer(param, self)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 366, in __call__
    stop_gradient=True)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2925, in append_op
    kwargs.get("stop_gradient", False))
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 45, in trace_op
    not stop_gradient)
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88)
  [operator < gaussian_random > error]

2 answers


q5lcpyga1#

Hi! We have received your issue and will arrange for technical staff to answer it as soon as possible; please be patient. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also consult the official API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day!

2022-04-21

pxq42qpu2#

If you start the job with distributed.launch, you do not need to call spawn inside the program. See the explanation here: https://github.com/PaddlePaddle/models/blob/tipc/docs/lwfx/ArticleReproduction_CV.md#3.12
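
A minimal sketch of that advice, not the repository's actual code: when the script is started through paddle.distributed.launch, the launcher has already created the worker processes, so the entry point calls the worker directly and only falls back to dist.spawn for a plain `python main_multi_gpu.py` run. The helper names started_by_launch and run_training are hypothetical, and the check on the PADDLE_TRAINER_ID environment variable assumes the launcher exports it to each worker, which should be confirmed for your Paddle version.

```python
import os


def started_by_launch(env=None):
    """Return True if this process looks like a worker created by
    `python -m paddle.distributed.launch`.

    Assumption: the launcher exports PADDLE_TRAINER_ID to each worker.
    """
    env = os.environ if env is None else env
    return "PADDLE_TRAINER_ID" in env


def run_training(main_worker, worker_args, nprocs, env=None, spawn_fn=None):
    """Run main_worker without mixing the launcher and spawn."""
    if started_by_launch(env):
        # The launcher already started one process per GPU,
        # so run the worker directly in this process.
        return main_worker(*worker_args)
    if spawn_fn is None:
        # Imported lazily so this helper can be loaded without Paddle.
        import paddle.distributed as dist
        spawn_fn = dist.spawn
    # Plain `python main_multi_gpu.py`: create the workers ourselves.
    return spawn_fn(main_worker, args=worker_args, nprocs=nprocs)
```

With a helper like this, the dist.spawn(...) line in main() from the traceback would become run_training(main_worker, (config, dataset_train, dataset_val), config.NGPUS). The simpler alternative is to pick one start method and remove the other from the script entirely.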
