unilm LayoutLM - CUDA invalid device ordinal error

Posted by jc3wubiy 5 months ago in Other


Bug description

We are running the file
unilm/layoutlmft/examples/run_xfun_re.py
but we get the error
RuntimeError: CUDA error: invalid device ordinal
torch._C._cuda_setDevice(device)
The model I am using is:
LayoutLM
The problem arises when using:

The official example scripts: (details below)
My own modified scripts: (details below)

We are running the file unilm/layoutlmft/examples/run_xfun_re.py with the exact command given in the instructions, without any changes.
However, we get the error RuntimeError: CUDA error: invalid device ordinal, raised from torch._C._cuda_setDevice(device).
Software versions:
Python 3.7.10
CUDA 10.2
PyTorch 1.8.0
TorchVision 0.9.0

python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir /tmp/test-ner \
--do_train \
--do_eval \
--lang zh \
--max_steps 2500 \
--per_device_train_batch_size 2 \
--warmup_ratio 0.1 \
--fp16

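Steps to reproduce

Steps to reproduce the behavior:

Run the program with the same command documented here: https://github.com/microsoft/unilm/tree/master/layoutxlm#fine-tuning-for-relation-extraction
Execution starts and terminates almost immediately with the error RuntimeError: CUDA error: invalid device ordinal, raised from torch._C._cuda_setDevice(device)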

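Expected behavior

We expect training to run to completion, because we are running the example code without any changes and with the same command.

Platform: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1058-aws x86_64v)
AMI: Deep Learning AMI (Ubuntu 18.04) Version 49.0
Python version: 3.7.10
PyTorch version (GPU?): 1.8.0 with GPU
CUDA version: 10.2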
TorchVision 0.9.0
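
The environment details above do not include the number of GPUs visible to PyTorch, which is the value a device ordinal is checked against. A minimal sketch for collecting it alongside the versions is shown here. The function name describe_environment is hypothetical; torch.version.cuda and torch.cuda.device_count() are standard PyTorch attributes, and the sketch falls back gracefully when torch is not installed.

```python
import sys


def describe_environment():
    """Collect the version and GPU facts relevant to a device-ordinal error."""
    info = {"python": sys.version.split()[0], "torch": None}
    try:
        import torch  # optional: the sketch still runs without it
    except ImportError:
        return info
    info["torch"] = torch.__version__
    # CUDA version this torch build was compiled against (None for CPU builds)
    info["cuda_build"] = torch.version.cuda
    # Number of GPUs this process can address; valid ordinals are 0..n-1
    info["visible_gpus"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If visible_gpus is smaller than the --nproc_per_node value passed to the launcher, some processes will be given a device index that does not exist.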


Source: https://github.com/microsoft/unilm/issues/511

3 answers


ogq8wdun1#

Correction:
We realized that the requirements.txt file needed to be modified. We then installed detectron2 0.6 with the following command:

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

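Because detectron2 0.6 has its own torch version requirement, we had to change the torch version. We also had to use version 0.6 because version 0.3 could not be installed.
These are the problems we ran into when trying to install detectron2 0.3 (which is why we used 0.6 above).
Pinning the version, with torch 1.7.1 and torchvision 0.8.2 installed, gives: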

python -m pip install detectron2==0.3 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
Looking in links: https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
ERROR: Could not find a version that satisfies the requirement detectron2==0.3 (from versions: none)
ERROR: No matching distribution found for detectron2==0.3

Installing without a version pin, again with torch 1.7.1 and torchvision 0.8.2, fails the same way:

python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
Looking in links: https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.7/index.html
ERROR: Could not find a version that satisfies the requirement detectron2 (from versions: none)
ERROR: No matching distribution found for detectron2
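
Both commands above point pip at the cu101/torch1.7 wheel index, while the original report lists CUDA 10.2. Whether that difference explains the "from versions: none" result is not confirmed anywhere in this thread, so the following is only a sketch for building the index URL from the versions actually installed. The helper name detectron2_wheel_index is hypothetical, the URL pattern is copied from the commands above, and nothing here checks that a wheel exists for a given combination.

```python
def detectron2_wheel_index(cuda_version: str, torch_version: str) -> str:
    """Build a wheel index URL following the pattern used in the commands above.

    Assumption: the index path is cu<major><minor>/torch<major>.<minor>.
    This does not verify that the resulting index actually hosts any wheels.
    """
    cuda_tag = "cu" + "".join(cuda_version.split(".")[:2])
    # Drop any local build suffix such as "+cu102" before taking major.minor
    base = torch_version.split("+")[0]
    torch_tag = "torch" + ".".join(base.split(".")[:2])
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{cuda_tag}/{torch_tag}/index.html"
    )


if __name__ == "__main__":
    # Reproduces the index used in the failing commands
    print(detectron2_wheel_index("10.1", "1.7.1"))
    # The same torch version paired with the CUDA version from the report
    print(detectron2_wheel_index("10.2", "1.7.1"))
```

Comparing the generated URL with the one passed to -f is a quick way to spot a CUDA or torch tag that does not match the installed build.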

In addition, we identified the actual GPU type: NVIDIA K80 GPUs. This is a p2.xlarge instance type.


nimxete22#

The root cause is not a version problem.
Please look at the source code of the _setup_devices function: https://huggingface.co/transformers/v3.3.1/_modules/transformers/training_args.html
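
To make the point about device setup concrete, here is a simplified model of the situation, not the library's actual code. It assumes that torch.distributed.launch starts one process per local rank 0..nproc_per_node-1 and that each process then selects the device cuda:<local_rank>, which is how the distributed branch of _setup_devices is understood here. The function names are hypothetical.

```python
from typing import List


def local_ranks(nproc_per_node: int) -> List[int]:
    """Local ranks handed out on a single node, one per launched process."""
    return list(range(nproc_per_node))


def failing_ranks(nproc_per_node: int, visible_gpus: int) -> List[int]:
    """Ranks whose device cuda:<rank> does not exist.

    Valid device ordinals are 0..visible_gpus-1, so any rank at or above
    visible_gpus would hit "invalid device ordinal" when selecting its device.
    """
    return [rank for rank in local_ranks(nproc_per_node) if rank >= visible_gpus]


if __name__ == "__main__":
    # The command in the report uses --nproc_per_node=4
    print(failing_ranks(4, 1))  # [1, 2, 3]
    print(failing_ranks(4, 4))  # []
```

With 4 processes and 1 visible GPU the sketch reports ranks 1, 2 and 3, which is consistent with the error in the report if only a single GPU is visible on the machine.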

Fix:

If you have only one GPU, run the following command before running the example script: