Paddle: interrupting model training on XPU always leaves a zombie process behind

Posted by xyhw6mcr 4 months ago in Other

Describe the Bug

Environment

CPU: Phytium S2500
OS: Kylin V10
XPU: Kunlun K200

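Problem description

When I run model training code on a Kunlun K200, training hangs every time it starts. Debugging shows that it hangs at the opt.step() call.
Note: the code has been verified in a Windows + GPU environment, where it runs without problems.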

for epoch_id in range(EPOCH_NUM):
    for batch_id, data in enumerate(data_loader()):
        images, labels = data
        predicts = model(images)
        # * compute the loss
        loss = paddle.nn.functional.pairwise_distance(predicts, paddle.cast(labels, dtype='float32'))
        avg_loss = paddle.mean(loss)
        # * back propagation
        avg_loss.backward()
        opt.step()  # **hangs here**
        opt.clear_grad()
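
One way to confirm where a silent hang like this sits is Python's standard `faulthandler` module, which can dump every thread's traceback once a timeout expires. The sketch below is only an illustration: `hang_watchdog` is a hypothetical helper name, and the 60-second timeout in the usage comment is an arbitrary assumption. It reports Python frames only, so if the hang is inside a native XPU kernel it will show the Python line that called into it, not the kernel itself.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump all Python thread tracebacks if the guarded body exceeds timeout_s.

    Hypothetical helper, not part of Paddle. `file` must expose fileno().
    """
    # exit=False: only report; the process keeps running so it can be inspected.
    # repeat=True: dump again every timeout_s seconds while the body is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file, exit=False)
    try:
        yield
    finally:
        # Cancel the pending dump once the guarded step has returned.
        faulthandler.cancel_dump_traceback_later()


# Usage inside the training loop above (opt is the optimizer from the post):
#     with hang_watchdog(60):
#         opt.step()
```

If the dump keeps pointing at the same `opt.step()` frame, that would support the observation that the optimizer step never returns.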

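After it hangs there is no error message and no other reaction, so the only option is to interrupt it with Ctrl+C or kill <pid>. After the interruption, xpu_smi shows that a process is still present on the accelerator card: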

DEVICES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   PCI Addr   | Model |        SN        |    INODE   | State | UseRate |     L3     |     Memory     | Power(W) | Temp | Freq(MHz) | Firmware Version | CPLD Version |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0000:04:00.0 | K200  | 0200210903000116 | /dev/xpu0  |     E |   100 % | 13 / 16 MB | 1178 / 8064 MB |       39 |   59 |       900 | 0001.0016.0021   |       ce3002 |
| 0000:04:00.0 | K200  | 0200210903000116 | /dev/xpu1  |     E |     0 % |  0 / 16 MB |    0 / 8064 MB |       39 |   59 |       900 | 0001.0016.0021   |       ce3002 |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  PROCESSES
------------------------------------------------------------
|   Device   |  PID  | Streams |  L3   | Memory  | Command |
------------------------------------------------------------
| /dev/xpu0  | 40040 | 1       | 13 MB | 1178 MB | python  |
------------------------------------------------------------
(paddle) [root@localhost ~]# kill 40040
-bash: kill: (40040) - No such process

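I tried both kill and fuser, but neither works: they either report that no such process exists or produce no output.
If I simply run the program again, it fails with a runtime error.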

(External) constant XDNN Error, XDNN_RUNTIME_ERROR  (at /root/Paddle/paddle/phi/kernels/xpu/full_kernel.cc:45)
  File "/root/CSSPaddleVer/dsth.py", line 29, in __init__
    nn.Conv2D(self.input_channel, 32, 5, stride=1, padding=2),
  File "/root/CSSPaddleVer/dsth.py", line 66, in <module>
    dsthModel = DSTH(h, w, c)
OSError: (External) constant XDNN Error, XDNN_RUNTIME_ERROR  (at /root/Paddle/paddle/phi/kernels/xpu/full_kernel.cc:45)
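
Because `kill` reports that the PID does not exist while xpu_smi still lists it, it can help to check from the host side whether the process is really gone. The following is a minimal sketch using only the Python standard library; `pid_exists` and `process_state` are hypothetical helper names, and `/proc` is assumed to be available (Linux). If both checks say the PID is absent, that may suggest the xpu_smi entry is stale device-side state rather than a live or zombie host process, though the thread does not confirm this.

```python
import os


def pid_exists(pid):
    """Return True if a process with this PID currently exists on the host."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # the same condition behind kill's "No such process"
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def process_state(pid):
    """Return the State field from /proc/<pid>/status, or None if absent.

    A real zombie would show up here with a state beginning with "Z".
    """
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split(":", 1)[1].strip()
    except (FileNotFoundError, ProcessLookupError):
        return None
    return None


if __name__ == "__main__":
    pid = 40040  # the PID shown by xpu_smi in the post
    print("exists:", pid_exists(pid), "state:", process_state(pid))
```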

Additional Supplementary Information

No response

eyh26e7m #1

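Hello, starting with Paddle 2.3 the first-generation Kunlun chip K200 is no longer supported. Who provided the installation package and environment support you are using? Please contact the person responsible to confirm. Thank you!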
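
The version cutoff mentioned in this reply can be checked against the installed build with a short script. This is a sketch under stated assumptions: `major_minor` and `k200_supported` are hypothetical helpers, and the (2, 3) threshold is taken from the reply above rather than from official release notes. Note that source builds may report a placeholder version string, in which case the comparison is not meaningful.

```python
import re

# Per the reply in this thread: K200 is no longer supported from Paddle 2.3.
K200_DROPPED_FROM = (2, 3)


def major_minor(version):
    """Extract (major, minor) from a version string such as '2.5.0' or '2.3.0rc0'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(m.group(1)), int(m.group(2))


def k200_supported(version):
    """True if this version predates the cutoff stated in the reply."""
    return major_minor(version) < K200_DROPPED_FROM


if __name__ == "__main__":
    try:
        import paddle
    except ImportError:
        print("paddle is not installed in this environment")
    else:
        print("paddle version:", paddle.__version__)
        print("compiled with XPU:", paddle.is_compiled_with_xpu())
        print("K200 expected to be supported:", k200_supported(paddle.__version__))
```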

anhgbhbe #2

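In reply to: "Hello, starting with Paddle 2.3 the first-generation Kunlun chip K200 is no longer supported. Who provided the installation package and environment support you are using? Please contact the person responsible to confirm. Thank you!"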

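Hello, I checked, and the Paddle I was using had indeed been compiled from the 2.5 source. So I tried compiling from the release/2.2 branch following the tutorial on the official website, but it seems to get stuck at the step that downloads the XPU-related dependencies (the exact statement is below), and entering the links manually also shows that they cannot be opened. Are there other working links? Or could you provide a usable whl package directly? I have emailed your team but have not received a reply yet.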

third_party/xpu/src/extern_xpu-stamp/extern_xpu-download: third_party/xpu/src/extern_xpu-stamp/extern_xpu-mkdir
        @$(CMAKE_COMMAND) -E cmake_echo_color --switch=$(COLOR) --blue --bold --progress-dir=/root/Paddle/build/CMakeFiles --progress-num=$(CMAKE_PROGRESS_4) "Performing download step for 'extern_xpu'"
        cd /root/Paddle/build/third_party/xpu/src/extern_xpu && wget https://baidu-kunlun-public.su.bcebos.com/paddle_depence/pack_paddle_depence.sh && bash pack_paddle_depence.sh https://baidu-kunlun-product.cdn.bcebos.com/KL-SDK/klsdk-dev/20210921/xre-kylin_aarch64.tar.gz xre-kylin_aarch64 https://baidu-kunlun-product.cdn.bcebos.com/KL-SDK/klsdk-dev/20210921/xdnn-kylin_aarch64.tar.gz xdnn-kylin_aarch64 https://baidu-kunlun-product.cdn.bcebos.com/KL-SDK/klsdk-dev/20210623/xccl-kylin_aarch64.tar.gz xccl-kylin_aarch64
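
To see which of the links in a download command like the one above actually fail, the URLs can be pulled out and probed one at a time. The sketch below uses only the Python standard library; `extract_urls` and `probe` are hypothetical helper names, and the use of a HEAD request is an assumption (some servers reject HEAD even when GET works, so a failure here is a hint, not proof).

```python
import re
import sys
import urllib.error
import urllib.request


def extract_urls(command):
    """Return every http(s) URL that appears in a shell command string."""
    return re.findall(r"https?://\S+", command)


def probe(url, timeout=10):
    """Return the HTTP status of a HEAD request, or a short error description."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError) as e:
        return f"unreachable: {e}"


if __name__ == "__main__":
    # Pass the download command as arguments, e.g.
    #     python check_urls.py "wget https://... && bash ... https://..."
    for url in extract_urls(" ".join(sys.argv[1:])):
        print(probe(url), url)
```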
