Paddle [Paper Reproduction] GPU memory usage keeps increasing during training

ev7lccsx  posted on 2021-11-30

I call paddle.nn.functional.conv2d() inside the loss function with a custom Laplacian kernel; this convolution does not take part in parameter updates. With this setup, GPU memory usage keeps increasing as training progresses, until no memory is left and training aborts.

qhhrdooz 1#

Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version details, and error messages. You may also consult the official API documentation, the FAQ, historical GitHub issues, and the AI community for answers. Have a nice day!

9ceoxa92 2#

If necessary, I can provide a link to an AI Studio project.

bejyjqdl 3#

Could you provide the reproduction code? Initial guess: you may need to add no_grad.

webghufk 4#

GPU memory usage keeps growing whether or not no_grad is added:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class DetailAggregateLoss(nn.Layer):
    def __init__(self):
        """
        TODO: this Paddle implementation causes GPU memory usage to grow
        continuously during training.
        """
        super(DetailAggregateLoss, self).__init__()

        # Fixed 3x3 Laplacian kernel; it is not a learnable parameter.
        self.laplacian_kernel = paddle.to_tensor(
            [-1, -1, -1, -1, 8, -1, -1, -1, -1],
            dtype=paddle.float32).reshape((1, 1, 3, 3))

        # self.laplacian_kernel = np.array([-1, -1, -1, -1, 8, -1, -1, -1, -1], dtype=np.float32).reshape((3, 3))

        # Fixed fusion weights for combining the multi-scale boundary maps.
        self.fuse_kernel = paddle.to_tensor([[6. / 10], [3. / 10], [1. / 10]],
                                            dtype=paddle.float32).reshape((1, 3, 1, 1))
        # self.fuse_kernel = np.array([[6./10], [3./10], [1./10]], dtype=np.float32).reshape((1, 3, 1, 1))

    def forward(self, boundary_logits, gtmasks):
        # Boundary targets are derived from ground-truth masks, so no
        # gradients are needed here.
        with paddle.no_grad():
            boundary_targets = F.conv2d(gtmasks.unsqueeze(1).astype(paddle.float32),
                                        self.laplacian_kernel, padding=1)
            boundary_targets = boundary_targets.clip(min=0)

            # Binarize: any response above 0.1 is treated as boundary.
            boundary_targets[boundary_targets > 0.1] = 1
            boundary_targets[boundary_targets <= 0.1] = 0

            # Boundary targets at 1/2, 1/4, and 1/8 resolution.
            boundary_targets_x2 = F.conv2d(gtmasks.unsqueeze(1).astype(paddle.float32), self.laplacian_kernel,
                                           stride=2, padding=1).clip(min=0)
            boundary_targets_x4 = F.conv2d(gtmasks.unsqueeze(1).astype(paddle.float32), self.laplacian_kernel,
                                           stride=4, padding=1).clip(min=0)
            boundary_targets_x8 = F.conv2d(gtmasks.unsqueeze(1).astype(paddle.float32), self.laplacian_kernel,
                                           stride=8, padding=1).clip(min=0)

        ...

col17t5w 5#

It is mainly F.conv2d(); if I remove it, GPU memory no longer grows.

pgvzfuti 6#

The problem occurs both locally and on AI Studio; the paddlepaddle version is 2.1.0.

k2fxgqgv 7#

How do you write it with F.conv2d() removed?

chhkpiq4 8#

I converted the tensors to NumPy arrays, performed the convolution with cv2's filtering routine, and converted the result back to a tensor.
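
A minimal sketch of what that cv2-based workaround might look like, assuming the targets are built from (N, H, W) ground-truth masks; the helper name boundary_via_cv2 and the per-sample loop are illustrative, since the exact code was not posted. Note that cv2.filter2D computes correlation rather than convolution, but the two coincide here because the Laplacian kernel is symmetric.

import cv2
import numpy as np
import paddle

laplacian_kernel = np.array(
    [-1, -1, -1, -1, 8, -1, -1, -1, -1], dtype=np.float32).reshape((3, 3))

def boundary_via_cv2(gtmasks):
    # gtmasks: (N, H, W) paddle tensor; filter each sample on the CPU.
    outs = []
    for gtmask_np in gtmasks.numpy().astype(np.float32):
        # Zero border matches conv2d with padding=1 and zero padding.
        out = cv2.filter2D(gtmask_np, -1, laplacian_kernel,
                           borderType=cv2.BORDER_CONSTANT)
        outs.append(np.clip(out, 0, None))
    # Stack, restore the channel dim, and return to a paddle tensor.
    return paddle.to_tensor(np.stack(outs)[:, None, :, :])

This bypasses F.conv2d entirely, at the cost of a GPU-to-CPU round trip per step.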

ocebsuys 9#

Initial analysis: the memory leak is triggered by writes of the form boundary_targets[boundary_targets > 0.1] = 1, i.e. in-place assignment through a boolean tensor mask. A PR with a fix has been submitted; see #35013.
Temporary workaround: change boundary_targets[boundary_targets > 0.1] = 1 to boundary_targets[boundary_targets.numpy() > 0.1] = 1.
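
Applied to the forward() above, the workaround would look like the sketch below. Option 2 is an alternative not suggested in this thread: it expresses the same thresholding with paddle.where, avoiding in-place boolean-mask assignment altogether.

# Option 1 (workaround from this thread): index with a NumPy boolean
# mask instead of a paddle tensor mask, sidestepping the leaking path.
mask = boundary_targets.numpy() > 0.1
boundary_targets[mask] = 1
boundary_targets[~mask] = 0

# Option 2 (alternative, not from this thread): no in-place writes.
boundary_targets = paddle.where(
    boundary_targets > 0.1,
    paddle.ones_like(boundary_targets),
    paddle.zeros_like(boundary_targets))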
