Paddle: the same model training code produces significantly different results on CUDA 11.8 and CUDA 11.6

up9lanfz  posted 4 months ago in Other

Describe the Bug

This is our model training code. Its outputs on CUDA 11.6 and CUDA 11.8 differ by more than 0.3. We further compared against PyTorch: the PyTorch outputs are almost identical to those obtained with CUDA 11.6.

import torch
import torch.nn as nn

class Model_zpge5M_FqQx86Dj1nV7Njal5bquCfBcN(nn.Module):
    def __init__(self):
        super(Model_zpge5M_FqQx86Dj1nV7Njal5bquCfBcN, self).__init__()
        self.conv1_mutated = torch.nn.ConvTranspose2d(in_channels=1, out_channels=6, kernel_size=[5, 5], stride=[1, 1], padding=[0, 0], output_padding=[0, 0], dilation=[1, 1], groups=1, bias=True)
        self.relu1 = torch.nn.ReLU()
        self.pool1_mutated = torch.nn.MaxPool2d(kernel_size=[3, 1], stride=[2, 2], padding=[0, 0], dilation=1, ceil_mode=False)
        self.conv2_mutated = torch.nn.Conv2d(in_channels=6, out_channels=16, kernel_size=[5, 5], stride=[1, 1], padding=[1, 1], dilation=[1, 1], groups=1, bias=True)
        self.relu2_mutated = torch.ceil  # mutated activation: ceil in place of ReLU
        self.pool2 = torch.nn.MaxPool2d(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0], dilation=1, ceil_mode=False)
        self.flatten = torch.nn.Flatten()
        self.linear1_mutated = torch.nn.Linear(in_features=672, out_features=120)
        self.relu3_mutated = torch.round  # mutated activation: round in place of ReLU
        self.linear2_mutated = torch.nn.Linear(in_features=120, out_features=84)
        self.tail_flatten = torch.nn.Flatten()
        self.tail_fc = torch.nn.Linear(in_features=84, out_features=10)

    def forward(self, input):
        conv1_output = self.conv1_mutated(input)
        relu1_output = self.relu1(conv1_output)
        maxpool1_output = self.pool1_mutated(relu1_output)
        conv2_output = self.conv2_mutated(maxpool1_output)
        relu2_output = self.relu2_mutated(conv2_output)
        maxpool2_output = self.pool2(relu2_output)
        flatten_output = self.flatten(maxpool2_output)
        fc1_output = self.linear1_mutated(flatten_output)
        relu3_output = self.relu3_mutated(fc1_output)
        fc2_output = self.linear2_mutated(relu3_output)
        tail_flatten_output = self.tail_flatten(fc2_output)
        tail_fc_output = self.tail_fc(tail_flatten_output)

        return tail_fc_output
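
For the cross-framework comparison, a hypothetical Paddle counterpart of the same network might look like the sketch below. This is our own reconstruction, not code taken from the linked repo; note that paddle.nn.MaxPool2D takes no dilation argument:

import paddle
import paddle.nn as pnn

class PaddleModel(pnn.Layer):
    def __init__(self):
        super().__init__()
        self.conv1 = pnn.Conv2DTranspose(1, 6, kernel_size=[5, 5], stride=[1, 1],
                                         padding=[0, 0], output_padding=[0, 0])
        self.relu1 = pnn.ReLU()
        self.pool1 = pnn.MaxPool2D(kernel_size=[3, 1], stride=[2, 2], padding=[0, 0])
        self.conv2 = pnn.Conv2D(6, 16, kernel_size=[5, 5], stride=[1, 1], padding=[1, 1])
        self.pool2 = pnn.MaxPool2D(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0])
        self.flatten = pnn.Flatten()
        self.fc1 = pnn.Linear(672, 120)
        self.fc2 = pnn.Linear(120, 84)
        self.tail_fc = pnn.Linear(84, 10)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = paddle.ceil(self.conv2(x))   # mutated activation: ceil instead of ReLU
        x = self.flatten(self.pool2(x))
        x = paddle.round(self.fc1(x))    # mutated activation: round instead of ReLU
        return self.tail_fc(self.fc2(x))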

Reproduction code

https://github.com/PhyllisJi/MoCoDiff_Bug/tree/paddle-issue%2364537
It contains detailed reproduction steps.
We ran it ten times, and every run showed a large difference.
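
For reference, here is a minimal sketch of how such per-layer diffs can be computed from the dumped .npz files. The file paths and the single-array layout are our assumptions, not details taken from the linked repo:

import numpy as np

def max_abs_diff(path_a: str, path_b: str) -> float:
    # Max elementwise absolute difference between two dumped layer outputs.
    a = np.load(path_a)
    b = np.load(path_b)
    key = a.files[0]  # assumption: each .npz stores one array under the same key
    return float(np.abs(a[key].astype(np.float64) - b[key].astype(np.float64)).max())

# Hypothetical usage, comparing a Paddle dump against the PyTorch reference:
# print(max_abs_diff("paddle/fc1_output.npz", "torch/fc1_output.npz"))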

Output differences

# cuda 11.6
# W0522 14:31:24.082360  3337 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.8
# W0522 14:31:24.083253  3337 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
# W0522 14:31:24.083271  3337 gpu_resources.cc:196] WARNING: device: 0. The installed Paddle is compiled with CUDA 11.8, but CUDA runtime version in your machine is 11.6, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDA version.

relu2_output.npz 0.0
maxpool2_output.npz 0.0
fc2_output.npz 1.6689300537109375e-06
conv2_output.npz 0.0
relu3_output.npz 0.0
flatten_output.npz 0.0
output.npz 1.430511474609375e-06
fc1_output.npz 1.1920928955078125e-06
# cuda 11.8
# W0522 14:19:26.351598   496 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.2, Runtime API Version: 11.8
# W0522 14:19:26.352345   496 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.

conv2_output.npz 0.0
relu2_output.npz 0.0
maxpool2_output.npz 0.0
flatten_output.npz 0.0
fc1_output.npz 0.0008640289306640625
relu3_output.npz 1.0
fc2_output.npz 0.17127180099487305
output.npz 0.3886955976486206

Additional Supplementary Information

Paddle version: 2.6.1

dfty9e19 1#

Hello, from the results reported in this issue, relu2_output.npz matches while fc1_output.npz shows a diff. In between, the data also passes through pool2 and flatten. Could you provide a comparison of those two intermediate outputs as well, so we can pinpoint which operator's computation diverges?
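
One way to dump those two intermediate outputs is with PyTorch forward hooks; the following is a minimal sketch (the dump file names and the 1x1x28x28 input shape are our assumptions, though that shape is consistent with fc1's in_features=672):

import numpy as np
import torch

def dump_output(name: str):
    # Forward hook that saves a layer's output to <name>.npz
    def hook(module, inputs, output):
        np.savez(f"{name}.npz", output.detach().cpu().numpy())
    return hook

model = Model_zpge5M_FqQx86Dj1nV7Njal5bquCfBcN().cuda().eval()
model.pool2.register_forward_hook(dump_output("maxpool2_output"))
model.flatten.register_forward_hook(dump_output("flatten_output"))

with torch.no_grad():
    model(torch.randn(1, 1, 28, 28, device="cuda"))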

4c8rllxm 2#

Hello, from the results reported in this issue, relu2_output.npz matches while fc1_output.npz shows a diff. In between, the data also passes through pool2 and flatten. Could you provide a comparison of those two intermediate outputs as well, so we can pinpoint which operator's computation diverges?

Updated:
https://github.com/PhyllisJi/MoCoDiff_Bug/tree/paddle-issue%2364537

bvk5enib 3#

Looking at the logs, it is fc1_output that shows the diff. That is just a plain linear layer. The logs show its input is the flatten output, which is all zeros, and the Linear bias is empty, so in theory the computed output should also be zero, shouldn't it?

brqmpdu1 4#

Looking at the logs, it is fc1_output that shows the diff. That is just a plain linear layer. The logs show its input is the flatten output, which is all zeros, and the Linear bias is empty, so in theory the computed output should also be zero, shouldn't it?

The numbers we report are the per-layer output differences against a PyTorch implementation with exactly the same code; 0 means identical or extremely close. The situation is: on CUDA 11.6 the outputs match PyTorch almost exactly, but on CUDA 11.8 they differ substantially from PyTorch. Even leaving PyTorch aside and directly comparing the outputs between the two CUDA versions, there is still a huge difference.
