paddle.reciprocal is numerically unstable and produces incorrect results

Posted by mepcadol 5 months ago in Other

Describe the Bug

This is our model training code. The output of its reciprocal layer differs greatly from PyTorch's.

import paddle
import paddle.nn as nn

class Model_1715507020(nn.Layer):
    def __init__(self):
        super(Model_1715507020, self).__init__()
        self.conv1_mutated = paddle.nn.Conv2DTranspose(in_channels=1, out_channels=6, kernel_size=[5, 5], stride=[1, 1], padding=[0, 0], output_padding=[0, 0], dilation=[1, 1], groups=1, bias_attr=None)
        self.relu1 = paddle.nn.ReLU()
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0], ceil_mode=False)
        self.conv2_mutated = paddle.nn.Conv2D(in_channels=6, out_channels=16, kernel_size=[6, 8], stride=[1, 1], padding=[0, 0], dilation=[1, 1], groups=1, bias_attr=None)
        self.relu2_mutated = paddle.nn.Softsign()
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=[2, 2], stride=[2, 2], padding=[0, 0], ceil_mode=False)
        self.flatten = paddle.nn.Flatten()
        self.linear1_mutated = paddle.nn.Linear(in_features=320, out_features=120)
        self.relu3 = paddle.nn.ReLU()
        self.linear2 = paddle.nn.Linear(in_features=120, out_features=84)
        self.relu4_mutated = paddle.reciprocal
        self.tail_flatten = paddle.nn.Flatten()
        self.tail_fc = paddle.nn.Linear(in_features=84, out_features=10)

    def forward(self, input):
        conv1_output = self.conv1_mutated(input)
        relu1_output = self.relu1(conv1_output)
        maxpool1_output = self.pool1(relu1_output)
        conv2_output = self.conv2_mutated(maxpool1_output)
        relu2_output = self.relu2_mutated(conv2_output)
        maxpool2_output = self.pool2(relu2_output)
        flatten_output = self.flatten(maxpool2_output)
        fc1_output = self.linear1_mutated(flatten_output)
        relu3_output = self.relu3(fc1_output)
        fc2_output = self.linear2(relu3_output)
        relu4_output = self.relu4_mutated(fc2_output)
        tail_flatten_output = self.tail_flatten(relu4_output)
        tail_fc_output = self.tail_fc(tail_flatten_output)
        tail_fc_output = tail_fc_output
        return tail_fc_output

Output differences

fc2_output.npz 0.00019089877605438232
relu4_output.npz 2942823.25
output.npz 717287.0
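
One way to read these numbers: since d(1/x)/dx = -1/x^2, a small disagreement in fc2_output can turn into a very large one after paddle.reciprocal if some fc2_output entries are close to zero. The following is a minimal plain-Python sketch of that amplification, not code from the repository. The helper name `reciprocal_error` and the sample values of x are assumptions chosen for illustration, and `delta` is only rounded from the fc2_output difference reported above.

```python
def reciprocal_error(x, delta):
    """Absolute change in 1/x when x is perturbed by delta."""
    return abs(1.0 / (x + delta) - 1.0 / x)


# Roughly the fc2_output difference reported above (assumed perturbation size).
delta = 1.9e-4

# Hypothetical input magnitudes: the same delta matters far more near zero.
for x in (1.0, 1e-2, 1e-4):
    print(f"x={x:g}  |1/(x+delta) - 1/x| = {reciprocal_error(x, delta):.6g}")
```

If this is what happens in the model, the relu4_output gap would reflect how ill-conditioned 1/x is near zero, and not necessarily a defect in the operator itself.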

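Paddle's results are also not quite the same as those of several other frameworks.

The gradients are inconsistent with PyTorch's as well.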

tail_fc.bias: gradient mismatch, difference: 0.004240369889885187
conv1_mutated.bias: gradient mismatch, difference: 38971648.0
linear1_mutated.bias: gradient mismatch, difference: 154666512.0
linear2.weight: gradient mismatch, difference: 271438272.0
linear1_mutated.weight: gradient mismatch, difference: 41292152.0
conv1_mutated.weight: gradient mismatch, difference: 109647968.0
tail_fc.weight: gradient mismatch, difference: 2942.853759765625
conv2_mutated.bias: gradient mismatch, difference: 348457472.0
conv2_mutated.weight: gradient mismatch, difference: 223159072.0
linear2.bias: gradient mismatch, difference: 682456960.0
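
The pattern in this list is consistent with the backward rule of the reciprocal, dL/dx = -(1/x^2) * dL/dy: tail_fc sits after the reciprocal and shows the smallest gaps, while every layer before it receives a gradient scaled by 1/x^2. A minimal sketch under that assumption follows. `reciprocal_grad` and `numeric_grad` are hypothetical helpers written in plain Python, not Paddle or PyTorch APIs.

```python
def reciprocal_grad(x, upstream=1.0):
    """Analytic backward of y = 1/x: dL/dx = upstream * (-1 / x**2)."""
    return upstream * (-1.0 / (x * x))


def numeric_grad(x, eps=1e-6):
    """Central finite-difference estimate of d(1/x)/dx, used as a sanity check."""
    return (1.0 / (x + eps) - 1.0 / (x - eps)) / (2.0 * eps)


# Away from zero the analytic rule and the finite-difference estimate agree.
print(reciprocal_grad(0.5), numeric_grad(0.5))

# Near zero the local gradient grows like 1/x**2, so any upstream mismatch is scaled up.
for x in (1e-1, 1e-2, 1e-3):
    print(f"x={x:g}  d(1/x)/dx = {reciprocal_grad(x):.6g}")
```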

Reproduction code

https://github.com/PhyllisJi/MoCoDiff_Bug/tree/paddle-issue%2364606
Detailed reproduction steps are included there.

Additional Supplementary Information

Paddle version: 2.6.1

1sbrub3j #1

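Hello, and thank you for the detailed feedback. However, using the reproduction code you provided with Paddle 2.6.1, I could not reproduce the "Paddle output is inconsistent with PyTorch" error. Following the repository's README, I got the following results: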

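Regarding gradient alignment, I looked at grad_diff.py and found that the npz files it reads, which store the gradient values, are not updated when layer_diff.py and grad_diff.py are executed: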

https://github.com/PhyllisJi/MoCoDiff_Bug/blob/paddle-issue%2364606/paddle_bug/grad_diff.py#L29

So the gradient-alignment check does not seem to do its job correctly. If I have misunderstood, please correct me.
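
A quick way to test this kind of concern is to fingerprint a stored file before and after running the script that is supposed to regenerate it. The sketch below uses only the Python standard library. `file_digest` and `content_changed` are hypothetical helpers and do not come from the repository.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def content_changed(path, run):
    """Call run() and report whether the bytes stored at path changed."""
    before = file_digest(path)
    run()
    return file_digest(path) != before
```

Here `run` could be any callable that launches the script in question, for instance through `subprocess.run`. A `False` result means the bytes are identical, which can also happen when a deterministic script rewrites the same values, so it is a hint and not proof.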

tuwxkamq #2

This is the environment we used:

We obtained the following results:

Standard Output of pytorch:
 [[ -149.9006       64.48611     122.98726   ...  -122.14338
     24.066904    -16.06955  ]
 [ -971.5917      321.81796    -112.56958   ...  -145.75206
    734.03406     741.72766  ]
 [  783.29205   -1165.4011    -2521.5647    ...  2660.2996
  -2216.659     -1177.0262   ]
 ...
 [  -15.6884      272.8186      454.46573   ...  -109.99846
    606.9621      308.7502   ]
 [  -13.2395935   624.59094     343.71335   ...    77.2698
   -248.51326     -40.880722 ]
 [ -390.64597      10.669411    -34.001915  ...   -86.67963
   -251.22133     137.6695   ]]

Standard Output of paddle:
 [[ -143.9055       60.53962     118.71148   ...  -118.83249
     21.616867    -11.237774 ]
 [-1156.8036      404.81512    -164.51259   ...  -156.30447
    910.6938      843.2613   ]
 [  839.06305   -1234.182     -2680.8772    ...  2829.9927
  -2365.5557    -1257.5698   ]
 ...
 [  -21.183792    275.54163     487.86768   ...  -133.49532
    643.05023     329.28925  ]
 [   25.303818    649.6233      332.02115   ...   101.692856
   -293.76935     -12.417603 ]
 [ -394.30847      11.5594635   -33.06631   ...   -87.16174
   -252.81967     139.54375  ]]

3
relu4_output.npz 2942823.25
fc2_output.npz 0.00019089877605438232
output.npz 717287.0
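
Because the raw outputs above have magnitudes in the hundreds or thousands, a maximum absolute difference is easier to interpret next to a relative one. The sketch below computes both in plain Python for the first three values of the first row of each printed output. The helper names `max_abs_diff` and `max_rel_diff` are assumptions and are not taken from layer_diff.py, whose actual metric is not shown here.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def max_rel_diff(a, b, eps=1e-12):
    """Largest |a_i - b_i| / max(|a_i|, eps), taking a as the reference."""
    return max(abs(x - y) / max(abs(x), eps) for x, y in zip(a, b))


# First three entries of the first row of each output printed above.
pytorch_row = [-149.9006, 64.48611, 122.98726]
paddle_row = [-143.9055, 60.53962, 118.71148]

print("max abs diff:", max_abs_diff(pytorch_row, paddle_row))
print("max rel diff:", max_rel_diff(pytorch_row, paddle_row))
```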

The repository has also been updated; running layer_diff.py now updates the npz files that store the gradient values.
