Large performance gap on CPU between static-graph and dynamic-graph inference of the same model, as measured with paddle.profiler

00jrzges · posted 5 months ago · in Other

Describe the Bug

  1. PD version: develop branch
  2. platform: Intel(R) Xeon(R) Platinum 8480+
  3. model: two operators, matmul + scale (the dropout in the script below executes as a pure scale op at inference time)
  4. description: run inference on this model 500 times in dynamic-graph mode and in static-graph mode, with 5 warm-up iterations each; the performance gap between the two is large. The script and results are below.
  5. reproduction script
    5.1 steps: copy the script below and run it directly against a CPU build of the Paddle wheel
import unittest

import numpy as np

import paddle
import paddle.profiler as profiler
from paddle import fluid

# The script defaults to static-graph mode; test_dygraph below switches to
# imperative mode via fluid.dygraph.guard.
paddle.enable_static()

class TestDynAndStatic(unittest.TestCase):
    def setUp(self):
       self.places = [fluid.CPUPlace()]

    def check_static_result(self, place):
        with fluid.program_guard(fluid.Program(), fluid.Program()):
            input_x = paddle.static.data(
                name="input_x", shape=[2048, 1024], dtype="float32"
            )
            input_y = paddle.static.data(
                name="input_y", shape=[1024, 2048], dtype="float32"
            )

            acc = paddle.matmul(input_x, input_y)
            # With training=False and mode="downscale_in_infer", dropout is a
            # pure scale by (1 - p), so the graph is just matmul + scale.
            result = paddle.nn.functional.dropout(
                x=acc, p=0.1, axis=0, training=False, mode="downscale_in_infer"
            )
            x_np = np.random.random([2048, 1024]).astype("float32")
            y_np = np.random.random([1024, 2048]).astype("float32")

            exe = fluid.Executor(place)
            prof = profiler.Profiler(
                targets=[
                    profiler.ProfilerTarget.CPU,
                    # profiler.ProfilerTarget.CUSTOM_DEVICE,
                ],
                scheduler=(5, 500),  # profile steps [5, 500): skip 5 warm-ups
                on_trace_ready=profiler.export_chrome_tracing("./profiler_log"),
                timer_only=False,
            )
            prof.start()
            for i in range(500):
                fetches = exe.run(
                    fluid.default_main_program(),
                    feed={"input_x": x_np, "input_y": y_np},
                    fetch_list=[result],
                )
                prof.step()
            prof.stop()
            prof.summary()
            
    def test_static(self):
        for place in self.places:
            self.check_static_result(place=place)

    def test_dygraph(self):
        # fluid.dygraph.guard switches to imperative (dygraph) mode for the
        # body of the with block, so the same workload runs eagerly here.
        with fluid.dygraph.guard(self.places[0]):
            input_x = np.random.random([2048, 1024]).astype("float32")
            input_y = np.random.random([1024, 2048]).astype("float32")
            x = paddle.to_tensor(input_x)
            y = paddle.to_tensor(input_y)
            prof = profiler.Profiler(
                targets=[profiler.ProfilerTarget.CPU], scheduler=(5, 500)
            )
            with prof:
                for i in range(500):
                    acc = paddle.matmul(x, y)
                    result = paddle.nn.functional.dropout(
                        x=acc, p=0.1, axis=0, training=False,
                        mode="downscale_in_infer",
                    )
                    prof.step()
            prof.summary(time_unit="ms")

if __name__ == "__main__":
    unittest.main()
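
As a sanity check independent of paddle.profiler, the static-graph loop can also be timed with plain wall-clock measurement; a minimal sketch reusing the shapes above (the dygraph loop can be timed the same way with time.perf_counter):

import time

import numpy as np
import paddle

paddle.enable_static()

main = paddle.static.Program()
startup = paddle.static.Program()
with paddle.static.program_guard(main, startup):
    x = paddle.static.data("input_x", [2048, 1024], "float32")
    y = paddle.static.data("input_y", [1024, 2048], "float32")
    out = paddle.nn.functional.dropout(
        paddle.matmul(x, y), p=0.1, axis=0, training=False,
        mode="downscale_in_infer",
    )

exe = paddle.static.Executor(paddle.CPUPlace())
feed = {
    "input_x": np.random.random([2048, 1024]).astype("float32"),
    "input_y": np.random.random([1024, 2048]).astype("float32"),
}
for _ in range(5):  # warm-up, matching the issue's setup
    exe.run(main, feed=feed, fetch_list=[out])
start = time.perf_counter()
for _ in range(500):
    exe.run(main, feed=feed, fetch_list=[out])
print(f"static: {(time.perf_counter() - start) / 500 * 1000:.3f} ms/iter")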
  6. The dynamic-graph and static-graph results are below; the static graph is much slower than the dynamic graph, for both matmul and scale.
    Dygraph: [profiler summary screenshot, not reproduced here]
    Static graph: [profiler summary screenshot, not reproduced here]
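
Since the static-graph run exports a Chrome trace into ./profiler_log, the per-op durations behind those screenshots can also be pulled out of the JSON directly; a small sketch, assuming the standard Chrome trace-event format that export_chrome_tracing writes:

import glob
import json

# Pick the newest trace file written by export_chrome_tracing("./profiler_log").
path = sorted(glob.glob("./profiler_log/*.json"))[-1]
with open(path) as f:
    events = json.load(f).get("traceEvents", [])

# "X" entries are complete events in the Chrome trace format; "dur" is in us.
complete = [e for e in events if e.get("ph") == "X"]
complete.sort(key=lambda e: e.get("dur", 0), reverse=True)
for e in complete[:10]:
    print(f'{e.get("name")}: {e.get("dur")} us')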

Additional Supplementary Information

No response

ajsxfq5m #1

Hello, any comments?

zvokhttg #2

@zhanglirong1999 Hi, could you please take a look at this issue? It seems static mode does not offload the operators to oneDNN by default.
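
If kernel selection is the problem, one low-effort experiment would be to flip the oneDNN flag before building the program and re-run the static test. This is a sketch, not a confirmed fix: paddle.set_flags is a public API, but whether FLAGS_use_mkldnn routes these particular kernels through oneDNN on a given build is an assumption.

import paddle

paddle.enable_static()
# Assumption: FLAGS_use_mkldnn forces eligible CPU kernels through oneDNN.
# If the static/dynamic gap closes with this on, kernel selection (not the
# executor) is the likely culprit.
paddle.set_flags({"FLAGS_use_mkldnn": True})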

42fyovps #3

Actually, I suspect the static-graph implementation itself has some deficiency relative to the dynamic graph, because I have found that on other devices, too, the static graph is slower than the dynamic graph.

nvbavucw #4

@Zhiwei35 We have reproduced the issue on our side and will follow up as soon as possible.

dgenwo3n #5

@Zhiwei35 It looks like Paddle's static-graph executor itself limits the number of threads during execution; see: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/new_executor/interpreter/execution_config.cc#L51
Because of this, the static graph performs best on 4 cores; with more than 4 cores it is indeed slower than the dynamic graph.
A command to pin the run to 4 cores:
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0 && export KMP_BLOCKTIME=1 && export KMP_SETTINGS=1 && export OMP_NUM_THREADS=4 && numactl --membind=0 --physcpubind=0-3 python xxx.py
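
Before comparing numbers under that pinning, it is worth verifying the restriction actually took effect; a small Linux-only check using only the standard library, run at the top of the repro script:

import os

# Cores this process may run on (numactl --physcpubind restricts this set).
print("usable cores:", sorted(os.sched_getaffinity(0)))
# Thread settings that oneDNN/OpenMP will pick up.
for var in ("OMP_NUM_THREADS", "KMP_AFFINITY", "KMP_BLOCKTIME"):
    print(var, "=", os.environ.get(var))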

gcuhipw9 #6

@zhanglirong1999 Thanks! @vivienfanghuagood Is there a specific rationale behind this setting? For the model in this issue, I believe that if the thread cap were lifted, static-graph performance should match or even exceed the dynamic graph, so wouldn't this setting be not all that general?

0vvn1miw #7

(Quoting #6 above.) If the thread cap is lifted, does performance actually improve?
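
One way to answer that empirically, without rebuilding Paddle, is to sweep OMP_NUM_THREADS over the same script and compare the profiler summaries; note this only varies the math-library threads, while the executor-side cap linked in #5 is compiled in. A sketch (the file name repro.py is hypothetical):

import os
import subprocess

# Hypothetical sweep: rerun this issue's script (saved as repro.py) under
# different OMP_NUM_THREADS values and compare the printed summaries.
for nthreads in ("4", "8", "16", "32"):
    env = dict(os.environ, OMP_NUM_THREADS=nthreads)
    print(f"--- OMP_NUM_THREADS={nthreads} ---", flush=True)
    subprocess.run(["python", "repro.py"], env=env, check=False)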
