Large performance gap on CPU between static-graph and dynamic-graph inference of the same model, as measured with paddle.profiler

00jrzges · posted 5 months ago · in Other

Describe the Bug

  1. PD version: develop branch
  2. platform: Intel(R) Xeon(R) Platinum 8480+
  3. model: two operators, matmul + scale (the dropout in the script below executes as a pure scale op at inference time)
  4. description: run inference on this model 500 times in dynamic-graph mode and in static-graph mode, with 5 warm-up iterations each; the performance gap between the two is large. The script and results are below.
  5. reproduction script
    5.1 steps: copy the script below and run it directly against a CPU build of the Paddle wheel
import unittest

import numpy as np

import paddle
import paddle.profiler as profiler
from paddle import fluid

# The script defaults to static-graph mode; test_dygraph below switches to
# imperative mode via fluid.dygraph.guard.
paddle.enable_static()

class TestDynAndStatic(unittest.TestCase):
    def setUp(self):
       self.places = [fluid.CPUPlace()]

    def check_static_result(self, place):
        with fluid.program_guard(fluid.Program(), fluid.Program()):
            input_x = paddle.static.data(
                name="input_x", shape=[2048, 1024], dtype="float32"
            )
            input_y = paddle.static.data(
                name="input_y", shape=[1024, 2048], dtype="float32"
            )

            acc = paddle.matmul(input_x, input_y)
            # With training=False and mode="downscale_in_infer", dropout is a
            # pure scale by (1 - p), so the graph is just matmul + scale.
            result = paddle.nn.functional.dropout(
                x=acc, p=0.1, axis=0, training=False, mode="downscale_in_infer"
            )
            x_np = np.random.random([2048, 1024]).astype("float32")
            y_np = np.random.random([1024, 2048]).astype("float32")

            exe = fluid.Executor(place)
            prof = profiler.Profiler(
                targets=[
                    profiler.ProfilerTarget.CPU,
                    # profiler.ProfilerTarget.CUSTOM_DEVICE,
                ],
                scheduler=(5, 500),  # profile steps [5, 500): skip 5 warm-ups
                on_trace_ready=profiler.export_chrome_tracing("./profiler_log"),
                timer_only=False,
            )
            prof.start()
            for i in range(500):
                fetches = exe.run(
                    fluid.default_main_program(),
                    feed={"input_x": x_np, "input_y": y_np},
                    fetch_list=[result],
                )
                prof.step()
            prof.stop()
            prof.summary()
            
    def test_static(self):
        for place in self.places:
            self.check_static_result(place=place)

    def test_dygraph(self):
        # fluid.dygraph.guard switches to imperative (dygraph) mode for the
        # body of the with block, so the same workload runs eagerly here.
        with fluid.dygraph.guard(self.places[0]):
            input_x = np.random.random([2048, 1024]).astype("float32")
            input_y = np.random.random([1024, 2048]).astype("float32")
            x = paddle.to_tensor(input_x)
            y = paddle.to_tensor(input_y)
            prof = profiler.Profiler(
                targets=[profiler.ProfilerTarget.CPU], scheduler=(5, 500)
            )
            with prof:
                for i in range(500):
                    acc = paddle.matmul(x, y)
                    result = paddle.nn.functional.dropout(
                        x=acc, p=0.1, axis=0, training=False,
                        mode="downscale_in_infer",
                    )
                    prof.step()
            prof.summary(time_unit="ms")

if __name__ == "__main__":
    unittest.main()
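
As a sanity check independent of paddle.profiler, the static-graph loop can also be timed with plain wall-clock measurement; a minimal sketch reusing the shapes above (the dygraph loop can be timed the same way with time.perf_counter):

import time

import numpy as np
import paddle

paddle.enable_static()

main = paddle.static.Program()
startup = paddle.static.Program()
with paddle.static.program_guard(main, startup):
    x = paddle.static.data("input_x", [2048, 1024], "float32")
    y = paddle.static.data("input_y", [1024, 2048], "float32")
    out = paddle.nn.functional.dropout(
        paddle.matmul(x, y), p=0.1, axis=0, training=False,
        mode="downscale_in_infer",
    )

exe = paddle.static.Executor(paddle.CPUPlace())
feed = {
    "input_x": np.random.random([2048, 1024]).astype("float32"),
    "input_y": np.random.random([1024, 2048]).astype("float32"),
}
for _ in range(5):  # warm-up, matching the issue's setup
    exe.run(main, feed=feed, fetch_list=[out])
start = time.perf_counter()
for _ in range(500):
    exe.run(main, feed=feed, fetch_list=[out])
print(f"static: {(time.perf_counter() - start) / 500 * 1000:.3f} ms/iter")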
  6. The dynamic-graph and static-graph results are below; the static graph is much slower than the dynamic graph, for both matmul and scale.
    Dygraph: [profiler summary screenshot, not reproduced here]
    Static graph: [profiler summary screenshot, not reproduced here]
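
Since the static-graph run exports a Chrome trace into ./profiler_log, the per-op durations behind those screenshots can also be pulled out of the JSON directly; a small sketch, assuming the standard Chrome trace-event format that export_chrome_tracing writes:

import glob
import json

# Pick the newest trace file written by export_chrome_tracing("./profiler_log").
path = sorted(glob.glob("./profiler_log/*.json"))[-1]
with open(path) as f:
    events = json.load(f).get("traceEvents", [])

# "X" entries are complete events in the Chrome trace format; "dur" is in us.
complete = [e for e in events if e.get("ph") == "X"]
complete.sort(key=lambda e: e.get("dur", 0), reverse=True)
for e in complete[:10]:
    print(f'{e.get("name")}: {e.get("dur")} us')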

Additional Supplementary Information

No response

ajsxfq5m #1

Hello, any comments?

zvokhttg #2

@zhanglirong1999 Hi, could you please take a look at this issue? It seems static mode does not offload the operators to oneDNN by default.
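
If kernel selection is the problem, one low-effort experiment would be to flip the oneDNN flag before building the program and re-run the static test. This is a sketch, not a confirmed fix: paddle.set_flags is a public API, but whether FLAGS_use_mkldnn routes these particular kernels through oneDNN on a given build is an assumption.

import paddle

paddle.enable_static()
# Assumption: FLAGS_use_mkldnn forces eligible CPU kernels through oneDNN.
# If the static/dynamic gap closes with this on, kernel selection (not the
# executor) is the likely culprit.
paddle.set_flags({"FLAGS_use_mkldnn": True})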

42fyovps #3

Actually, I suspect the static-graph implementation itself has some deficiency relative to the dynamic graph, because I have found that on other devices, too, the static graph is slower than the dynamic graph.

nvbavucw #4

@Zhiwei35 We have reproduced the issue on our side and will follow up as soon as possible.

dgenwo3n #5

@Zhiwei35 It looks like Paddle's static-graph executor itself limits the number of threads during execution; see: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/new_executor/interpreter/execution_config.cc#L51
Because of this, the static graph performs best on 4 cores; with more than 4 cores it is indeed slower than the dynamic graph.
A command to pin the run to 4 cores:
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0 && export KMP_BLOCKTIME=1 && export KMP_SETTINGS=1 && export OMP_NUM_THREADS=4 && numactl --membind=0 --physcpubind=0-3 python xxx.py
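
Before comparing numbers under that pinning, it is worth verifying the restriction actually took effect; a small Linux-only check using only the standard library, run at the top of the repro script:

import os

# Cores this process may run on (numactl --physcpubind restricts this set).
print("usable cores:", sorted(os.sched_getaffinity(0)))
# Thread settings that oneDNN/OpenMP will pick up.
for var in ("OMP_NUM_THREADS", "KMP_AFFINITY", "KMP_BLOCKTIME"):
    print(var, "=", os.environ.get(var))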

gcuhipw9 #6

@zhanglirong1999 Thanks! @vivienfanghuagood Is there a specific rationale behind this setting? For the model in this issue, I believe that if the thread cap were lifted, static-graph performance should match or even exceed the dynamic graph, so wouldn't this setting be not all that general?

0vvn1miw #7

(Quoting #6 above.) If the thread cap is lifted, does performance actually improve?
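
One way to answer that empirically, without rebuilding Paddle, is to sweep OMP_NUM_THREADS over the same script and compare the profiler summaries; note this only varies the math-library threads, while the executor-side cap linked in #5 is compiled in. A sketch (the file name repro.py is hypothetical):

import os
import subprocess

# Hypothetical sweep: rerun this issue's script (saved as repro.py) under
# different OMP_NUM_THREADS values and compare the printed summaries.
for nthreads in ("4", "8", "16", "32"):
    env = dict(os.environ, OMP_NUM_THREADS=nthreads)
    print(f"--- OMP_NUM_THREADS={nthreads} ---", flush=True)
    subprocess.run(["python", "repro.py"], env=env, check=False)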
