Describe the Bug
- PD version: develop branch
- platform: Intel(R) Xeon(R) Platinum 8480+
- model: two operators, matmul + scale
- Description: run inference on this model 500 times each in dynamic-graph and static-graph mode (with 5 warm-up iterations); the performance gap between the two is quite large. The script and results are below.
- Reproduction script
  Steps to reproduce: copy the script below and run it directly in an environment with the Paddle CPU whl installed.
import paddle
from paddle import fluid
import unittest
import numpy as np
import time
import paddle.profiler as profiler

paddle.enable_static()


class TestDynAndStatic(unittest.TestCase):
    def setUp(self):
        self.places = [fluid.CPUPlace()]

    def check_static_result(self, place):
        with fluid.program_guard(fluid.Program(), fluid.Program()):
            input_x = paddle.static.data(
                name="input_x", shape=[2048, 1024], dtype="float32"
            )
            input_y = paddle.static.data(
                name="input_y", shape=[1024, 2048], dtype="float32"
            )
            acc = paddle.matmul(input_x, input_y)
            # With training=False and mode="downscale_in_infer", dropout reduces to a scale op.
            result = paddle.nn.functional.dropout(
                x=acc, p=0.1, axis=0, training=False, mode="downscale_in_infer"
            )
            x_np = np.random.random([2048, 1024]).astype("float32")
            y_np = np.random.random([1024, 2048]).astype("float32")
            exe = fluid.Executor(place)
            prof = profiler.Profiler(
                targets=[
                    profiler.ProfilerTarget.CPU,
                    # profiler.ProfilerTarget.CUSTOM_DEVICE,
                ],
                scheduler=(5, 500),  # profile steps 5-500; the first 5 iterations are warm-up
                on_trace_ready=profiler.export_chrome_tracing("./profiler_log"),
                timer_only=False,
            )
            prof.start()
            for i in range(500):
                fetches = exe.run(
                    fluid.default_main_program(),
                    feed={"input_x": x_np, "input_y": y_np},
                    fetch_list=[result],
                )
                prof.step()
            prof.stop()
            prof.summary()

    def test_static(self):
        for place in self.places:
            self.check_static_result(place=place)

    # def test_dygraph(self):
    #     with fluid.dygraph.guard(self.places[0]):
    #         input_x = np.random.random([2048, 1024]).astype("float32")
    #         input_y = np.random.random([1024, 2048]).astype("float32")
    #         x = paddle.to_tensor(input_x)
    #         y = paddle.to_tensor(input_y)
    #         prof = profiler.Profiler(
    #             targets=[profiler.ProfilerTarget.CPU], scheduler=(5, 500)
    #         )
    #         with prof:
    #             for i in range(500):
    #                 # start = time.time()
    #                 acc = paddle.matmul(x, y)
    #                 result = paddle.nn.functional.dropout(
    #                     x=acc, p=0.1, axis=0, training=False, mode="downscale_in_infer"
    #                 )
    #                 # end = time.time()
    #                 # cost = f"{(end - start)*1000:.7f}"
    #                 # print(f"[inference][{i+1}/500]: start: {start:.7f} end: {end:.7f} cost: {cost:>13} ms")
    #                 prof.step()
    #         prof.summary(time_unit='ms')


if __name__ == "__main__":
    unittest.main()
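As configured above, the profiler exports a Chrome trace to ./profiler_log and prints a summary table for the run. A minimal way to launch the test (the file name test_dyn_vs_static.py is only a placeholder):

python test_dyn_vs_static.py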
- The dynamic-graph and static-graph results are below; the static graph is much slower than the dynamic graph, and both matmul and scale are much slower.
Dynamic graph:
Static graph:
Additional Supplementary Information
No response
7 answers
ajsxfq5m1#
hello, any comments?
zvokhttg2#
@zhanglirong1999 Hi, can you please take a look at this issue? It seems static mode cannot offload the operators to oneDNN by default.
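As a quick sanity check of that hypothesis, one option could be to force oneDNN kernel selection on before running the static-graph test. This is only a sketch, under the assumption that the FLAGS_use_mkldnn global flag (settable via paddle.set_flags) controls oneDNN dispatch on this execution path:

# Assumption: FLAGS_use_mkldnn enables oneDNN kernels for CPU operators in static mode.
import paddle
paddle.set_flags({"FLAGS_use_mkldnn": True})
paddle.enable_static()
# ... then build and run the program exactly as in the reproduction script above.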
42fyovps3#
Actually, I suspect the static graph implementation itself has some deficiency compared with the dynamic graph, because I have found that on other devices the static graph is also slower than the dynamic graph.
nvbavucw4#
@Zhiwei35 We have reproduced the issue on our side and will follow up as soon as possible.
dgenwo3n5#
@Zhiwei35 It looks like Paddle itself limits the number of threads during static-graph execution; see: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/new_executor/interpreter/execution_config.cc#L51
As a result, the static graph performs best when running on 4 cores; with more than 4 cores its efficiency is indeed lower than the dynamic graph's.
A reference command for pinning the run to 4 cores is:
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0 && export KMP_BLOCKTIME=1 && export KMP_SETTINGS=1 && export OMP_NUM_THREADS=4 && numactl --membind=0 --physcpubind=0-3 python xxx.py
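For reference, a minimal Python-side sketch of the same environment setup, assuming the OpenMP/oneDNN runtime picks these variables up when paddle is imported (core binding with numactl still has to be done from the shell as above):

# Hypothetical alternative: set the thread-related environment before importing paddle.
import os
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["OMP_NUM_THREADS"] = "4"

import paddle  # import only after the environment is set so the runtime sees these values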
gcuhipw96#
@zhanglirong1999 Thanks! @vivienfanghuagood Is there a particular rationale behind this setting? For the model in this issue, I think that if the num-threads limit were lifted, static-graph performance should be able to match or even surpass the dynamic graph, so might this setting be not general enough?
0vvn1miw7#
> @zhanglirong1999 Thanks! @vivienfanghuagood Is there a particular rationale behind this setting? For the model in this issue, I think that if the num-threads limit were lifted, static-graph performance should be able to match or even surpass the dynamic graph, so might this setting be not general enough?

If the thread num limit is lifted, would performance actually improve?