Paddle: In AMP training, the eval prog uses FP32 instead of FP16

pbossiut · posted on 2023-02-04 in Other

Thank you for contributing to PaddlePaddle.
Before submitting the issue, you could search the existing GitHub issues in case a similar issue was submitted or resolved before.
If there is no solution, please make sure that this is a training issue including the following details:
System information
-PaddlePaddle version: v2.2
-CPU: None
-GPU: V100 CUDA11.4
-OS Platform: Ubuntu 20.04
-Other information: None
Note: You can get most of the information by running summary_env.py.

To Reproduce
Run RN50 static-graph training with the AMP-O1 configuration.
Print "eval_prog" and check the log.

Describe your current behavior
When testing RN50 AMP training, I found that the eval program uses FP32 instead of FP16, and the throughput of evaluation is much lower than that of training.
This may be due to some operations being pruned during eval_prog.clone(for_test=True).
As a workaround, I can manually call rewrite_program to cast ops to FP16 in the eval prog.
Logically, the train prog and the eval prog are both supposed to use mixed precision in AMP training, and this issue would appear not only in RN50 but in any AMP training with an eval (test) program.
So I think this may be a design bug.
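For context, here is a minimal sketch (my own, not the PaddleClas code; the single FC layer stands in for the ResNet50 backbone, and the API names follow Paddle 2.x static graph) of how the two programs are typically built with AMP-O1. The AMP pass rewrites the train program through the decorated optimizer, while the eval program is only a clone pruned with for_test=True, so it ends up without the FP16 cast ops:

import paddle

paddle.enable_static()
train_prog = paddle.static.Program()
startup_prog = paddle.static.Program()

with paddle.static.program_guard(train_prog, startup_prog):
    image = paddle.static.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
    label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
    # Stand-in for the ResNet50 forward pass
    feat = paddle.flatten(image, start_axis=1)
    logits = paddle.static.nn.fc(feat, size=1000)
    loss = paddle.nn.functional.cross_entropy(logits, label)

    # The eval program is a pruned clone; whether it is cloned before the AMP
    # rewrite or the cast ops are dropped by for_test=True, it stays FP32-only.
    eval_prog = train_prog.clone(for_test=True)

    # AMP-O1: the decorated optimizer inserts FP16 cast ops into train_prog only.
    optimizer = paddle.optimizer.Momentum(learning_rate=0.1, momentum=0.9)
    optimizer = paddle.static.amp.decorate(optimizer, use_dynamic_loss_scaling=True)
    optimizer.minimize(loss, startup_program=startup_prog)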

Code to reproduce the issue
Clone PaddleClas and prepare the relevant training environment.

git clone https://github.com/PaddlePaddle/PaddleClas.git

Insert code at train.py#L167 to print the programs:

print(train_prog)
print(eval_prog)

Then run training with AMP-O1 config:

bash ./ppcls/static/run_dali.sh

Then check the log. In the train prog, some variables are cast to "float16", but in the eval prog all variables are "float32", and the throughput of evaluation is much lower than that of training, which is unreasonable.
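Instead of reading through the full printed programs, a small helper (my own sketch; train_prog and eval_prog are the variables from the step above) can count the variable dtypes in each program:

def count_dtypes(prog):
    """Count how many variables of each dtype a static Program contains."""
    counts = {}
    for var in prog.list_vars():
        try:
            dtype = str(var.dtype)
        except Exception:
            # Some variable types (e.g. readers) do not expose a dtype.
            continue
        counts[dtype] = counts.get(dtype, 0) + 1
    return counts

print("train_prog dtypes:", count_dtypes(train_prog))  # a mix of FP16/FP32 is expected under AMP-O1
print("eval_prog dtypes:", count_dtypes(eval_prog))    # currently FP32 only, which is the problem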

Manually call rewrite_program at train.py#L167 to cast ops to FP16 in the eval prog:

# AMP utilities from Paddle v2.2's static-graph mixed-precision implementation
from paddle.fluid.contrib.mixed_precision.fp16_utils import rewrite_program
from paddle.fluid.contrib.mixed_precision.fp16_lists import AutoMixedPrecisionLists
# Rewrite the eval program in place, casting ops to FP16 per the default white/black lists
rewrite_program(eval_prog, amp_lists=AutoMixedPrecisionLists())

Then print the eval prog and check the log. Some variables are cast to FP16 and the throughput of evaluation becomes higher than that of training. This is closer to the result I expected.

Other info / logs
In the eval prog, all variables are "float32".

After rewriting the eval prog, some variables are cast to "float16" according to the white/black lists.

rlcwz9us 1#

Hi! We've received your issue; please be patient while we arrange technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, code to reproduce, environment & version, and error messages. You may also look through the official API docs, the FAQ, historical GitHub issues, and the AI community to find an answer. Have a nice day!

cgh8pdjw 2#

Thanks for your work, it's important to us. We will look into that bug and fix it. Please stay tuned.

yuvru6vn 3#

That's right: AMP is used to accelerate training, so it has no effect in eval mode.
AMP is a lossy acceleration method, although the loss should be very slight.
Supplement: AMP is a training acceleration method. In the training phase, some operators run in FP16. After training, Paddle still saves FP32 params to the model.
If you used AMP in the eval phase, the inference results would differ. To avoid that accuracy difference, we still use FP32 in the eval phase.
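To make the "FP32 params" point concrete, a quick check (a sketch under the same train_prog as in the repro above, using Paddle's low-level VarDesc enum) shows that under AMP-O1 the persistable parameters themselves stay FP32; only activations around white-listed ops are cast to FP16, which is why the saved model and the eval program can run in pure FP32:

from paddle.fluid import core

fp16 = core.VarDesc.VarType.FP16
fp32 = core.VarDesc.VarType.FP32
n_fp16 = sum(p.dtype == fp16 for p in train_prog.all_parameters())
n_fp32 = sum(p.dtype == fp32 for p in train_prog.all_parameters())
print(f"parameters: {n_fp32} FP32, {n_fp16} FP16")  # under AMP-O1, expect all FP32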

k4aesqcs 4#

@zhiqiu Please help take a look at a solution for this issue.

a0zr77ik 5#

This issue may cause trouble for users.
For example, in QAT training we need to add some temporary variables to the program according to the OPs. The added variables end up as float16 in the QAT train program but float32 in the QAT eval program. When the data types differ, an error is reported during execution, so we have to rename the variables and manually insert cast OPs to make it run, as sketched below.
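To illustrate that kind of manual patching, here is a hedged sketch (the variable name 'fake_quant_tmp' is hypothetical, and it only shows the mechanics of appending a cast OP, not a complete fix) of casting an FP32 temporary in the eval program to FP16 via the low-level block API:

from paddle.fluid import core

block = eval_prog.global_block()
src = block.var('fake_quant_tmp')  # hypothetical FP32 temp variable added by the QAT pass
dst = block.create_var(
    name=src.name + '.cast_fp16',
    dtype=core.VarDesc.VarType.FP16,
    shape=src.shape,
    persistable=False)
# Append a cast OP so that downstream FP16 OPs can consume the variable.
block.append_op(
    type='cast',
    inputs={'X': src},
    outputs={'Out': dst},
    attrs={'in_dtype': src.dtype, 'out_dtype': dst.dtype})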
