- For the output activations of ops, slim inserts QuantizeLinear and DequantizeLinear operations, named quantize_linear and dequantize_linear.
- For the trainable parameters of ops, slim first quantizes the weights and saves them in low bits, and then inserts a dequantize operation before the op so that the computation uses the dequantized values.
quantize_linear
The quantization process requires two parameters, scale and zero_point, both of which are 1-dimensional tensors.
The quantization formula is: y = saturate(round(x / scale) + zero_point).
*Attributes:
quant_axis: INT32, optional.
The axis along which per-axis quantization is applied. If this attribute is not set, the quantization method defaults to per-layer quantization. For a convolution input [batch, channel, H, W] with channel-wise quantization, quant_axis is 1.
bit_length: INT32, default is 8.
The number of bits used for the quantized values; currently only signed-integer quantization is supported.
*Inputs:
- X: FP32.
- Scale: FP32.
When the quantization method is layer-wise, the size of Scale is 1. When the quantization method is axis-wise, the size of Scale equals the size of the input tensor along the quant_axis dimension.
- ZeroPoint: INT32, optional.
The size is the same as Scale. The value range depends on bit_length; when bit_length is 8, the values must fall within the Int8 range.
*Outputs:
- Y: INT32.
The shape of Y is the same as X. The value range depends on bit_length; when bit_length is 8, the values must fall within the Int8 range.
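Below is a minimal NumPy sketch of this formula, not the actual slim/Paddle kernel; it assumes saturation to the signed range [-2^(bit_length-1), 2^(bit_length-1)-1] and broadcasts the scale along quant_axis for per-axis quantization:

```python
import numpy as np

def quantize_linear(x, scale, zero_point, quant_axis=None, bit_length=8):
    """Illustrative reference for y = saturate(round(x / scale) + zero_point)."""
    qmin, qmax = -(2 ** (bit_length - 1)), 2 ** (bit_length - 1) - 1
    if quant_axis is None:              # per-layer: scale/zero_point have size 1
        s, zp = scale, zero_point
    else:                               # per-axis: broadcast along quant_axis
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        s, zp = scale.reshape(shape), zero_point.reshape(shape)
    y = np.round(x / s) + zp
    return np.clip(y, qmin, qmax).astype(np.int32)

# Channel-wise example for an input of shape [batch, channel, H, W] with quant_axis=1
x = np.random.randn(1, 3, 4, 4).astype(np.float32)
scale = np.array([0.1, 0.05, 0.2], dtype=np.float32)
zero_point = np.zeros(3, dtype=np.int32)
q = quantize_linear(x, scale, zero_point, quant_axis=1)
```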
dequantize_linear
According to scale, zero_point and quant_axis, the low-precision tensor is dequantized back into a high-precision tensor.
The dequantization formula is: y = (x - zero_point) * scale.
*Attributes:
quant_axis: INT32, optional.
The axis along which per-axis dequantization is applied. If this attribute is not set, the dequantization method defaults to per-layer dequantization. For a convolution input [batch, channel, H, W] with channel-wise dequantization, quant_axis is 1.
bit_length: INT32, default is 8.
The number of bits used for the quantized values; currently only signed-integer quantization is supported.
*Inputs:
- X: INT32.
The value range depends on bit_length; when bit_length is 8, the values must fall within the Int8 range.
- Scale: FP32.
When the dequantization method is layer-wise, the size of Scale is 1. When the dequantization method is axis-wise, the size of Scale equals the size of the input tensor along the quant_axis dimension.
- ZeroPoint: INT32, optional.
The size is the same as Scale. The value range depends on bit_length; when bit_length is 8, the values must fall within the Int8 range.
*Outputs:
- Y: FP32.
The shape of Y is the same as X.
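A matching NumPy sketch of the dequantization formula, again purely illustrative rather than the actual kernel, using the same broadcasting convention for per-axis scales:

```python
import numpy as np

def dequantize_linear(x, scale, zero_point, quant_axis=None):
    """Illustrative reference for y = (x - zero_point) * scale."""
    if quant_axis is None:              # per-layer: scale/zero_point have size 1
        s, zp = scale, zero_point
    else:                               # per-axis: broadcast along quant_axis
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        s, zp = scale.reshape(shape), zero_point.reshape(shape)
    return ((x.astype(np.float32) - zp) * s).astype(np.float32)

# Per-layer example: a size-1 Scale and ZeroPoint are broadcast over the whole tensor
q = np.array([[-5, 10], [3, -2]], dtype=np.int32)
x_hat = dequantize_linear(q, np.array([0.1], dtype=np.float32), np.array([0], dtype=np.int32))
```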
Testing model: https://paddle-inference-dist.bj.bcebos.com/temp/quantized_mobilenetv1.tar.gz
Refer to the definitions of the quantization operators in ONNX:
- QuantizeLinear
- DequantizeLinear
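For illustration only, here is a hedged sketch (using onnx.helper, not the actual Paddle2ONNX/slim export code) of the node layout this scheme implies: the weight is stored as a pre-quantized int8 initializer followed by a DequantizeLinear, while the output activation gets a QuantizeLinear/DequantizeLinear pair; all tensor names are placeholders.

```python
import numpy as np
from onnx import helper, numpy_helper

# Pre-quantized int8 conv weight stored as an initializer (per-output-channel, axis=0)
w_q = numpy_helper.from_array(np.zeros((8, 3, 3, 3), dtype=np.int8), name="conv_w_q")
w_scale = numpy_helper.from_array(np.full(8, 0.02, dtype=np.float32), name="conv_w_scale")
w_zp = numpy_helper.from_array(np.zeros(8, dtype=np.int8), name="conv_w_zp")

nodes = [
    # Dequantize the stored low-bit weight right before the conv consumes it
    helper.make_node("DequantizeLinear", ["conv_w_q", "conv_w_scale", "conv_w_zp"],
                     ["conv_w_fp32"], axis=0),
    # Quantize/dequantize pair inserted on the output-activation path
    helper.make_node("QuantizeLinear", ["act", "act_scale", "act_zp"], ["act_q"]),
    helper.make_node("DequantizeLinear", ["act_q", "act_scale", "act_zp"], ["act_dq"]),
    helper.make_node("Conv", ["act_dq", "conv_w_fp32"], ["conv_out"]),
]
```

Assembling these nodes into a full ModelProto would additionally need graph inputs/outputs and the activation scale/zero-point initializers; the fragment only shows where the quantize/dequantize ops sit.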
11 answers
tuwxkamq1#
Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to describe your request. You may also check out the API docs, FAQ, GitHub Issues and the AI community for an answer. Have a nice day!
zujrkrfu2#
https://github.com/onnx/onnx/blob/master/docs/Operators.md#quantizelinear
s4n0splo3#
p5fdfcr14#
Hi, @baoachun I cannot download this model; the URL does not look reachable from outside. Could you please try dragging the model into the comment box?
g0czyy6m5#
quantize_linear detected.
channel-wise quantization detected.
scale_var_name: batch_norm_8.tmp_4.scale scale_value: [0.001]
channel-wise shows up in the dequantize_linear op
zf9nrax16#
#38437
https://github.com/lidanqing-intel/Paddle/tree/develop-new-quant-strategy
oneDNN support
Paddle/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc
Line 384 in c396ee6
IR_NODE_LINK_TO(input_act, quantized_node);
mo49yndu7#
@wozna
yfwxisqw8#
Python->C++ move to a new PR: #38643
8mmmxcuj9#
@lidanqing-intel @wozna I am wondering whether we can follow the GPU quantization passes when designing the mkldnn quantization pass scheme. The current Python script does a lot of weight processing, which is not in line with the intended role of a pass, and the code is difficult to maintain. For example, this pass involves transferring data between multiple passes; as far as I know, that is currently not supported, so the information can only be saved in the graph. In addition, the pass performs many weight operations, such as reshape and transpose, which will add a huge maintenance workload later on. In view of this, I hope we can re-discuss and plan the implementation. Thanks!
cfh9epnr10#
To be refined:
Concerns:
Steps:
- First pass: we do this in passes; we read the scales and zero points and store these values directly as attributes of this op.
- Second pass: remove all fake quant and fake dequant ops.
- Run all mkldnn int8 passes. At the end of the whole process, we can propagate the scales that are propagatable (see the sketch after this list): look for the typical ops (stored in propagatable op lists) and decide how to mark them in the graph. Then we need a dequantize op somewhere, or use force_fp32_output. The open question is how to propagate the scales; it could be done in a similar way to quant2, but it could also differ, because previously, by the time the propagation was done, scales had been saved for each tensor. Reshape and transpose will also be handled like this. Before, we put the input and output scales in the tensor, but now where do we put the scale? The remaining concern is how to finish the int8 pattern: how to finalize the int8 calculation and reorder back to fp32.
- fp32 mkldnn passes (actually int8 mkldnn passes, which means we should already consider scale recalculation caused by op fusion).
- Insert quantize/dequantize around convs, cpu_quantize_pass. (Before, we got tensors with scales from quant2_mkldnn_pass; now we don't use tensors and have to reconsider; currently the scale should be in the target op's In_Scale attribute value.) We will still need cpu_quantize_squash_pass to safely finalize all int8 calculations.
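A minimal, hypothetical sketch of the scale-propagation idea above, using a plain dict-based graph rather than Paddle's IR; the op names, the propagatable op list and the data layout are illustrative assumptions, not the actual pass code:

```python
# Hypothetical illustration: propagate quantization scales through scale-preserving
# ("propagatable") ops such as reshape/transpose/pool, in both directions.
PROPAGATABLE_OPS = {"reshape2", "transpose2", "pool2d"}  # assumed, illustrative list

def propagate_scales(ops, tensor_scales):
    """ops: list of {"type", "inputs", "outputs"} dicts; tensor_scales: {tensor_name: scale}."""
    changed = True
    while changed:                       # iterate until no new scale can be assigned
        changed = False
        for op in ops:
            if op["type"] not in PROPAGATABLE_OPS:
                continue
            in_t, out_t = op["inputs"][0], op["outputs"][0]
            if in_t in tensor_scales and out_t not in tensor_scales:
                tensor_scales[out_t] = tensor_scales[in_t]   # forward propagation
                changed = True
            elif out_t in tensor_scales and in_t not in tensor_scales:
                tensor_scales[in_t] = tensor_scales[out_t]   # backward propagation
                changed = True
    return tensor_scales

# Example: a conv output scale flows through a transpose to the next op's input
ops = [{"type": "transpose2", "inputs": ["conv_out"], "outputs": ["transpose_out"]}]
print(propagate_scales(ops, {"conv_out": 0.05}))  # {'conv_out': 0.05, 'transpose_out': 0.05}
```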
eufgjt7s11#
Confirmed: oneDNN does support asymmetric quantization. https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html
Now adding zero points to the Paddle conv mkldnn op.
https://github.com/oneapi-src/oneDNN/blob/master/tests/benchdnn/doc/knobs_attr.md
We should just reuse the GPU passes: