Paddle [Feature] Adaptation of the new quantization method for mkldnn.

  1. For the output activations of ops, slim inserts QuantizeLinear and DequantizeLinear operations, named quantize_linear and dequantize_linear.

  2. For the trainable parameters of ops, slim first quantizes the weights and saves them in low-bit precision, and then inserts a dequantize operation before they enter the op for computation, as sketched below.
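
For concreteness, the pattern this instrumentation produces around a single conv looks roughly as follows (the op and tensor names here are illustrative, not taken from a real program):

  activation path:  conv2d -> conv_out -> quantize_linear -> conv_out_q -> dequantize_linear -> conv_out_dq -> next op
  weight path:      conv_w_int8 -> dequantize_linear -> conv_w_fp32 -> conv2d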

  1. quantize_linear
    The quantization process requires two parameters, scale and zero_point, both of which are 1-dimensional tensors.

The quantization formula is: y = saturate(round(x / scale) + zero_point).

Attributes:

  • quant_axis: INT32, optional.

For per-axis quantization, this is the axis along which quantization is performed. If this attribute is not set, the method defaults to per-layer quantization. For a convolution input [batch, channel, H, W], channel-wise quantization uses quant_axis = 1.

  • bit_length: INT32, default is 8.

The number of bits used for the quantized values; currently only quantization to signed integers is supported.

Inputs:

  • X: FP32.
  • Scale: FP32.

When the quantization method is layer-wise, Scale has size 1. When the quantization method is per-axis, the size of Scale equals the size of the input tensor along the quant_axis dimension.

  • ZeroPoint: INT32, optional.

The size is the same as that of Scale. The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.

Outputs:

  • Y: INT32.

The shape of Y is the same as that of X. The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.
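
As an illustration of the formula and the per-axis semantics above, here is a minimal NumPy sketch of quantize_linear; it is not the Paddle kernel, and the helper name and the signed saturation bounds for bit_length = 8 are assumptions made for the example:

import numpy as np

def quantize_linear(x, scale, zero_point, quant_axis=None, bit_length=8):
    # Toy reference for y = saturate(round(x / scale) + zero_point).
    qmin, qmax = -(2 ** (bit_length - 1)), 2 ** (bit_length - 1) - 1  # -128..127 for 8 bits
    if quant_axis is not None:
        # Per-axis: broadcast the 1-D scale/zero_point along quant_axis.
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = np.reshape(scale, shape)
        zero_point = np.reshape(zero_point, shape)
    y = np.round(x / scale) + zero_point
    return np.clip(y, qmin, qmax).astype(np.int32)

# Layer-wise quantization: Scale has size 1.
x = np.array([[0.5, -1.0], [2.0, 3.5]], dtype=np.float32)
print(quantize_linear(x, scale=np.array([0.1]), zero_point=np.array([0])))
# [[  5 -10]
#  [ 20  35]]

# Channel-wise quantization (quant_axis = 1) of a [batch, channel, H, W] input: one scale per channel.
x4 = np.ones((1, 2, 1, 1), dtype=np.float32)
print(quantize_linear(x4, scale=np.array([0.5, 0.25]),
                      zero_point=np.array([0, 0]), quant_axis=1).reshape(-1))
# [2 4]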

  2. dequantize_linear
    According to scale, zero_point and quant_axis, the low-precision tensor is dequantized back into a high-precision tensor.
    The dequantization formula is: y = (x - zero_point) * scale.

Attributes:

  • quant_axis: INT32, optional.

For per-axis dequantization, this is the axis along which dequantization is performed. If this attribute is not set, the method defaults to per-layer dequantization. For a convolution input [batch, channel, H, W], channel-wise dequantization uses quant_axis = 1.

  • bit_length: INT32, default is 8.

The number of bits used for the quantized values; currently only quantization to signed integers is supported.

Inputs:

  • X: INT32.

The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.

  • Scale: FP32.

When the dequantization method is layer-wise, Scale has size 1. When the dequantization method is per-axis, the size of Scale equals the size of the input tensor along the quant_axis dimension.

  • ZeroPoint: INT32, optional.

The size is the same as that of Scale. The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.

Outputs:

  • Y: FP32.

The shape of Y is the same as X.
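
A matching NumPy sketch of the dequantization formula, again illustrative rather than the real kernel; chaining it with the quantize sketch above shows the round-trip behaviour:

import numpy as np

def dequantize_linear(x, scale, zero_point, quant_axis=None):
    # Toy reference for y = (x - zero_point) * scale.
    if quant_axis is not None:
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = np.reshape(scale, shape)
        zero_point = np.reshape(zero_point, shape)
    return ((x - zero_point) * scale).astype(np.float32)

y = np.array([[5, -10], [20, 35]], dtype=np.int32)  # output of the quantize sketch above
print(dequantize_linear(y, scale=np.array([0.1]), zero_point=np.array([0])))
# [[ 0.5 -1. ]
#  [ 2.   3.5]]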

Testing model: https://paddle-inference-dist.bj.bcebos.com/temp/quantized_mobilenetv1.tar.gz

Refer to the definitions of the quantization operators in ONNX:
QuantizeLinear
DequantizeLinear

tuwxkamq 1#

Hi! We've received your issue; we will arrange for technicians to answer your questions as soon as possible, so please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look through the API docs, the FAQ, past GitHub issues and the AI community to find an answer. Have a nice day!

zujrkrfu 2#

https://github.com/onnx/onnx/blob/master/docs/Operators.md#quantizelinear

s4n0splo 3#

  1. Change save_quant_model to fit this new quant model.
  2. Then port the Python script to a C++ pass.
p5fdfcr1 4#

Hi @baoachun, I cannot download this model; the URL does not look reachable from outside. Could you please try dragging the model into the comment box?

g0czyy6m 5#

quantize_linear detected.
channel-wise quantization detected.
scale_var_name: batch_norm_8.tmp_4.scale scale_value: [0.001]
channel-wise show up in dequantize_linear op

zf9nrax1 6#

  • python->C++ refactoring PR:

#38437

  • Adapt the new quant-aware model (there is a bug when saving the graph after removing the quant-dequant ops):

https://github.com/lidanqing-intel/Paddle/tree/develop-new-quant-strategy

  • oneDNN support:

+    conv_attr.set_zero_points(DNNL_ARG_SRC, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_WEIGHTS, /* mask */ 0, {1});
  • GPU solution reference:

Paddle/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc

Line 384 in c396ee6

IR_NODE_LINK_TO(input_act, quantized_node);

yfwxisqw 8#

python -> C++ refactoring moved to a new PR: #38643

8mmmxcuj 9#

@lidanqing-intel @wozna I am wondering whether we can refer to the GPU quantization pass to design the mkldnn quantization pass. There is a lot of weight processing in the current Python script, which is not in line with the intended role of a pass, and the code is difficult to maintain. For example, this scheme involves transferring data between multiple passes; as far as I know, this is currently not supported, so I can only save the information in the graph. In addition, there are many weight operations in the pass, such as reshape and transpose, which will add a large maintenance burden later on. In view of this, I hope we can re-discuss and plan the implementation. Thanks!

cfh9epnr 10#

To be refined:
Concerns:

  1. Ask about scale propagation in the issue.
  2. Zero-point issue: it is fine, dnnl::reorder supports it.
+    conv_attr.set_zero_points(DNNL_ARG_SRC, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_WEIGHTS, /* mask */ 0, {1});
  3. GRU does not support signed int8, so we need to recalculate. Is it true that the quant model input is always signed int8? Even if it is unsigned, that will be indicated through the zero-point. What if we get some value different from 128? That would only mean the quant model is wrong, hence we should use "assert zero_point == 128" in the GRU quant model (see the sketch after this list).
  4. Those passes and their influence on the collected scales. It feels like we now have to change some passes, like squash, scale, etc. For the scale_conv fuse passes, for example, we have to recalculate the scales, so we will need int8 passes?? We will have to figure out how many passes will be affected by these changes.
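
A minimal sketch of the zero-point check from point 3, assuming the quant model feeds GRU an unsigned uint8 tensor with zero_point 128; shifting by 128 yields the equivalent signed int8 tensor at the same scale (the helper and variable names are illustrative, not Paddle APIs):

import numpy as np

def u8_to_s8(x_u8, zero_point):
    # GRU only supports signed int8, so an unsigned input has to be shifted.
    # The shift is a pure reinterpretation only if the zero point is exactly 128.
    assert zero_point == 128, "unexpected zero point, the quant model is likely wrong"
    return (x_u8.astype(np.int32) - 128).astype(np.int8)

x_u8 = np.array([0, 127, 128, 255], dtype=np.uint8)
print(u8_to_s8(x_u8, zero_point=128))
# [-128   -1    0  127]  -- the same real values at the same scale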

Steps:

  1. We do this in passes. We get the scales and zero-points and put these values directly as attributes of the target op.

  2. Second pass: remove all fake quant and fake dequant ops. (A sketch of steps 1-2 follows this list.)

  3. Run all mkldnn int8 passes. At the end of the whole process, we can propagate the scales that are propagatable: look for the typical ops (stored in a propagatable_op_list) and mark them in the graph. Then we need a dequantize op somewhere, or use force_fp32_output. The only question is how to propagate the scales. It could be done in a similar way to quant2, but it could also be different, because before we did the propagation we were saving scales for each tensor, and even reshape and transpose were handled like this. Previously we put the input and output scales on the tensor, but now where do we put the scale? The main remaining concern is how to finish the int8 pattern, i.e. how to finalize the int8 calculation and reorder back to fp32.

  4. fp32 mkldnn passes (actually int8 mkldnn passes, which means we should already account for the scale recalculation caused by op fusion).

  5. Insert quantize and dequantize around convs (cpu_quantize_pass). Before, we got tensors with scales from quant2_mkldnn_pass; now we don't use the tensor and have to reconsider: currently it should be the In_Scale attribute value on the target op. We will still need cpu_quantize_squash_pass to safely finalize all int8 calculations.
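
A minimal sketch of steps 1-2, using a simplified in-memory op list rather than Paddle's real IR graph; the dict layout and helper names are hypothetical and only illustrate the intended transformation (collect scale/zero_point from quantize_linear/dequantize_linear, attach them as attributes of the consuming op, then drop the fake ops and rewire their tensors):

def absorb_quant_dequant(ops):
    # Toy op format: {"type", "inputs": {slot: tensor}, "outputs": {slot: tensor}, "attrs": {}}.
    # For brevity, scale/zero_point live in attrs here; in the real ops they are input tensors.
    fake_ops = ("quantize_linear", "dequantize_linear")
    rewire = {}           # output tensor of a fake op -> its input tensor
    tensor_qparams = {}   # tensor name -> (scale, zero_point)

    # Step 1: collect the quant params and plan the rewiring.
    for op in ops:
        if op["type"] in fake_ops:
            rewire[op["outputs"]["Y"]] = op["inputs"]["X"]
            tensor_qparams[op["inputs"]["X"]] = (op["attrs"]["scale"], op["attrs"]["zero_point"])

    def resolve(name):
        # Follow chains such as conv_out -> quantize_linear -> dequantize_linear.
        while name in rewire:
            name = rewire[name]
        return name

    # Step 2: drop the fake ops and attach the quant params to the remaining ops.
    kept = []
    for op in ops:
        if op["type"] in fake_ops:
            continue
        for slot, tensor in op["inputs"].items():
            original = resolve(tensor)
            if original != tensor:
                op["inputs"][slot] = original
                scale, zero_point = tensor_qparams[original]
                op["attrs"][slot + "_scale"] = scale
                op["attrs"][slot + "_zero_point"] = zero_point
        kept.append(op)
    return kept

ops = [
    {"type": "conv2d", "inputs": {"Input": "x", "Filter": "w"}, "outputs": {"Output": "conv_out"}, "attrs": {}},
    {"type": "quantize_linear", "inputs": {"X": "conv_out"}, "outputs": {"Y": "conv_out_q"},
     "attrs": {"scale": 0.01, "zero_point": 0}},
    {"type": "dequantize_linear", "inputs": {"X": "conv_out_q"}, "outputs": {"Y": "conv_out_dq"},
     "attrs": {"scale": 0.01, "zero_point": 0}},
    {"type": "relu", "inputs": {"X": "conv_out_dq"}, "outputs": {"Out": "y"}, "attrs": {}},
]
for op in absorb_quant_dequant(ops):
    print(op["type"], op["inputs"], op["attrs"])
# relu now reads conv_out directly and carries X_scale = 0.01, X_zero_point = 0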

eufgjt7s 11#

Confirmed: oneDNN does support asymmetric quantization. https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html
Now adding zero-points to the Paddle conv mkldnn op:
https://github.com/oneapi-src/oneDNN/blob/master/tests/benchdnn/doc/knobs_attr.md

We should just reuse the GPU passes:

const std::vector<std::string> kTRTSubgraphPasses({
    "conv_affine_channel_fuse_pass",             //
    "adaptive_pool2d_convert_global_pass",
    "conv_eltwiseadd_affine_channel_fuse_pass",  //
    "shuffle_channel_detect_pass",               //
    "quant_conv2d_dequant_fuse_pass",            //
    "delete_quant_dequant_op_pass",              //
    "delete_quant_dequant_filter_op_pass",       //
    // "fc_fuse_pass",                           //
    "simplify_with_basic_ops_pass",              //
    "embedding_eltwise_layernorm_fuse_pass",     //
    "multihead_matmul_fuse_pass_v2",             //
    "multihead_matmul_fuse_pass_v3",             //
    "skip_layernorm_fuse_pass",                  //
    "conv_bn_fuse_pass",                         //
    "unsqueeze2_eltwise_fuse_pass",              //
    "squeeze2_matmul_fuse_pass",                 //
    "reshape2_matmul_fuse_pass",                 //
    "flatten2_matmul_fuse_pass",                 //
    "map_matmul_v2_to_mul_pass",                 //
    "map_matmul_v2_to_matmul_pass",              //
    "map_matmul_to_mul_pass",                    //
    "fc_fuse_pass",                              //
    "conv_elementwise_add_fuse_pass",            //
    "add_support_int8_pass",
    "tensorrt_subgraph_pass",                    //
    "conv_bn_fuse_pass",                         //
#if CUDNN_VERSION >= 7100  // To run conv_fusion, the version of cudnn must be
                           // guaranteed at least v7
// cudnn8.0 has memory leak problem in conv + eltwise + act, so we
// disable the pass.
#if !(CUDNN_VERSION >= 8000 && CUDNN_VERSION < 8100)
    "conv_elementwise_add_act_fuse_pass",        //
    "conv_elementwise_add2_act_fuse_pass",       //
#endif
#endif
    "transpose_flatten_concat_fuse_pass",
});
