TensorFlow: cannot convert explicit Q/DQ nodes with TF-TRT

qyzbxkaa · posted 6 months ago in Other

Issue Type

Bug

Have you reproduced the bug with TF nightly?

Source

binary

TensorFlow version

TF 2.10

Custom code

OS platform and distribution

Ubuntu 20.04.5 LTS

Mobile device

No response

Python version

3.8.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

CUDA 11.8

GPU model and memory

No response

Current behavior?

I am trying to convert a quantized TF model with TF-TRT, but the following issues prevent me from doing so. I have found a temporary workaround for issue #1, but for the next one I cannot find a possible solution. According to PR #52248, TensorFlow should support explicit Q/DQ models when TensorRT 8 is used.

**Issue 1**

The non-deprecated way to add quantize-dequantize nodes to a TensorFlow model is `tf.quantization.quantize_and_dequantize_v2` (defined in tensorflow/tensorflow/python/ops/array_ops.py at b6517cc). However, this API adds nodes of type `QuantizeAndDequantizeV4` (see `REGISTER_OP("QuantizeAndDequantizeV4")`, tensorflow/tensorflow/core/ops/array_ops.cc line 2916 at 6285a27), which is apparently not in the list of ops with a supported conversion in explicit precision mode: tensorflow/tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.h (lines 35 to 39 at 7103c2c) defines `kExplicitQuantizationOpNames` with the single entry `"QuantizeAndDequantizeV2"`. A possible workaround is to use the deprecated API `tf.quantization.quantize_and_dequantize`, which adds a `QuantizeAndDequantizeV2` node that is still supported.
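A minimal sketch (assuming TF 2.x) that checks which op type each API actually emits into the traced graph:

```python
# Sketch: compare the QuantizeAndDequantize* op each API emits (assumes TF 2.x).
import tensorflow as tf

@tf.function
def qdq_deprecated(x):
    # Deprecated API -> expected to emit a QuantizeAndDequantizeV2 node.
    return tf.quantization.quantize_and_dequantize(x, 0.0, 1.0)

@tf.function
def qdq_v2(x):
    # Current API -> expected to emit a QuantizeAndDequantizeV4 node.
    return tf.quantization.quantize_and_dequantize_v2(x, 0.0, 1.0)

x = tf.zeros([4], dtype=tf.float32)
for label, fn in [("quantize_and_dequantize", qdq_deprecated),
                  ("quantize_and_dequantize_v2", qdq_v2)]:
    graph_def = fn.get_concrete_function(x).graph.as_graph_def()
    qdq_ops = sorted({n.op for n in graph_def.node
                      if n.op.startswith("QuantizeAndDequantize")})
    print(label, "->", qdq_ops)
```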

**Issue 2**

After applying the workaround above, I ran into a second error (possibly due to incorrect usage on my part). To convert the explicitly quantized TF SavedModel with TensorRT, I am following the example provided in the Nvidia TF-TRT documentation. However, I get several failed TensorRT engine conversions along with some warnings, which I have attached in the log output.

Expected behavior: an explicitly quantized TensorFlow SavedModel should be convertible with TF-TRT.

Standalone code to reproduce the issue

**Custom quantized keras layer to build an example model**

```python
import tensorflow as tf
from tensorflow import keras

class CustomConv2D(keras.layers.Layer):
    def __init__(self, filters, kernel_size, name="CustomConv2d"):
        super().__init__(name=name)
        self.w = self.add_weight(
            shape=(kernel_size, kernel_size, filters, filters),
            initializer="random_normal",
            dtype="float32",
            name=self.name + "_weights",
            trainable=True
        )

    def call(self, inputs):
        # Using the deprecated quantize_and_dequantize here since quantize_and_dequantize_v2
        # is listed as an unsupported op by TF-TRT
        q_i = tf.quantization.quantize_and_dequantize(inputs, 0, 1, name=self.name + "_q_i", narrow_range=True)
        q_w = tf.quantization.quantize_and_dequantize(self.w, -1, 1, name=self.name + "q_w", narrow_range=True)
        return tf.nn.conv2d(q_i, q_w, 2, "SAME")


l = CustomConv2D(64, 3)
t = tf.random.normal((1, 224, 224, 64), dtype="float32")

model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=(224, 224, 64)))
for i in range(5):
    model.add(CustomConv2D(64, 3, name=f'custom_conv2d_{i}'))

model.save('./saved_model_qat/')
```
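To double-check which Q/DQ op type actually ends up in the SavedModel, here is a minimal sketch (assuming the model was saved to ./saved_model_qat/ as above and exports a 'serving_default' signature; the Q/DQ nodes usually sit inside nested library functions):

```python
# Sketch: count QuantizeAndDequantize* ops in the SavedModel, including nested functions.
import collections
import tensorflow as tf

loaded = tf.saved_model.load('./saved_model_qat/')
# Assumes a 'serving_default' signature was exported by model.save().
graph_def = loaded.signatures['serving_default'].graph.as_graph_def()

# The Q/DQ nodes typically live inside library functions (e.g. under
# StatefulPartitionedCall), so scan both top-level nodes and every FunctionDef.
nodes = list(graph_def.node)
for fdef in graph_def.library.function:
    nodes.extend(fdef.node_def)

counts = collections.Counter(n.op for n in nodes
                             if n.op.startswith('QuantizeAndDequantize'))
print(counts)  # expect QuantizeAndDequantizeV2 with the deprecated API, V4 with _v2
```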

**Code used for converting the saved quantized TF model using TF-TRT**

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='saved_model_qat',
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=False
)
trt_func = converter.convert()
converter.summary()

x_test = tf.ones((2, 224, 224, 64))

MAX_BATCH_SIZE = 2
def input_fn():
    batch_size = MAX_BATCH_SIZE
    x = x_test[0:batch_size, :]
    yield [x]

converter.build(input_fn=input_fn)
```

### Relevant log output

Logs generated:

1. when using `quantize_and_dequantize_v2` instead of `quantize_and_dequantize`, at the `trt_func = converter.convert()` step

```shell

INFO:tensorflow:Clearing prior device assignments in loaded saved model
INFO:tensorflow:Automatic mixed precision will be used on the whole TensorFlow Graph. This behavior can be deactivated using the environment variable: TF_TRT_EXPERIMENTAL_FEATURES=deactivate_mixed_precision.
More information can be found on: https://www.tensorflow.org/guide/mixed_precision.
2023-02-16 12:12:02.652333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.653263: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2023-02-16 12:12:02.653394: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-02-16 12:12:02.653696: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.654338: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.655044: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.655721: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.656447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:12:02.657032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20554 MB memory:  -> device: 0, name: NVIDIA A10, pci bus id: 0000:04:00.0, compute capability: 8.6
2023-02-16 12:12:02.668533: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:2359] Running auto_mixed_precision graph optimizer
2023-02-16 12:12:02.675439: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1195] Automatic Mixed Precision Grappler Pass Summary:

Total processable nodes: 46
Recognized nodes available for conversion: 11
Total nodes converted: 6
Total FP16 Cast ops used (excluding Const and Variable casts): 10
Allowlisted nodes converted: 5
Denylisted nodes blocking conversion: 0
Nodes blocked from conversion by denylisted nodes: 0

For more information regarding mixed precision training, including how to make automatic mixed precision aware of a custom op type, please see the documentation available here:
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#tfamp

2023-02-16 12:12:02.682088: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:952] 

################################################################################
TensorRT unsupported/non-converted OP Report:
        - QuantizeAndDequantizeV4 -> 10x
        - Conv2D -> 5x
        - NoOp -> 2x
        - Identity -> 1x
        - Placeholder -> 1x
--------------------------------------------------------------------------------
        - Total nonconverted OPs: 19
        - Total nonconverted OP Types: 5
For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops.
################################################################################

2023-02-16 12:12:02.682177: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:1280] The environment variable TF_TRT_MAX_ALLOWED_ENGINES=20 has no effect since there are only 0 TRT Engines with  at least minimum_segment_size=3 nodes.
2023-02-16 12:12:02.682195: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:799] Number of TensorRT candidate segments: 0

```

2. when using `quantize_and_dequantize`, at the `trt_func = converter.convert()` step

```shell
INFO:tensorflow:Clearing prior device assignments in loaded saved model
INFO:tensorflow:Automatic mixed precision will be used on the whole TensorFlow Graph. This behavior can be deactivated using the environment variable: TF_TRT_EXPERIMENTAL_FEATURES=deactivate_mixed_precision.
More information can be found on: https://www.tensorflow.org/guide/mixed_precision.
2023-02-16 12:14:27.966166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.966825: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2023-02-16 12:14:27.966960: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-02-16 12:14:27.967241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.967877: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.968493: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.969167: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.969781: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-16 12:14:27.970355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20554 MB memory:  -> device: 0, name: NVIDIA A10, pci bus id: 0000:04:00.0, compute capability: 8.6
2023-02-16 12:14:27.981918: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:2359] Running auto_mixed_precision graph optimizer
2023-02-16 12:14:27.988304: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1195] Automatic Mixed Precision Grappler Pass Summary:

Total processable nodes: 46
Recognized nodes available for conversion: 11
Total nodes converted: 6
Total FP16 Cast ops used (excluding Const and Variable casts): 10
Allowlisted nodes converted: 5
Denylisted nodes blocking conversion: 0
Nodes blocked from conversion by denylisted nodes: 0

For more information regarding mixed precision training, including how to make automatic mixed precision aware of a custom op type, please see the documentation available here:
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#tfamp

2023-02-16 12:14:27.993404: I tensorflow/compiler/tf2tensorrt/convert/trt_optimization_pass.cc:206] [TF-TRT] Using explicit QDQ mode
2023-02-16 12:14:27.994965: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_1/custom_conv2d_1q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995142: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_2/custom_conv2d_2q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995296: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_3/custom_conv2d_3q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995449: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_4/custom_conv2d_4q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995603: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_5/custom_conv2d_5q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995626: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_1/custom_conv2d_1_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995681: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_2/custom_conv2d_2_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995710: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_3/custom_conv2d_3_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995737: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_4/custom_conv2d_4_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995764: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_5/custom_conv2d_5_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:14:27.995803: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:952] 

################################################################################
TensorRT unsupported/non-converted OP Report:
        - Conv2D -> 5x
        - NoOp -> 2x
        - Identity -> 1x
        - Placeholder -> 1x
--------------------------------------------------------------------------------
        - Total nonconverted OPs: 9
        - Total nonconverted OP Types: 4
For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops.
################################################################################

2023-02-16 12:14:27.995933: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:1280] The environment variable TF_TRT_MAX_ALLOWED_ENGINES=20 has no effect since there are only 5 TRT Engines with  at least minimum_segment_size=3 nodes.
2023-02-16 12:14:27.995954: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:799] Number of TensorRT candidate segments: 5
2023-02-16 12:14:27.997440: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 0 consisting of 3 nodes by TRTEngineOp_000_000.
2023-02-16 12:14:27.997478: W tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:919] TF-TRT Warning: Cannot replace segment 1 consisting of 16 nodes by TRTEngineOp_000_001 reason: Segment has no inputs (possible constfold failure) (keeping original segment).
2023-02-16 12:14:27.997532: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 2 consisting of 3 nodes by TRTEngineOp_000_002.
2023-02-16 12:14:27.997588: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 3 consisting of 3 nodes by TRTEngineOp_000_003.
2023-02-16 12:14:27.997641: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 4 consisting of 3 nodes by TRTEngineOp_000_004.
2023-02-16 12:14:28.000505: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-16 12:14:28.002105: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-16 12:14:28.003662: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-16 12:14:28.005172: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do

```

As shown above, the Conv2D nodes are, oddly, not converted by TF-TRT.

3. the next step, `converter.build(input_fn=input_fn)`, raises further errors

```shell
2023-02-16 12:16:44.399623: I tensorflow/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8700

2023-02-16 12:16:44.493304: I tensorflow/compiler/tf2tensorrt/common/utils.cc:104] Linked TensorRT version: 8.5.1
2023-02-16 12:16:44.493380: I tensorflow/compiler/tf2tensorrt/common/utils.cc:106] Loaded TensorRT version: 8.5.1
2023-02-16 12:16:46.885047: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:83] TF-TRT Warning: DefaultLogger The NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION flag has been deprecated and has no effect. Please do not use this flag when creating the network.
2023-02-16 12:16:46.886238: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_2/custom_conv2d_2_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:16:46.985131: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:1103] TF-TRT Warning: Engine creation for TRTEngineOp_000_000 failed. The native segment will be used instead. Reason: INTERNAL: tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:217 TRT_ENSURE_OK failure:
  INTERNAL: ./tensorflow/compiler/tf2tensorrt/convert/ops/layer_utils.h:610 TRT_ENSURE failure
2023-02-16 12:16:46.985391: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:936] TF-TRT Warning: Engine retrieval for input shapes: [[2,112,112,64]] failed. Running native segment for TRTEngineOp_000_000
2023-02-16 12:16:49.308912: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:83] TF-TRT Warning: DefaultLogger The NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION flag has been deprecated and has no effect. Please do not use this flag when creating the network.
2023-02-16 12:16:49.310048: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_3/custom_conv2d_3_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:16:49.406425: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:1103] TF-TRT Warning: Engine creation for TRTEngineOp_000_002 failed. The native segment will be used instead. Reason: INTERNAL: tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:217 TRT_ENSURE_OK failure:
  INTERNAL: ./tensorflow/compiler/tf2tensorrt/convert/ops/layer_utils.h:610 TRT_ENSURE failure
2023-02-16 12:16:49.406586: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:936] TF-TRT Warning: Engine retrieval for input shapes: [[2,56,56,64]] failed. Running native segment for TRTEngineOp_000_002
2023-02-16 12:16:51.735398: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:83] TF-TRT Warning: DefaultLogger The NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION flag has been deprecated and has no effect. Please do not use this flag when creating the network.
2023-02-16 12:16:51.736541: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_4/custom_conv2d_4_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:16:51.835118: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:1103] TF-TRT Warning: Engine creation for TRTEngineOp_000_003 failed. The native segment will be used instead. Reason: INTERNAL: tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:217 TRT_ENSURE_OK failure:
  INTERNAL: ./tensorflow/compiler/tf2tensorrt/convert/ops/layer_utils.h:610 TRT_ENSURE failure
2023-02-16 12:16:51.835262: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:936] TF-TRT Warning: Engine retrieval for input shapes: [[2,28,28,64]] failed. Running native segment for TRTEngineOp_000_003
2023-02-16 12:16:54.067165: W tensorflow/compiler/tf2tensorrt/utils/trt_logger.cc:83] TF-TRT Warning: DefaultLogger The NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION flag has been deprecated and has no effect. Please do not use this flag when creating the network.
2023-02-16 12:16:54.068481: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_5/custom_conv2d_5_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-16 12:16:54.175458: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:1103] TF-TRT Warning: Engine creation for TRTEngineOp_000_004 failed. The native segment will be used instead. Reason: INTERNAL: tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:217 TRT_ENSURE_OK failure:
  INTERNAL: ./tensorflow/compiler/tf2tensorrt/convert/ops/layer_utils.h:610 TRT_ENSURE failure
2023-02-16 12:16:54.175760: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:936] TF-TRT Warning: Engine retrieval for input shapes: [[2,14,14,64]] failed. Running native segment for TRTEngineOp_000_004
2023-02-16 12:16:54.201923: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:936] TF-TRT Warning: Engine retrieval for input shapes: [[2,112,112,64]] failed. Running native segment for TRTEngineOp_000_000
```
dauxcl2d #1

Please re-run with the following definition:

export TF_TRT_SHOW_DETAILED_REPORT=1

Please copy the log output here. Thanks.
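A minimal sketch of setting the flag (assuming it only needs to be present in the process environment before the conversion runs; `export TF_TRT_SHOW_DETAILED_REPORT=1` in the shell is equivalent):

```python
# Sketch: enable TF-TRT's detailed non-conversion report before running the converter.
import os
os.environ["TF_TRT_SHOW_DETAILED_REPORT"] = "1"  # assumption: read from the environment at conversion time

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# ... then run converter.convert() / converter.build() as in the repro above ...
```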

9ceoxa92 #2

Hi,

Thanks for the quick reply. I used the flag you mentioned and collected the following logs.

When using `quantize_and_dequantize`, at the `trt_func = converter.convert()` step:

```shell

2023-02-17 04:15:02.325189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.334523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.335893: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.337785: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-17 04:15:02.338256: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.339644: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.341033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.463483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.464255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.464901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.465529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20554 MB memory:  -> device: 0, name: NVIDIA A10, pci bus id: 0000:04:00.0, compute capability: 8.6
2023-02-17 04:15:02.913897: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.914577: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2023-02-17 04:15:02.914799: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-02-17 04:15:02.915229: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.915866: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.916488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.917169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.917785: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:02.918368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20554 MB memory:  -> device: 0, name: NVIDIA A10, pci bus id: 0000:04:00.0, compute capability: 8.6
INFO:tensorflow:Clearing prior device assignments in loaded saved model
INFO:tensorflow:Automatic mixed precision will be used on the whole TensorFlow Graph. This behavior can be deactivated using the environment variable: TF_TRT_EXPERIMENTAL_FEATURES=deactivate_mixed_precision.
More information can be found on: https://www.tensorflow.org/guide/mixed_precision.
2023-02-17 04:15:03.126010: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.126716: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2023-02-17 04:15:03.126943: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-02-17 04:15:03.127281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.127933: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.128545: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.129232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.129848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-17 04:15:03.130423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20554 MB memory:  -> device: 0, name: NVIDIA A10, pci bus id: 0000:04:00.0, compute capability: 8.6
2023-02-17 04:15:03.143752: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:2359] Running auto_mixed_precision graph optimizer
2023-02-17 04:15:03.152657: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1195] Automatic Mixed Precision Grappler Pass Summary:

Total processable nodes: 46
Recognized nodes available for conversion: 11
Total nodes converted: 6
Total FP16 Cast ops used (excluding Const and Variable casts): 10
Allowlisted nodes converted: 5
Denylisted nodes blocking conversion: 0
Nodes blocked from conversion by denylisted nodes: 0

For more information regarding mixed precision training, including how to make automatic mixed precision aware of a custom op type, please see the documentation available here:
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#tfamp

2023-02-17 04:15:03.158267: I tensorflow/compiler/tf2tensorrt/convert/trt_optimization_pass.cc:206] [TF-TRT] Using explicit QDQ mode
2023-02-17 04:15:03.159885: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_1/custom_conv2d_1q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160112: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_2/custom_conv2d_2q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160292: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_3/custom_conv2d_3q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160497: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_4/custom_conv2d_4q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160682: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_5/custom_conv2d_5q1_w has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160711: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_1/custom_conv2d_1_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160782: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_2/custom_conv2d_2_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160815: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_3/custom_conv2d_3_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160846: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_4/custom_conv2d_4_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160876: W tensorflow/compiler/tf2tensorrt/convert/ops/quantization_ops.cc:146] QuantizeAndDequantizeV2: StatefulPartitionedCall/sequential/custom_conv2d_5/custom_conv2d_5_q1_i has narrow_range=true, but for TensorRT conversion, narrow_range=false is recommended.
2023-02-17 04:15:03.160929: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:952] 

################################################################################
TensorRT unsupported/non-converted OP Report:
        - Conv2D -> 5x
                - [Count: 5x] Conv2D expects kernel of dimension 4

        - NoOp -> 2x
                - [Count: 2x] Op type NoOp is not supported.

        - Identity -> 1x
                - [Count: 1x] excluded by segmenter option. Most likely an input or output node.

        - Placeholder -> 1x
                - [Count: 1x] excluded by segmenter option. Most likely an input or output node.

--------------------------------------------------------------------------------
        - Total nonconverted OPs: 9
        - Total nonconverted OP Types: 4
For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops.
################################################################################

2023-02-17 04:15:03.161089: W tensorflow/compiler/tf2tensorrt/segment/segment.cc:1280] The environment variable TF_TRT_MAX_ALLOWED_ENGINES=20 has no effect since there are only 5 TRT Engines with  at least minimum_segment_size=3 nodes.
2023-02-17 04:15:03.161110: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:799] Number of TensorRT candidate segments: 5
2023-02-17 04:15:03.162706: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 0 consisting of 3 nodes by TRTEngineOp_000_000.
2023-02-17 04:15:03.162745: W tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:919] TF-TRT Warning: Cannot replace segment 1 consisting of 16 nodes by TRTEngineOp_000_001 reason: Segment has no inputs (possible constfold failure) (keeping original segment).
2023-02-17 04:15:03.162816: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 2 consisting of 3 nodes by TRTEngineOp_000_002.
2023-02-17 04:15:03.162881: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 3 consisting of 3 nodes by TRTEngineOp_000_003.
2023-02-17 04:15:03.162950: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:916] Replaced segment 4 consisting of 3 nodes by TRTEngineOp_000_004.
2023-02-17 04:15:03.165975: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-17 04:15:03.167998: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-17 04:15:03.169708: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do
2023-02-17 04:15:03.171302: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1533] No allowlist ops found, nothing to do

```

With this flag set, the `converter.build(input_fn=input_fn)` step seems to suppress all the warnings I was getting without it.
Something else I noticed when checking the converted model summary: the converter reports the input and output dtypes as `float16`, whereas inspecting the original model before TF-TRT conversion shows the layer dtypes inferred as `float32`:

```
>>> converter.summary()
TRTEngineOP Name                 Device        # Nodes # Inputs      # Outputs     Input DTypes       Output Dtypes      Input Shapes       Output Shapes     
================================================================================================================================================================

----------------------------------------

TRTEngineOp_000_000              device:GPU:0  5       1             1             ['float16']        ['float16']        [[-1, 112, 112 ... [[-1, 112, 112 ...

        - Cast: 2x
        - Const: 2x
        - QuantizeAndDequantizeV2: 1x

----------------------------------------

TRTEngineOp_000_002              device:GPU:0  5       1             1             ['float16']        ['float16']        [[-1, 56, 56, 64]] [[-1, 56, 56, 64]]

        - Cast: 2x
        - Const: 2x
        - QuantizeAndDequantizeV2: 1x

----------------------------------------

TRTEngineOp_000_003              device:GPU:0  5       1             1             ['float16']        ['float16']        [[-1, 28, 28, 64]] [[-1, 28, 28, 64]]

        - Cast: 2x
        - Const: 2x
        - QuantizeAndDequantizeV2: 1x

----------------------------------------

TRTEngineOp_000_004              device:GPU:0  5       1             1             ['float16']        ['float16']        [[-1, 14, 14, 64]] [[-1, 14, 14, 64]]

        - Cast: 2x
        - Const: 2x
        - QuantizeAndDequantizeV2: 1x

================================================================================================================================================================
[*] Total number of TensorRT engines: 4
[*] % of OPs Converted: 41.67% [20/48]

```

Is this automatic behavior of TF-TRT? Can it be suppressed?
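Going by the INFO message in the logs above, a minimal sketch of turning off the automatic mixed precision pass (assuming TF_TRT_EXPERIMENTAL_FEATURES is read from the environment when the conversion runs):

```python
# Sketch: deactivate TF-TRT's automatic mixed precision, per the INFO message in the logs above.
import os
os.environ["TF_TRT_EXPERIMENTAL_FEATURES"] = "deactivate_mixed_precision"  # assumption: read at conversion time

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# ... re-run converter.convert() / converter.summary() and check the reported input/output dtypes ...
```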

6za6bjd0 #3

Any update on this issue? @DEKHTIARJonathan, are the logs above enough to help resolve it, or is more information needed?

xghobddn #4

Please note that TF-TRT's explicit quantize/dequantize support is still experimental and not really supported. I have just finished working through some of the problems you mention, including the unsupported Conv2D (in my case it was because it had a tensor input).

@DEKHTIARJonathan is anyone actively working on this? I would be happy to jump in and add the fixes I have already found.

x33g5p2x #5

I have opened another issue here: #60168
I think the fix I have implemented there addresses the real bug and could also help here, @codejaeger.
