Issue type
Bug
Source
binary
TensorFlow version
v2.11.0-0-gd5b57ca9 2.11.0
Custom code
No
OS platform and distribution
Linux t1v-n-92ea8b2a-w-0 5.15.0-1022-gcp #29 ~20.04.1-Ubuntu SMP Sat Oct 29 18:17:56 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Mobile device
No response
Python version
3.8
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behaviour?
Running the tpu-vm-tf-2.11.0 TensorFlow runtime on a v3-8 TPU VM, I cannot run a basic jit-compiled function on the CPU. Please advise how to run a jit_compile'd function on the CPU of a TPU VM.
Standalone code to reproduce the issue
import os
os.environ["TPU_NAME"] = "local"
os.environ["TPU_LOAD_LIBRARY"] = "1"

import tensorflow as tf

tf.debugging.set_log_device_placement(True)
print("All devices: ", tf.config.list_logical_devices())

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
b = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

@tf.function(jit_compile=True)
def jit_test(a, b):
    c = tf.matmul(a, b)
    return a + b + c

with tf.device("/TPU:0"):
    print(jit_test(a, b))
    print("Success!")

with tf.device("/CPU:0"):
    print(jit_test(a, b))  # This will fail
    print("Will crash prior to getting here")
Relevant log output
2022-12-19 10:13:18.533185: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-19 10:13:18.717772: I tensorflow/core/tpu/tpu_initializer_helper.cc:275] Libtpu path is: libtpu.so
D1219 10:13:18.874059745 34863 config.cc:113] gRPC EXPERIMENT tcp_frame_size_tuning OFF (default:OFF)
D1219 10:13:18.874080489 34863 config.cc:113] gRPC EXPERIMENT tcp_read_chunks OFF (default:OFF)
D1219 10:13:18.874093652 34863 config.cc:113] gRPC EXPERIMENT tcp_rcv_lowat OFF (default:OFF)
D1219 10:13:18.874100741 34863 config.cc:113] gRPC EXPERIMENT peer_state_based_framing OFF (default:OFF)
D1219 10:13:18.874107419 34863 config.cc:113] gRPC EXPERIMENT flow_control_fixes OFF (default:OFF)
D1219 10:13:18.874114099 34863 config.cc:113] gRPC EXPERIMENT memory_pressure_controller OFF (default:OFF)
D1219 10:13:18.874121059 34863 config.cc:113] gRPC EXPERIMENT periodic_resource_quota_reclamation ON (default:ON)
D1219 10:13:18.874127645 34863 config.cc:113] gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1219 10:13:18.874134219 34863 config.cc:113] gRPC EXPERIMENT new_hpack_huffman_decoder OFF (default:OFF)
D1219 10:13:18.874140862 34863 config.cc:113] gRPC EXPERIMENT event_engine_client OFF (default:OFF)
D1219 10:13:18.874147728 34863 config.cc:113] gRPC EXPERIMENT monitoring_experiment ON (default:ON)
D1219 10:13:18.874154168 34863 config.cc:113] gRPC EXPERIMENT promise_based_client_call OFF (default:OFF)
I1219 10:13:18.874398506 34863 ev_epoll1_linux.cc:121] grpc epoll fd: 6
D1219 10:13:18.874414089 34863 ev_posix.cc:141] Using polling engine: epoll1
D1219 10:13:18.874434253 34863 dns_resolver_ares.cc:824] Using ares dns resolver
D1219 10:13:18.874733217 34863 lb_policy_registry.cc:45] registering LB policy factory for "priority_experimental"
D1219 10:13:18.874748219 34863 lb_policy_registry.cc:45] registering LB policy factory for "outlier_detection_experimental"
D1219 10:13:18.874756312 34863 lb_policy_registry.cc:45] registering LB policy factory for "weighted_target_experimental"
D1219 10:13:18.874763493 34863 lb_policy_registry.cc:45] registering LB policy factory for "pick_first"
D1219 10:13:18.874770671 34863 lb_policy_registry.cc:45] registering LB policy factory for "round_robin"
D1219 10:13:18.874783165 34863 lb_policy_registry.cc:45] registering LB policy factory for "ring_hash_experimental"
D1219 10:13:18.874810477 34863 lb_policy_registry.cc:45] registering LB policy factory for "grpclb"
D1219 10:13:18.874843143 34863 lb_policy_registry.cc:45] registering LB policy factory for "rls_experimental"
D1219 10:13:18.874864810 34863 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_manager_experimental"
D1219 10:13:18.874872835 34863 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_impl_experimental"
D1219 10:13:18.874880753 34863 lb_policy_registry.cc:45] registering LB policy factory for "cds_experimental"
D1219 10:13:18.874888414 34863 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_resolver_experimental"
D1219 10:13:18.874895665 34863 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1219 10:13:18.895383666 34863 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
2022-12-19 10:13:18.913051: I tensorflow/core/tpu/tpu_initializer_helper.cc:225] GetTpuPjrtApi not found
2022-12-19 10:13:21.766915: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-19 10:13:26.260445: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x63c6900 initialized for platform TPU (this does not guarantee that XLA will be used). Devices:
2022-12-19 10:13:26.260485: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): TPU, 2a886c8
2022-12-19 10:13:26.260499: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (1): TPU, 2a886c8
2022-12-19 10:13:26.260511: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (2): TPU, 2a886c8
2022-12-19 10:13:26.260524: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (3): TPU, 2a886c8
2022-12-19 10:13:26.260536: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (4): TPU, 2a886c8
2022-12-19 10:13:26.260549: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (5): TPU, 2a886c8
2022-12-19 10:13:26.260561: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (6): TPU, 2a886c8
2022-12-19 10:13:26.260573: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (7): TPU, 2a886c8
All devices: [LogicalDevice(name='/device:CPU:0', device_type='CPU'), LogicalDevice(name='/device:TPU_SYSTEM:0', device_type='TPU_SYSTEM'), LogicalDevice(name='/device:TPU:0', device_type='TPU'), LogicalDevice(name='/device:TPU:1', device_type='TPU'), LogicalDevice(name='/device:TPU:2', device_type='TPU'), LogicalDevice(name='/device:TPU:3', device_type='TPU'), LogicalDevice(name='/device:TPU:4', device_type='TPU'), LogicalDevice(name='/device:TPU:5', device_type='TPU'), LogicalDevice(name='/device:TPU:6', device_type='TPU'), LogicalDevice(name='/device:TPU:7', device_type='TPU')]
input: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.286675: I tensorflow/core/common_runtime/placer.cc:114] input: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0
_EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.286733: I tensorflow/core/common_runtime/placer.cc:114] _EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:CPU:0
output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.286753: I tensorflow/core/common_runtime/placer.cc:114] output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.287893: I tensorflow/core/common_runtime/eager/execute.cc:1445] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.288240: I tensorflow/core/common_runtime/eager/execute.cc:1445] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.361102: I tensorflow/core/common_runtime/eager/execute.cc:1445] Executing op __inference_jit_test_11 in device /job:localhost/replica:0/task:0/device:TPU:0
2022-12-19 10:13:26.473898: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
tf.Tensor(
[[ 32. 40. 48.]
[ 74. 91. 108.]
[116. 142. 168.]], shape=(3, 3), dtype=float32)
Success!
2022-12-19 10:13:26.477187: I tensorflow/core/common_runtime/eager/execute.cc:1445] Executing op __inference_jit_test_11 in device /job:localhost/replica:0/task:0/device:CPU:0
2022-12-19 10:13:26.478142: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:417 : NOT_FOUND: could not find registered transfer manager for platform Host -- check target linkage
Traceback (most recent call last):
File "notebooks/tpu_vm_test.py", line 28, in <module>
print(jit_test(a, b)) # This will fail
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: could not find registered transfer manager for platform Host -- check target linkage [Op:__inference_jit_test_11]
D1219 10:13:26.848936447 34863 init.cc:190] grpc_shutdown starts clean-up now
9 Answers
zvms9eto1#
@sushreebarsa,
I was able to reproduce this issue on TensorFlow v2.9, v2.11, and the nightly build. Please find an overview in the gist here.
guz6ccqo2#
Could you please provide an update on this? Thank you!
23c0lvtd3#
@tilakrayal @sushreebarsa Could you please take a look at this issue? If you are not the right assignee, could another contributor be asked to help resolve it? Thanks.
6tr1vspr4#
The code runs fine in a GPU environment, as shown in the attached gist; it also works on Colab and on a VM with a GPU, as shown in the screenshot below.
@sachinprasadhs Could you please look into this issue, since I do not have a TPU environment to reproduce it?
beq87vna5#
Please try using CentralStorageStrategy, which places the variables on the CPU instead of using TPUStrategy. This creates a CentralStorageStrategy instance that uses all visible GPUs and the CPU. Updates to variables on the replicas are aggregated before being applied to the variables. A sketch of this setup follows.
这个问题已经被自动标记为过时,因为它没有最近的活动。如果没有进一步的活动发生,它将被关闭。谢谢。
cyvaqqii7#
Closing as stale. Please reopen if you'd like to work on this further.
ycggw6v28#
Are you satisfied with the resolution of your issue?
Yes
No
eufgjt7s9#
Please reopen. The suggested solution does not work for TPU pods. Please try to resolve this on a pod (I cannot provide a Colab example for a pod). This should work with TPUStrategy.
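For context, a hedged sketch of the usual TPUStrategy setup on a TPU VM (the standard recipe from the TF distribution docs, not a fix verified in this thread; on a pod the same resolver and strategy calls apply across workers):

import tensorflow as tf

# Standard TPUStrategy bootstrap on a TPU VM ("local" resolver); this is
# the usual recipe, not a confirmed resolution of the CPU placement bug.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

@tf.function(jit_compile=True)
def jit_test(a, b):
    c = tf.matmul(a, b)
    return a + b + c

a = tf.ones((3, 3))
# Runs the jit-compiled step on each TPU replica.
print(strategy.run(jit_test, args=(a, a)))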