Aim server-client connection error

to94eoyn  posted 5 months ago  in  Other

🐛 Bug

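We run an Aim server on Kubernetes and track experiments from multiple virtual machines. From time to time we see the error below, after which no further metrics can be submitted to Aim, even though the experiment itself keeps running.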

E1109 19:56:44.231976010  299181 ssl_transport_security.cc:552] Corruption detected.
E1109 19:56:44.232044682  299181 ssl_transport_security.cc:528] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E1109 19:56:44.232062463  299181 secure_endpoint.cc:205]     Decryption error: TSI_DATA_CORRUPTED
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 85, in _try_exec_task
    raise e
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator(), metadata=self._request_metadata)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1699559804.232117923","description":"Error received from peer ipv4:10.91.128.8:443","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Stream removed","grpc_status":2}"

To reproduce

I don't have an MWE (minimal working example). The error appears to be random and only occurs occasionally.

Expected behavior

Tracking is reliable, with no lost metrics or experiments.

Environment

  • Aim version: 3.17.5
  • Python version: 3.10
  • pip version:
  • OS: Linux
  • Any other relevant information

pgx2nnw8 #1

Thanks for the quick reply, I'll try using that environment variable.

chy5wohz #2

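Hey @hstojic! Thanks for the report. Judging by the logs above, the problem seems to occur when SSL is used together with gRPC (we use gRPC for server tracking). I did some digging and found a few similar issues in the gRPC repository.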

  1. Set the GRPC_POLL_STRATEGY environment variable for the client
  2. Decryption error: TSI_DATA_CORRUPTED grpc/grpc#23144 (comment)
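
The first suggestion could be tried roughly as follows. This is only a sketch under stated assumptions: the thread does not say which value to use, so `poll` is picked for illustration (the gRPC documentation lists values such as `epoll1` and `poll`), and `configure_grpc_polling` is a hypothetical helper, not part of Aim or gRPC. gRPC is expected to read the variable when it initialises, so it should be set before `grpc` (and therefore `aim`) is imported in the client process.

```python
import os


def configure_grpc_polling(strategy: str = "poll") -> str:
    """Set GRPC_POLL_STRATEGY for this process and return the value in effect.

    "poll" is an assumed value for illustration; the thread does not say
    which strategy to use.
    """
    os.environ["GRPC_POLL_STRATEGY"] = strategy
    return os.environ["GRPC_POLL_STRATEGY"]


# Call this before anything imports grpc (for example, before `import aim`),
# since the variable is expected to be read when gRPC initialises.
active_strategy = configure_grpc_polling()
print(f"GRPC_POLL_STRATEGY={active_strategy}")
```

Exporting the variable in the shell that launches the training script should have the same effect and avoids any dependence on import order.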

Please let me know if these suggestions help resolve the issue. If not, I'll ask you for some more details about the setup you're running.
