🐛 Bug
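We run an Aim server on Kubernetes and track experiments from multiple VMs. Occasionally we see the error below, after which no more metrics can be submitted to Aim even though the experiment keeps running.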
E1109 19:56:44.231976010 299181 ssl_transport_security.cc:552] Corruption detected.
E1109 19:56:44.232044682 299181 ssl_transport_security.cc:528] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E1109 19:56:44.232062463 299181 secure_endpoint.cc:205] Decryption error: TSI_DATA_CORRUPTED
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 85, in _try_exec_task
    raise e
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator(), metadata=self._request_metadata)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1699559804.232117923","description":"Error received from peer ipv4:10.91.128.8:443","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Stream removed","grpc_status":2}"
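To reproduce
I don't have an MWE (minimal working example). The error appears to be random and only happens some of the time.
Expected behavior
Tracking is reliable, with no lost metrics or experiments.
Environment
- Aim version: 3.17.5
- Python version: 3.10
- pip version:
- OS: Linux
- Any other relevant information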
2 answers
pgx2nnw81#
Thanks for the quick reply, I'll try using that environment variable.
chy5wohz2#
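Hey, @hstojic! Thanks for the report. Judging by the logs above, something seems to go wrong when SSL is used with grpc (we use grpc for server tracking). I did some investigating and found a few similar issues in the grpc repository. One of the suggestions there involves the GRPC_POLL_STRATEGY environment variable. Let me know if these suggestions help resolve the problem. If not, I'll ask you for some details about the setup you're running.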
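For anyone landing here later, a minimal sketch of how the variable could be set from Python on the client side. The thread does not say which value was recommended, so "poll" below is only an assumption (it is one of the polling strategies gRPC accepts on Linux), and the helper name set_grpc_poll_strategy is hypothetical. Whether this actually fixes the BAD_RECORD_MAC error is not confirmed here.

```python
import os


def set_grpc_poll_strategy(strategy="poll", env=None):
    """Set GRPC_POLL_STRATEGY unless it is already defined.

    `strategy="poll"` is an assumed value, not one confirmed in this thread.
    `env` defaults to the process environment; pass a dict to test the helper.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value that was already exported in the shell.
    return env.setdefault("GRPC_POLL_STRATEGY", strategy)


# Call this before `import grpc` or `import aim`, so the variable is already
# in place by the time the gRPC core initializes.
set_grpc_poll_strategy()
```

The same effect can be had by exporting the variable in the shell or in the VM's service definition before starting the training script; the helper only matters when the environment cannot be changed from outside the process.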