Kubernetes Ray - Unable to connect to head node GCS via URL

Posted by e5njpo68 on 2023-10-17 in Kubernetes

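I'm working in a managed Kubernetes environment. We have a three-node setup (managed K8s Deployment + Service + Ingress): one head node and two worker nodes. Using the Service and Ingress configuration, I expose the container's port 8265 through the (internal) URL http://head-node-dashboard.company.internal.domain.com, and port 6379 through http://head-node-gcs.company.internal.domain.com.
When I submit a job to the dashboard URL, everything works: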

ray job submit --working-dir ./ --address='http://head-node-dashboard.company.internal.domain.com' -- python ./script.py

But when I try to connect to GCS, it fails. This happens in two ways:

  • Connecting a worker node to the head node with ray start:
$ > ray start --address='head-node-gcs.company.internal.domain.com:80'
Local node IP: 10.251.222.101
2023-03-18 06:51:17,521 WARNING utils.py:1446 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
  • Connecting to the head node via ray.init():
$ > python
>>> import ray

a) If I connect without specifying any protocol:

>>> ray.init(address='head-node-gcs.company.internal.domain.com:80')
2023-03-18 06:58:11,670 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: head-node-gcs.company.internal.domain.com:80...
2023-03-18 06:58:16,743 WARNING utils.py:1333 -- Unable to connect to GCS at head-node-gcs.company.internal.domain.com:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

b) If I connect with the http:// protocol specified:

>>> ray.init(address='http://head-node-gcs.company.internal.domain.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1230, in init
    builder = ray.client(address, _deprecation_warn_enabled=False)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 382, in client
    builder = _get_builder_from_address(address)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 350, in _get_builder_from_address
    assert "ClientBuilder" in dir(
AssertionError: Module: http does not have ClientBuilder.

c) If I connect with the ray:// protocol specified:

>>> ray.init(address='ray://head-node-gcs.company.internal.domain.com')
/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py:253: UserWarning: Ray Client connection timed out. Ensure that the Ray Client port on the head node is reachable from your local machine. See https://docs.ray.io/en/latest/cluster/ray-client.html#step-2-check-ports for more information.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 1248, in init
    ctx = builder.connect()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/client_builder.py", line 178, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client_connect.py", line 47, in connect
    conn = ray.connect(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 252, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/__init__.py", line 94, in connect
    self.client_worker = Worker(
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 139, in __init__
    self._connect_channel()
  File "/project/workSpace/venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
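
The three attempts above differ only in the scheme prefix, and the logs suggest each prefix takes a different path: no scheme goes to a GCS connection, ray:// goes to the Ray Client (the warning refers to "the Ray Client port", which is a separate port from GCS), and http:// is rejected because no matching ClientBuilder module exists. The sketch below only mirrors that observed behavior. It is not Ray's actual parsing code, and classify_ray_address is a hypothetical helper name.

```python
from typing import Tuple


def classify_ray_address(address: str) -> Tuple[str, str]:
    """Return (kind, host_and_port) for an address string passed to ray.init().

    Illustrative only: the categories mirror the three outcomes in the logs
    above, not Ray's internal implementation.
    """
    if "://" not in address:
        # No scheme: treated as a direct GCS address (host:port).
        return ("gcs", address)
    scheme, rest = address.split("://", 1)
    if scheme == "ray":
        # ray:// selects the Ray Client, which listens on its own port.
        return ("ray-client", rest)
    # Anything else (for example http) has no ClientBuilder module.
    return ("unsupported:" + scheme, rest)


for addr in (
    "head-node-gcs.company.internal.domain.com:80",
    "http://head-node-gcs.company.internal.domain.com",
    "ray://head-node-gcs.company.internal.domain.com",
):
    print(addr, "->", classify_ray_address(addr))
```

One consequence, if this reading is right: the ray:// attempt in (c) points at the GCS hostname, whereas the Ray Client would be expected behind the separate ray-client port.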

The worker-to-head-node connection is expected to work with the specified URL. It does work if I give the head node's local IP:

$ > ray start --address='10.251.222.100:6379'
Local node IP: 10.251.222.101
2023-03-18 07:20:41,943 WARNING services.py:1791 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.47gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2023-03-18 07:20:41,964 I 115596 115596] global_state_accessor.cc:356: This node has an IP address of 10.251.222.101, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

This is the behavior I expect from the ray start --address='head-node-gcs.company.internal.domain.com:80' command.

This is the relevant part of the head node's Service configuration:

ports:
  - name: ray-dashboard
    port: 8265
    targetPort: 8265
    protocol: TCP
  - name: ray-gcs
    port: 6379
    targetPort: 6379
    protocol: TCP
  - name: ray-client
    port: 10001
    targetPort: 10001
    protocol: TCP
  - name: ray-serve
    port: 8000
    targetPort: 8000
    protocol: TCP
type: ClusterIP

This is the relevant part of the head node's Ingress configuration:

spec:
  rules:
  - host: head-node-dashboard.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8265
  - host: head-node-gcs.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 6379
  - host: head-node-client.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 10001
  - host: head-node-serve.company.internal.domain.com
    http:
      paths:
      - path: /
        backend:
          serviceName: head-node-svc
          servicePort: 8000

$ > ray --version
ray, version 2.3.0

$ > python --version
Python 3.7.4

$ > uname -a
Linux head-node-659568794c-rwmpk 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 GNU/Linux
swvgeqrz  #1

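I couldn't get GCS to work through an Ingress (I got errors about a missing session_name).
In the end I got it working thanks to External DNS.
I configured the kind: Service with metadata.annotations.external-dns.alpha.kubernetes.io/hostname: [my-host], and after that I was able to use ray debug: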

(local machine) ray start --address=[my-host (on k8s)]:6379
(local machine) ray debug --address=[my-host (on k8s)]:6379
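
For readers who want to see what such an annotated Service might look like, here is a minimal sketch that builds the manifest as a Python dict and prints it as JSON. Only the annotation key, the Service name head-node-svc, and the port 6379 come from this thread. The hostname value, the selector label, and type: LoadBalancer are assumptions (the answer does not say which Service type was used), and build_service_manifest is a hypothetical helper.

```python
import json


def build_service_manifest(hostname: str, name: str = "head-node-svc") -> dict:
    """Build a Service manifest carrying the External DNS hostname annotation."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Annotation key quoted in the answer above.
                "external-dns.alpha.kubernetes.io/hostname": hostname,
            },
        },
        "spec": {
            # Assumption: the Service type is not stated in the answer.
            "type": "LoadBalancer",
            # Hypothetical label; use whatever selects your head node pod.
            "selector": {"app": "ray-head"},
            "ports": [
                {"name": "ray-gcs", "port": 6379, "targetPort": 6379, "protocol": "TCP"},
            ],
        },
    }


# "ray-head.example.internal" is a placeholder for [my-host].
manifest = build_service_manifest("ray-head.example.internal")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped to kubectl apply -f - after adjusting the placeholder values.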

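Note that I also got this:
Unable to connect to GCS at [my-host]:80. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
when running ray start .... However, after several repetitions of the same message it connected correctly, and all subsequent connections (ray stop followed by ray start ... again) were immediate.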
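
Since the first attempts failed and later ones succeeded, it may help to wait until the port is reachable before running ray start. The following is a minimal sketch using only the Python standard library. wait_for_tcp is a hypothetical helper, and a successful TCP connect only shows that the port accepts connections, not that GCS itself is healthy or that the Ray versions match.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, attempts: int = 10,
                 delay: float = 3.0, timeout: float = 2.0) -> bool:
    """Return True as soon as a TCP connection to host:port succeeds.

    Tries up to `attempts` times, sleeping `delay` seconds between failures.
    """
    for attempt in range(1, attempts + 1):
        try:
            # create_connection resolves the name and opens a TCP socket.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            if attempt < attempts:
                time.sleep(delay)
    return False


# Example usage (hostname is a placeholder for [my-host]):
#   if wait_for_tcp("ray-head.example.internal", 6379):
#       print("port reachable, try: ray start --address=<host>:6379")
```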
