Deploying Ray on a custom cluster using Docker containers

2izufjch, posted on 2021-06-09 in Redis

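This question is taken directly from an issue I opened on the Ray repository; I am also posting it here in the hope of getting more exposure.
I have seen similar questions in past issues, related to older versions of Ray and to similar problems. They did not offer a clear setup or a clear solution, though, only the occasional hack along the lines of "it works if you add this flag". So I decided to post this question and explain every small step of how I set up Ray, making the Dockerfiles, the exact commands I ran, and the output I received all available, together with the hints I managed to collect.
I hope this makes for a question worth asking, so here it goes.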

What is the problem?

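Even though all the cluster nodes are available in the dashboard and show no errors, Ray-related Python code executed on the head node only ever uses the head node, while on the worker nodes it starts printing: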

WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

Ray version and other system information (Python version, TensorFlow version, OS): 3.6.5, None (not installed at the moment), Ubuntu 18.04

Reproduction

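As the title says, I am trying to set up Ray on a custom cluster using Docker containers. The idea is to get my feet wet on a small cluster and then, once I have learned how to use the library, deploy it on a SLURM cluster (I have already seen a small tutorial about that).
My small setup is detailed in a repository I created for this purpose: basically, it uses the Docker images as provided by the tutorial in the documentation and then installs other tools, such as byobu, mainly for debugging purposes.
After building the server Dockerfile, I launch the container as follows: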

docker run --shm-size=16GB -t --tty --interactive --network host experimenting_on_ray_server

From within the container I then start Ray with:

ray start --head

This outputs:

2020-04-15 20:08:05,148 INFO scripts.py:357 -- Using IP address xxx.xxx.xxx.xxx for this node.
2020-04-15 20:08:05,151 INFO resource_spec.py:212 -- Starting Ray with 122.61 GiB memory available for workers and up to 56.56 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-15 20:08:05,629 INFO services.py:1148 -- View the Ray dashboard at localhost:8265
2020-04-15 20:08:05,633 WARNING services.py:1470 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 17179869184 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-04-15 20:08:05,669 INFO scripts.py:387 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

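Here xxx.xxx.xxx.xxx is the public IP of this machine, since the Docker container was started with the `--network` option. I do not understand why the warning about `/dev/shm` appears, because the Ray Docker tutorial in the documentation says "Replace <shm-size> with a limit appropriate for your system, for example 512M or 2G", and here I am using 16GB. How much would be enough?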
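The numbers in the warning can at least be compared with a little arithmetic: 17179869184 bytes is exactly 16 GiB, which is less than the 56.56 GiB that Ray reports it may use for objects. The following is a minimal sketch of that comparison. The helper names `bytes_to_gib` and `shm_is_large_enough` are hypothetical, and the rule that the warning fires whenever `/dev/shm` is smaller than the object store is my assumption from reading the log, not something I have confirmed in the Ray source.

```python
import shutil

GIB = 2 ** 30  # number of bytes in one gibibyte


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def shm_is_large_enough(shm_bytes, object_store_gib):
    """Return True if shared memory covers the requested object store.

    Assumption (not confirmed from the Ray source): the warning is
    emitted when /dev/shm is smaller than the object store size that
    Ray reports at startup.
    """
    return bytes_to_gib(shm_bytes) >= object_store_gib


if __name__ == "__main__":
    # Values copied from the log output above.
    print(bytes_to_gib(17179869184))                # 16.0
    print(shm_is_large_enough(17179869184, 56.56))  # False
    # Free shared memory visible from inside this container (Linux only).
    try:
        print(bytes_to_gib(shutil.disk_usage("/dev/shm").free))
    except FileNotFoundError:
        print("/dev/shm not found on this system")
```

If that assumption holds, the options would be either to raise `--shm-size` above the object store size or to lower the object store through the `object_store_memory` setting that the log message itself mentions.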
At this point, through SSH port forwarding, I can see that the dashboard is online and shows the following (screenshot):

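Since all of this looks nominal, I go on to build the client Dockerfile, which at this point is to all intents and purposes identical to the server one. I then start it by running: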

docker run --shm-size=16GB -t --tty --interactive --network host experimenting_on_ray_client

Now I can run the command provided by the head node to attach another node to the cluster. So I execute:

ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'

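Again, xxx.xxx.xxx.xxx is the public IP of the machine where I am running the head Docker container with the `--network` flag.
This command seems to run successfully: if I now go to the dashboard, I can see the second node available. Here xxx.xxx.xxx.xxx is the IP of the head node and yyy.yyy.yyy.yyy is the IP of the worker node.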

Finally, I can try to execute some Ray code! So I take the code provided in the documentation and run the following on the head node, in an interactive Python session:

import ray
ray.init(address='auto', redis_password='5241590000000000')

import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.

set(ray.get([f.remote() for _ in range(1000)]))

The output is:

{'xxx.xxx.xxx.xxx'}

But, as far as I understand, we were expecting:

{'xxx.xxx.xxx.xxx', 'yyy.yyy.yyy.yyy'}
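Since `set(...)` only says which nodes answered, a count per node makes the imbalance easier to see. This is a small sketch using only the standard library; `tasks_per_node` is a hypothetical helper, and the `results` list stands in for the value returned by `ray.get([f.remote() for _ in range(1000)])`.

```python
from collections import Counter


def tasks_per_node(ip_addresses):
    """Count how many task results came back from each node IP."""
    return Counter(ip_addresses)


if __name__ == "__main__":
    # Stand-in for ray.get([f.remote() for _ in range(1000)]),
    # using the placeholder head IP from this post.
    results = ["xxx.xxx.xxx.xxx"] * 1000
    counts = tasks_per_node(results)
    print(dict(counts))  # {'xxx.xxx.xxx.xxx': 1000}
    print(len(counts))   # 1 node did all the work
```

In a working two-node cluster I would expect two keys here, one per node IP.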

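If I run exactly the same code on the worker node, I get a completely different output (or rather, a lack of any output). After executing the first two lines: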

import ray
ray.init(address='auto', redis_password='5241590000000000')

I get:

2020-04-15 20:29:53,481 WARNING worker.py:785 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
2020-04-15 20:29:53,486 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:54,491 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:55,496 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:56,500 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:57,505 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ray/python/ray/worker.py", line 802, in init
    connect_only=True)
  File "/ray/python/ray/node.py", line 126, in __init__
    redis_password=self.redis_password)
  File "/ray/python/ray/services.py", line 204, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/ray/python/ray/services.py", line 187, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
RuntimeError: Redis has started but no raylets have registered yet.

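The dashboard gives no additional information; everything there looks normal. I have tested the reproducibility of the issue several times, hoping each time that I had simply misconfigured the local network or the two Docker images. The two Docker containers run on two different machines in the same local network, that is, with IPs that look like same.same.same.different.
I have also tried to reproduce the error by running the two containers on the same machine. The issue appears in that setup as well.
What other information can I provide that would help?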

Update 1: found a new relevant file.

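While looking at the raylet error log file at /tmp/ray/session_latest/logs/raylet.err, which was empty on both the server and the client, both before and after executing the Python code, I noticed another error log that might be of interest for the current issue.
The file is located at /tmp/raylet.595a989643d2.invalid-user.log.WARNING.20200416-181435.22 and contains the following: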

Log file created at: 2020/04/16 18:14:35
Running on machine: 595a989643d2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0416 18:14:35.525002    22 node_manager.cc:574] Received NodeRemoved callback for an unknown client a4e873ae0f72e58105e16c664b3acdda83f80553.

Update 2: the raylet.out files are not empty

Even though the raylet.err files are empty on both the client and the server, the raylet.out files are not. Here is their content.
Server raylet.out file

I0417 05:50:03.973958    38 stats.h:62] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
I0417 05:50:03.975106    38 redis_client.cc:141] RedisClient connected.
I0417 05:50:03.983482    38 redis_gcs_client.cc:84] RedisGcsClient Connected.
I0417 05:50:03.984493    38 service_based_gcs_client.cc:63] ServiceBasedGcsClient Connected.
I0417 05:50:03.985126    38 grpc_server.cc:64] ObjectManager server started, listening on port 42295.
I0417 05:50:03.989686    38 grpc_server.cc:64] NodeManager server started, listening on port 44049.
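Because the startup message suggests checking the firewall, one thing that can be tested from the other machine is whether the ports the head node listens on are reachable at all. This is a minimal sketch with the standard `socket` module; `port_is_reachable` is a hypothetical helper, and the port numbers are the ones that appear in the logs of this post (they were assigned per session, so they will differ on another run).

```python
import socket
import sys


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Pass the head node IP as the first argument; defaults to localhost.
    head_ip = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    # Redis port from the `ray start` output, then the ObjectManager and
    # NodeManager ports from the server raylet.out above.
    for port in (53158, 42295, 44049):
        print(port, port_is_reachable(head_ip, port))
```

A successful connection only shows that the port is open from where the script runs; it does not prove that the raylets can register with each other.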

Client raylet.out file
This is a subset of the file. It contains hundreds of lines such as:

I0417 05:50:32.865006    23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
I0417 05:50:32.965395    23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
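With hundreds of such lines, it helps to know whether they all refer to the same client id or to several. Here is a short sketch that counts them per id; `count_unknown_heartbeats` is a hypothetical helper, and the regular expression simply mirrors the message format of the two lines above.

```python
import re
from collections import Counter

# Captures the client id at the end of a "[HeartbeatAdded]" line,
# in the format shown in the excerpt above.
UNKNOWN_HEARTBEAT = re.compile(
    r"\[HeartbeatAdded\]: received heartbeat from unknown client id ([0-9a-f]+)"
)


def count_unknown_heartbeats(lines):
    """Count 'unknown client' heartbeat messages per client id."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_HEARTBEAT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "I0417 05:50:32.865006    23 node_manager.cc:734] [HeartbeatAdded]: "
        "received heartbeat from unknown client id "
        "93a2294c6c338410485494864268d8eeeaf2ecc5",
        "I0417 05:50:32.965395    23 node_manager.cc:734] [HeartbeatAdded]: "
        "received heartbeat from unknown client id "
        "93a2294c6c338410485494864268d8eeeaf2ecc5",
    ]
    for client_id, n in count_unknown_heartbeats(sample).items():
        print(client_id, n)
```

On the real file, the function can be given the open file object directly, since it only iterates over lines. I am assuming raylet.out sits next to raylet.err under /tmp/ray/session_latest/logs/.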
