Once `time_budget_s` is reached during hyperopt, all of the Ray actors gradually die.
2022-08-04 17:47:10,441 INFO stopper.py:350 -- Reached timeout of 180 seconds. Stopping all trials.
== Status ==
Current time: 2022-08-04 17:47:10 (running for 00:03:01.46)
Memory usage on this node: 20.1/246.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/150 CPUs, 0/12 GPUs, 0.0/500.0 GiB heap, 0.0/500.0 GiB objects (0.0/3.0 accelerator_type:T4)
Current best trial: b74f3_00002 with metric_score=0.130369803722715 and parameters={'trainer.learning_rate': 0.005, 'trainer.decay_steps': 10000}
Result logdir: /home/ray/src/hyperopt
Number of trials: 4/4 (4 TERMINATED)
+-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------+
| Trial name | status | loc | trainer.decay_steps | trainer.learning_rate | iter | total time (s) | metric_score |
|-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------|
| trial_b74f3_00000 | TERMINATED | 192.168.44.226:44064 | 10000 | 0.001 | 10 | 165.849 | 0.132162 |
| trial_b74f3_00001 | TERMINATED | 192.168.74.69:5093 | 2000 | 0.005 | 11 | 172.588 | 0.131108 |
| trial_b74f3_00002 | TERMINATED | 192.168.72.27:6452 | 10000 | 0.005 | 10 | 166.155 | 0.13037 |
| trial_b74f3_00003 | TERMINATED | 192.168.45.45:55382 | 8000 | 0.001 | 10 | 162.189 | 0.132678 |
+-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------+
Training: 11%|█ | 120/1100 [02:29<10:47, 1.51it/s]
2022-08-04 17:47:10,603 WARNING worker.py:1382 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 77003339209b3e674c1826ef52407c82b1d419681c000000 Worker ID: fd87e59263039f1b712913a4de1750c0f527bae210284a5c54307c2b Node ID: 0b05b201854acb7ec8473e64f8b224140dc47236ddc8ecfb9903c3fe Worker IP address: 192.168.45.45 Worker port: 10201 Worker PID: 56401
(BaseWorkerMixin pid=5516, ip=192.168.35.224) The actor is dead because its owner has died. Owner Id: 7304cad6ec56a8c825c4de04dcb3f0106c885bc42b6ab195439eefe6 Owner Ip address: 192.168.45.45 Owner worker exit type: INTENDED_EXIT
2022-08-04 17:47:13,554 WARNING worker.py:1382 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 2fad8fa976dfdce26f68298242b39ff6194200d51c000000 Worker ID: a0a4f773d2064aa4385a9d7bf97b055a254d7e86b10b8ef35ffb6e91 Node ID: 0017c30b633d4339dba7461ec73a43b1f6e65ae839edf4bae757dcc9 Worker IP address: 192.168.86.8 Worker port: 10193 Worker PID: 957
Then things seem to hang for a while (roughly 30 seconds to 1 minute) before Ray Tune returns the trial results.
Ideally, the Ray workers/actors should not die.
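The timeout message above comes from Ray Tune's timeout stopper. For context, here is a minimal, hypothetical sketch of how such a 180 s budget is typically passed to Ray Tune; the trainable and parameter names are illustrative, and this is not the actual Ludwig hyperopt config used in this issue:

```python
from ray import tune


def trainable(config):
    # Placeholder training loop that reports a dummy metric each iteration.
    for step in range(100):
        tune.report(metric_score=1.0 / (step + 1) + config["learning_rate"])


analysis = tune.run(
    trainable,
    config={
        "learning_rate": tune.grid_search([0.001, 0.005]),
        "decay_steps": tune.choice([2000, 8000, 10000]),
    },
    metric="metric_score",
    mode="min",
    num_samples=2,       # 2 samples x 2 grid points = 4 trials
    time_budget_s=180,   # triggers "Reached timeout of 180 seconds. Stopping all trials."
)
print(analysis.best_config)
```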
2 Answers
r9f1avp51#
@jeffkinnison @arnavgarg1 Does this cause model training to fail, or is it just a temporary delay? If the former, I'd suggest P0; if the latter, P1/P2. cc @tgaddair
8yparm6h2#
@drishi I haven't seen this cause any training failures, but there is definitely a delay, so I agree this is probably a P1. On the Predibase side it may just be a confusing experience, because any one of the `trials` may not update its metrics for several minutes before moving on to the evaluation phase.