Describe the bug
When I try to run Hyperopt with AsyncHyperband, I see the following on my console:
Stage 0: : 4it [00:10, 2.27s/it] pid=5557, ip=192.168.44.226)
Stage 0: : 4it [00:10, 2.39s/it] pid=5556, ip=192.168.44.226)
Training: 0%| | 1/101457092405402533877 [00:00<22007761214225922:16:32, 1.28it/s]
Training: 0%| | 0/101457092405402533877 [00:00<?, ?it/s]
(BaseWorkerMixin pid=581, ip=192.168.35.224) Note: steps_per_checkpoint (was 2000) is now set to the number of steps per epoch: 11.
(BaseWorkerMixin pid=581, ip=192.168.35.224)
(BaseWorkerMixin pid=581, ip=192.168.35.224) Training for 101457092405402533877 step(s), approximately 9223372036854775808 epoch(s).
(BaseWorkerMixin pid=581, ip=192.168.35.224) Early stopping policy: -1 round(s) of evaluation, or -11 step(s), approximately -1 epoch(s).
(BaseWorkerMixin pid=581, ip=192.168.35.224)
(BaseWorkerMixin pid=581, ip=192.168.35.224) Starting with step 0, epoch: 0
Stage 0: : 4it [00:10, 3.15s/it] pid=5555, ip=192.168.44.226)
Stage 0: : 5it [00:11, 2.21s/it] pid=5555, ip=192.168.44.226)
Training: 0%| | 2/101457092405402533877 [00:01<18155989808141202:46:24, 1.55it/s]
Training: 0%| | 1/101457092405402533877 [00:00<21823230788576069:24:16, 1.29it/s]
Training: 0%| | 3/101457092405402533877 [00:01<16904276494085920:59:44, 1.67it/s]
Training: 0%| | 2/101457092405402533877 [00:01<18189075729949848:27:44, 1.55it/s]
Training: 0%| | 0/101457092405402533877 [00:00<?, ?it/s]
Training: 0%| | 4/101457092405402533877 [00:02<16324377596747498:22:56, 1.73it/s]
Ludwig reports an enormous number of steps and epochs, along with negative values for the early stopping policy. This feels like a major bug where something internal has gone badly wrong. In particular, it seems to happen when time_budget_s is hit before a given hyperopt trial has even started training.
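For what it's worth, the absurd numbers look internally consistent: 101457092405402533877 is exactly sys.maxsize (9223372036854775807) multiplied by the 11 steps per epoch reported in the log, which suggests a sentinel "train forever" epoch count being multiplied through. This is only an observation from the numbers themselves, not traced through Ludwig's code:

import sys

# Observation only (assumption about a sentinel value, not confirmed in Ludwig's source):
steps_per_epoch = 11                          # "steps per epoch: 11" from the log above
total_steps = sys.maxsize * steps_per_epoch   # sentinel epoch count * steps per epoch
print(total_steps)                            # 101457092405402533877 -- matches the log
print(total_steps / steps_per_epoch)          # 9.223372036854776e+18, displayed as ~9223372036854775808
# Likewise for early stopping: -1 epoch * 11 steps per epoch = -11 steps, as in the log.

The rest of the console output, where Tune hits the time budget and stops everything, looks like this: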
Stage 0: : 3it [00:03, 1.28it/s] pid=12614)
Stage 0: : 2it [00:03, 1.48s/it]
Stage 0: : 3it [00:03, 1.15it/s] pid=12615) 3.51s/it]
2022-08-04 16:26:11,494 INFO stopper.py:350 -- Reached timeout of 20 seconds. Stopping all trials.
== Status ==
Current time: 2022-08-04 16:26:11 (running for 00:00:24.99)
Memory usage on this node: 9.1/246.4 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 50.000: None
Resources requested: 0/150 CPUs, 0/12 GPUs, 0.0/500.0 GiB heap, 0.0/500.0 GiB objects (0.0/3.0 accelerator_type:T4)
Result logdir: /home/ray/src/results/criteo_hyperopt
Number of trials: 4/4 (4 TERMINATED)
+-------------------+------------+---------------------+-----------------------+-------------------------+
| Trial name | status | loc | trainer.decay_steps | trainer.learning_rate |
|-------------------+------------+---------------------+-----------------------+-------------------------|
| trial_c467a_00000 | TERMINATED | 192.168.44.226:5383 | 10000 | 0.001 |
| trial_c467a_00001 | TERMINATED | 192.168.86.8:12066 | 2000 | 0.005 |
| trial_c467a_00002 | TERMINATED | 192.168.72.27:697 | 10000 | 0.005 |
| trial_c467a_00003 | TERMINATED | 192.168.44.226:5414 | 8000 | 0.001 |
+-------------------+------------+---------------------+-----------------------+-------------------------+
Immediately after this, the model appears to start training with the step and epoch counts shown above, even though the trials have already been terminated and no training should be happening at all. There seems to be some coordination problem between Tune failing/stopping the trials and trial/model initialization.
To Reproduce
Steps to reproduce the behavior:
- Use any dataset (somewhat large, so that it takes a while to reach the first training step)
- Set the hyperopt executor to the following:
executor:
  type: ray
  scheduler:
    type: async_hyperband
    max_t: 50
    time_attr: time_total_s
    grace_period: 50
    reduction_factor: 5
  num_samples: 4
  time_budget_s: 20
  cpu_resources_per_trial: 1
  gpu_resources_per_trial: 1
Make sure your time_budget_s is smaller than the time it takes for training to actually start. The main goal is to have the trials terminated before they even begin training.
Please provide code, yaml config files and sample data to fully reproduce the issue. Issues that are not reproducible will be ignored.
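For reference, here is a minimal sketch of driving this through Ludwig's Python hyperopt entry point. The feature definitions, dataset path, and search space below are placeholders (the learning-rate and decay-steps values are copied from the trial table above), so treat this as an illustration rather than the exact experiment:

from ludwig.hyperopt.run import hyperopt

config = {
    "input_features": [{"name": "feature_1", "type": "number"}],   # placeholder feature
    "output_features": [{"name": "label", "type": "binary"}],      # placeholder output
    "hyperopt": {
        "goal": "minimize",
        "metric": "loss",
        "parameters": {
            "trainer.learning_rate": {"space": "choice", "categories": [0.001, 0.005]},
            "trainer.decay_steps": {"space": "choice", "categories": [2000, 8000, 10000]},
        },
        "executor": {
            "type": "ray",
            "num_samples": 4,
            "time_budget_s": 20,  # deliberately shorter than the time to the first training step
            "cpu_resources_per_trial": 1,
            "gpu_resources_per_trial": 1,
            "scheduler": {
                "type": "async_hyperband",
                "max_t": 50,
                "time_attr": "time_total_s",
                "grace_period": 50,
                "reduction_factor": 5,
            },
        },
    },
}

# Run against a dataset large enough that no trial reaches its first step within 20s.
hyperopt_results = hyperopt(config, dataset="large_dataset.csv")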
Expected behavior
Training should never start, and an empty hyperopt_results object should be returned. Ideally there would also be a very clear warning that things have failed.
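Concretely, a caller would expect something like this to hold after the run above (the ordered_trials attribute name is an assumption about the shape of the results object, used here only for illustration):

# Hypothetical post-run check; the attribute name is an assumption.
trials = getattr(hyperopt_results, "ordered_trials", None) or []
assert len(trials) == 0, "no trial should have produced training results"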
Environment (please complete the following information):
- OS: e.g. iOS
- Python version: 3.8.13
- Ludwig version: 0.6.dev
2 Answers
bmvo0sr51 1#
A few questions:
vqlkdk9b 2#
This seems to happen when time_budget_s is hit but the trials never started training. We see the following warning message: at this point the trials appear to have already been terminated, yet training seems to kick off after the trials are terminated, and then we see something like the following: