ludwig Hyperopt bug: huge and negative values for evaluation rounds, steps, and epochs when using Async Hyperband

gg58donl · posted 5 months ago in Other

Describe the bug

When I try to run Hyperopt with AsyncHyperband, I see the following on my console:

Stage 0: : 4it [00:10,  2.27s/it] pid=5557, ip=192.168.44.226) 
Stage 0: : 4it [00:10,  2.39s/it] pid=5556, ip=192.168.44.226) 
Training:   0%|          | 1/101457092405402533877 [00:00<22007761214225922:16:32,  1.28it/s]
Training:   0%|          | 0/101457092405402533877 [00:00<?, ?it/s]
(BaseWorkerMixin pid=581, ip=192.168.35.224) Note: steps_per_checkpoint (was 2000) is now set to the number of steps per epoch: 11.
(BaseWorkerMixin pid=581, ip=192.168.35.224) 
(BaseWorkerMixin pid=581, ip=192.168.35.224) Training for 101457092405402533877 step(s), approximately 9223372036854775808 epoch(s).
(BaseWorkerMixin pid=581, ip=192.168.35.224) Early stopping policy: -1 round(s) of evaluation, or -11 step(s), approximately -1 epoch(s).
(BaseWorkerMixin pid=581, ip=192.168.35.224) 
(BaseWorkerMixin pid=581, ip=192.168.35.224) Starting with step 0, epoch: 0
Stage 0: : 4it [00:10,  3.15s/it] pid=5555, ip=192.168.44.226) 
Stage 0: : 5it [00:11,  2.21s/it] pid=5555, ip=192.168.44.226) 
Training:   0%|          | 2/101457092405402533877 [00:01<18155989808141202:46:24,  1.55it/s]
Training:   0%|          | 1/101457092405402533877 [00:00<21823230788576069:24:16,  1.29it/s]
Training:   0%|          | 3/101457092405402533877 [00:01<16904276494085920:59:44,  1.67it/s]
Training:   0%|          | 2/101457092405402533877 [00:01<18189075729949848:27:44,  1.55it/s]
Training:   0%|          | 0/101457092405402533877 [00:00<?, ?it/s]
Training:   0%|          | 4/101457092405402533877 [00:02<16324377596747498:22:56,  1.73it/s]

Ludwig reports absurdly large step and epoch counts, and negative values for the early stopping policy. This looks like a serious bug where some internal bookkeeping goes badly wrong. In particular, it seems to happen when time_budget_s is reached before a given hyperopt trial has even started training.

Stage 0: : 3it [00:03,  1.28it/s] pid=12614) 
Stage 0: : 2it [00:03,  1.48s/it]                     
Stage 0: : 3it [00:03,  1.15it/s] pid=12615) 3.51s/it]
2022-08-04 16:26:11,494 INFO stopper.py:350 -- Reached timeout of 20 seconds. Stopping all trials.
== Status ==
Current time: 2022-08-04 16:26:11 (running for 00:00:24.99)
Memory usage on this node: 9.1/246.4 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 50.000: None
Resources requested: 0/150 CPUs, 0/12 GPUs, 0.0/500.0 GiB heap, 0.0/500.0 GiB objects (0.0/3.0 accelerator_type:T4)
Result logdir: /home/ray/src/results/criteo_hyperopt
Number of trials: 4/4 (4 TERMINATED)
+-------------------+------------+---------------------+-----------------------+-------------------------+
| Trial name        | status     | loc                 |   trainer.decay_steps |   trainer.learning_rate |
|-------------------+------------+---------------------+-----------------------+-------------------------|
| trial_c467a_00000 | TERMINATED | 192.168.44.226:5383 |                 10000 |                   0.001 |
| trial_c467a_00001 | TERMINATED | 192.168.86.8:12066  |                  2000 |                   0.005 |
| trial_c467a_00002 | TERMINATED | 192.168.72.27:697   |                 10000 |                   0.005 |
| trial_c467a_00003 | TERMINATED | 192.168.44.226:5414 |                  8000 |                   0.001 |
+-------------------+------------+---------------------+-----------------------+-------------------------+

Immediately after this, the models appear to begin training with the step and epoch counts shown above, even though the trials have already been terminated and no training should take place at all. There seems to be some coordination problem between Tune failing/stopping the trials and trial/model initialization.
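The specific oversized values are suggestive: 9223372036854775808 is exactly 2**63, and 101457092405402533877 is exactly sys.maxsize * 11, where 11 is the steps-per-epoch value reported in the log. A hypothetical reconstruction of the arithmetic (illustrative variable names, not Ludwig's actual code):

```python
import math
import sys

# Hypothetical reconstruction, NOT Ludwig's actual code: a "train until the
# time budget expires" sentinel leaking into the user-facing step/epoch counts.
steps_per_epoch = 11              # "number of steps per epoch: 11" in the log
epochs_sentinel = sys.maxsize     # assumed "unlimited" sentinel (2**63 - 1)

total_steps = epochs_sentinel * steps_per_epoch
print(total_steps)                # 101457092405402533877, as in the log

# Float division rounds 2**63 - 1 up to 2**63, reproducing the epoch count.
approx_epochs = math.ceil(total_steps / steps_per_epoch)
print(approx_epochs)              # 9223372036854775808, as in the log

# A -1 "disabled" sentinel multiplied by steps per epoch gives the -11 steps
# shown in the early stopping line.
early_stop_rounds = -1
print(early_stop_rounds * steps_per_epoch)   # -11, as in the log
```

If this reading is right, the values are not random corruption but unclamped sentinels flowing through ordinary arithmetic into the progress bars and log messages.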

To Reproduce

Steps to reproduce the behavior:

  1. Use any dataset (somewhat large, so that it takes a while to reach the first training step)
  2. Set the hyperopt executor to the following:
executor:
    type: ray
    scheduler:
      type: async_hyperband
      max_t: 50
      time_attr: time_total_s
      grace_period: 50
      reduction_factor: 5
    num_samples: 4
    time_budget_s: 20
    cpu_resources_per_trial: 1
    gpu_resources_per_trial: 1

Make sure that your time_budget_s here is smaller than the time it takes before training actually starts. The main goal is to terminate the trials before they even begin training.
Please provide code, a yaml config file, and sample data to fully reproduce the issue. Issues that cannot be reproduced will be ignored.
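For reference, the executor fragment from step 2 can be embedded in a complete hyperopt config. The input/output features and search space below are illustrative placeholders, not taken from the original report:

```python
# Illustrative full config around the executor from step 2. Feature names and
# the parameter search space are placeholders, not from the original report.
config = {
    "input_features": [{"name": "text", "type": "text"}],      # placeholder
    "output_features": [{"name": "label", "type": "binary"}],  # placeholder
    "hyperopt": {
        "parameters": {  # placeholder search space
            "trainer.learning_rate": {
                "space": "choice", "categories": [0.001, 0.005],
            },
            "trainer.decay_steps": {
                "space": "choice", "categories": [2000, 8000, 10000],
            },
        },
        "executor": {
            "type": "ray",
            "scheduler": {
                "type": "async_hyperband",
                "max_t": 50,
                "time_attr": "time_total_s",
                "grace_period": 50,
                "reduction_factor": 5,
            },
            "num_samples": 4,
            "time_budget_s": 20,  # must be shorter than time-to-first-step
            "cpu_resources_per_trial": 1,
            "gpu_resources_per_trial": 1,
        },
    },
}
```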

Expected behavior

Training should never start, and an empty hyperopt_results object should be returned. Ideally with a very clear warning that things have failed.
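A hedged sketch of the guard this expected behavior implies; the function name and result shape are assumptions, not Ludwig's actual API:

```python
import logging

# Hypothetical guard, NOT Ludwig's actual API: if the time budget expired
# before any trial trained, return empty results and warn loudly instead of
# letting workers enter the training loop.
def finalize_hyperopt_results(trial_results):
    if not trial_results:
        logging.warning(
            "time_budget_s expired before any trial started training; "
            "returning empty hyperopt results"
        )
        return []
    return sorted(trial_results, key=lambda r: r["metric_score"])

print(finalize_hyperopt_results([]))  # [] -> training is never started
```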

Environment (please complete the following information):

  • OS: e.g. iOS
  • Python version: 3.8.13
  • Ludwig version: 0.6.dev

bmvo0sr5 1#

A few questions:

  1. I assume the negative numbers mean it should train until time runs out. In that case this part may be fine, but the huge step/epoch counts still look like a bug.
  2. Eventually, all the Ray actors die and the whole training job appears to freeze indefinitely.
  3. Eventually "Hyperopt finished" is printed, which unblocks the hang. Inspecting the returned object shows that no best trial was found, which makes sense.
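If the negative values really are "no limit" sentinels, one fix for point 1 would be to translate them before printing. A hypothetical formatter, not Ludwig code:

```python
import sys

# Hypothetical formatter, NOT Ludwig code: surface sentinel limits as
# "unlimited" instead of raw negative or overflow-sized numbers.
def describe_limit(value, unit):
    if value < 0 or value >= sys.maxsize:
        return f"unlimited {unit}"
    return f"{value} {unit}"

print(describe_limit(-1, "epoch(s)"))               # unlimited epoch(s)
print(describe_limit(sys.maxsize * 11, "step(s)"))  # unlimited step(s)
print(describe_limit(11, "step(s)"))                # 11 step(s)
```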

vqlkdk9b 2#

This seems to happen when time_budget_s is hit but the trials never started training, and we see the following warning messages:

(raylet) [2022-09-14 05:44:38,219 E 27846 27846] (raylet) worker_pool.cc:518: Some workers of the worker process(32975) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(raylet) [2022-09-14 05:44:44,889 E 27846 27846] (raylet) worker_pool.cc:518: Some workers of the worker process(33199) have not registered within the timeout. The process is still alive, probably it's hanging during start.

At this point the trials appear to have been terminated, yet training seems to start after trial termination, and then we see something like the following:

(raylet) [2022-09-14 05:44:38,219 E 27846 27846] (raylet) worker_pool.cc:518: Some workers of the worker process(32975) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(raylet) [2022-09-14 05:44:44,889 E 27846 27846] (raylet) worker_pool.cc:518: Some workers of the worker process(33199) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(pid=32547) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=32555) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=32553) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=32554) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=33996) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=33993) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=34448) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(PipelineSplitExecutorCoordinator pid=32554) 
Stage 0: : 2it [00:18,  9.17s/it]             .33s/it]
Stage 0: : 2it [00:18,  9.20s/it]             
Stage 0: : 3it [00:18,  5.50s/it] pid=32554) 8.38s/it]
Stage 0: : 3it [00:18,  5.49s/it] pid=32547)          
Stage 0: 100%|██████████| 1/1 [00:18<00:00, 18.77s/it]
Stage 0: 100%|██████████| 1/1 [00:18<00:00, 18.77s/it]
Stage 0: : 2it [00:19,  7.93s/it]                     
Stage 0: : 2it [00:19,  7.98s/it]                     
Stage 0: 100%|██████████| 1/1 [00:07<00:00,  7.93s/it]
Stage 0: 100%|██████████| 1/1 [00:08<00:00,  8.11s/it]
Stage 0: : 2it [00:08,  3.57s/it]                     
Stage 0: : 2it [00:08,  3.57s/it]                     
(BaseWorkerMixin pid=31646) Replacing 'combined' validation field with 'Survived' as the specified validation metric loss is invalid for 'combined' but is valid for 'Survived'.
(BaseWorkerMixin pid=31647) Replacing 'combined' validation field with 'Survived' as the specified validation metric loss is invalid for 'combined' but is valid for 'Survived'.
Stage 0: : 4it [00:20,  4.18s/it] pid=32554) 
Stage 0: : 4it [00:20,  4.16s/it] pid=32547) 
(pid=35120) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
Stage 0: : 5it [00:29,  5.68s/it] pid=32554) 
Stage 0: : 5it [00:29,  5.71s/it] pid=32547) 
(BaseWorkerMixin pid=31647) Training for 46116860184273879035 step(s), approximately 9223372036854775808 epoch(s).
(BaseWorkerMixin pid=31647) Early stopping policy: -1 round(s) of evaluation, or -5 step(s), approximately -1 epoch(s).
(BaseWorkerMixin pid=31647) 
(BaseWorkerMixin pid=31647) Starting with step 0, epoch: 0
(BaseWorkerMixin pid=31646) Training for 46116860184273879035 step(s), approximately 9223372036854775808 epoch(s).
(BaseWorkerMixin pid=31646) Early stopping policy: -1 round(s) of evaluation, or -5 step(s), approximately -1 epoch(s).
(BaseWorkerMixin pid=31646) 
(BaseWorkerMixin pid=31646) Starting with step 0, epoch: 0
Training:   0%|          | 0/46116860184273879035 [00:00<?, ?it/s]
Training:   0%|          | 1/46116860184273879035 [00:00<1891077536402636:48:00,  6.77it/s]
Training:   0%|          | 4/46116860184273879035 [00:00<741062665075141:49:52, 17.29it/s] 
Training:   0%|          | 0/46116860184273879035 [00:00<?, ?it/s]
Training:   0%|          | 1/46116860184273879035 [00:00<2681824919706373:41:20,  4.78it/s]
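The race described in this answer, trials terminated by the Tune stopper while workers still enter the training loop afterwards, could be guarded with a shared cancellation flag that each worker checks before training begins. A minimal single-process illustration with threading.Event, not Ludwig/Ray internals:

```python
import threading

# Minimal illustration, NOT Ludwig/Ray internals: a worker checks a shared
# stop flag before entering its training loop.
stop_event = threading.Event()

def run_trial(stop_event):
    if stop_event.is_set():   # trial was cancelled before training began
        return None           # never start training
    return "trained"

stop_event.set()              # simulates time_budget_s expiring first
print(run_trial(stop_event))  # None: the training loop is never entered
```

In the distributed setting the flag would have to be a cluster-visible signal rather than an in-process event, but the ordering requirement is the same: cancellation must be observed before training state is initialized.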
