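❓ Question
I am setting up a PyTorch Lightning experiment and using an AimLogger object to record the training/validation losses and the test results.
Tracking works fine during trainer.fit, but trainer.test fails when it tries to load the model (the exception finally raised is TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'). My workaround is to disable the logger.finalize() call in the fit loop's teardown routine, but that should not be the right solution.
Is this behavior caused by a change in how PyTorch Lightning handles teardown that AimStack has not kept up with? Or is there a setting that prevents it?
I am using a remote repository, and I have confirmed that the problem does not occur when I use only a local repository. This makes me think some timing-related setting may be at play here...
Here is a snippet of my setup code: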
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

model = Model()  # subclass of pl.LightningModule
dataset = DataModule()  # subclass of pl.LightningDataModule
logger = AimLogger(
    repo="aim://<server_address>:53800",
    experiment="experiment",
)
callbacks = AimCallbacks()  # subclass of pl.callbacks.Callback
trainer = pl.Trainer(
    log_every_n_steps=20,
    logger=logger,
    callbacks=callbacks,
)
trainer.fit(model, datamodule=dataset)
# everything works fine up to here
# the next call raises the Timeout TypeError unless the logger's fit teardown is disabled
trainer.test(ckpt_path="best", datamodule=dataset)
I am also adding the traceback I received:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/user/project_path/experiments/data_generation.py", line 59, in <module>
main()
File "/home/user/project_path/experiments/data_generation.py", line 55, in main
train_model()
File "/home/user/project_path/experiments/data_generation.py", line 50, in train_model
print(trainer.logger.experiment)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
return fn(self)
^^^^^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 80, in experiment
self._run = Run(
^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
_SafeModeConfig.exception_callback(e, func)
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
raise e
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
self._lock.lock(force=force_resume)
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
return self._run_read_instructions(queue_id, resource, method, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
raise_exception(status_msg.header.exception)
File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
raise exception(*args) if args else exception()
^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
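One possible reading of the last two frames above: the client rebuilds the server-side exception with raise exception(*args) if args else exception(). If the original exception class requires a constructor argument (as a lock Timeout that takes lock_file does) and no arguments were transmitted, the rebuild itself fails with a TypeError. The sketch below reproduces that mechanism with stand-in names only. The Timeout class, raise_exception, and rebuild here are simplified assumptions modeled on the traceback, not Aim's actual code.

```python
class Timeout(TimeoutError):
    """Stand-in for a lock-timeout exception whose constructor requires lock_file."""

    def __init__(self, lock_file):
        super().__init__(f"could not acquire lock on {lock_file}")
        self.lock_file = lock_file


def raise_exception(exception, args):
    """Simplified stand-in: re-create an exception from its class and transmitted args."""
    raise exception(*args) if args else exception()


def rebuild(exception, args):
    """Return whichever exception actually surfaces when rebuilding."""
    try:
        raise_exception(exception, args)
    except Exception as err:  # broad on purpose: we want to inspect what came out
        return err


# With the argument transmitted, the original Timeout is rebuilt as intended.
with_args = rebuild(Timeout, ("run.lock",))

# Without arguments, constructing Timeout fails and a TypeError surfaces instead.
without_args = rebuild(Timeout, ())

print(type(with_args).__name__, "|", with_args)
print(type(without_args).__name__, "|", without_args)
```

If this reading is right, the TypeError would be masking a lock timeout raised on the remote side, which would fit the observation that the problem only appears with a remote repository.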
4 answers
ldxq2e6h1#
Hi, how did you work around this problem?
Did you have to change the teardown function in Lightning?
Did you disable the second-to-last line?
yb3bgrhw2#
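The changelog seems to say that this issue has been fixed (https://aimstack.readthedocs.io/en/latest/generated/CHANGELOG.html#feb-7-2024-fixes), but when I use aim, the code hangs on entering the test loop. No error is raised, but the run is marked as finished as soon as the training loop ends, and the code then hangs when it tries to log anything from the test loop.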
7uzetpgm3#
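> Hi, how did you work around this problem?
> Did you have to change the teardown function in Lightning?
> Did you disable the second-to-last line?
Hi, yes, exactly. It is not a great solution, though, and it forces me to tear down the logger manually once the whole process has finished.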
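The workaround discussed above (skipping the logger's finalize() during fit teardown and closing the logger manually at the end) could be sketched as a small mixin. This is only an illustration under assumptions: DeferFinalizeMixin, finalize_now, StandInLogger, and DeferredLogger are hypothetical names, and StandInLogger merely imitates a logger with a finalize(status) method so that the example runs without Aim or Lightning installed. Whether the mixin behaves the same way when combined with AimLogger has not been verified here.

```python
class DeferFinalizeMixin:
    """Hypothetical mixin: ignore finalize() until finalize_now() is called."""

    def finalize(self, status=""):
        # Remember the most recent status instead of closing the logger.
        self._pending_status = status

    def finalize_now(self):
        # Forward the remembered status to the base logger's real finalize().
        super().finalize(getattr(self, "_pending_status", ""))


class StandInLogger:
    """Stand-in for a real logger; it only records that it was finalized."""

    def __init__(self):
        self.closed = False
        self.statuses = []

    def finalize(self, status=""):
        self.statuses.append(status)
        self.closed = True


class DeferredLogger(DeferFinalizeMixin, StandInLogger):
    """The mixin comes first so that its finalize() takes precedence."""


logger = DeferredLogger()

logger.finalize("success")        # what the fit teardown would call
closed_after_fit = logger.closed  # the logger is still open at this point

logger.finalize("success")        # what the test teardown would call
logger.finalize_now()             # manual teardown once everything is done
closed_at_end = logger.closed

print(closed_after_fit, closed_at_end, logger.statuses)
```

With the real classes, the analogous combination would presumably be a class deriving from DeferFinalizeMixin and AimLogger, passed to pl.Trainer, with finalize_now() called once after trainer.test returns. Treat that as an untested assumption rather than a confirmed fix.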
x6yk4ghg4#
It looks like #3134 will fix this issue.