PyTorch Lightning: AimLogger finalizes after fit, breaking sessions that run fit and test together

ozxc1zmp posted 5 months ago in Other

❓Question

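I am setting up a PyTorch Lightning experiment and using an AimLogger object to record training/validation losses as well as test results.
Tracking works fine during trainer.fit, but it fails when trainer.test tries to load the model (the exception finally raised is TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'). My workaround is to disable the logger.finalize() call in the fit loop's teardown routine, but that does not seem like a good solution.
Is this behavior caused by a change in how PyTorch Lightning handles teardown that AimStack has not kept up with? Or is there a setting that prevents it?
I am using a remote repo, and I have confirmed that the problem does not occur if I use only a local repo. This makes me think some timing definitions may be at play here...
Here is a snippet of my setup: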

import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

model = Model()  # subclass of pl.LightningModule
dataset = DataModule()  # subclass of pl.LightningDataModule
logger = AimLogger(
    repo="aim://<server_address>:53800",
    experiment="experiment"
)
callbacks = AimCallbacks()  # subclass of pl.callbacks.Callback

trainer = pl.Trainer(
    log_every_n_steps=20,
    logger=logger,
    callbacks=callbacks
)

trainer.fit(model, datamodule=dataset)
# all goes fine until here
trainer.test(ckpt_path="best", datamodule=dataset)  # raises the Timeout TypeError unless the logger's fit teardown is disabled

I am also adding the traceback I received:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/project_path/experiments/data_generation.py", line 59, in <module>
    main()
  File "/home/user/project_path/experiments/data_generation.py", line 55, in main
    train_model()
  File "/home/user/project_path/experiments/data_generation.py", line 50, in train_model
    print(trainer.logger.experiment)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
           ^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 80, in experiment
    self._run = Run(
                ^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/user/pipenv_path/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
                                        ^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
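
A note on reading this traceback: the last frame rebuilds the server-side exception with `exception(*args) if args else exception()`, and the caret markers point at the `exception()` branch, so no arguments reached the client. The TypeError therefore looks like a side effect of re-creating the exception, while the underlying failure is presumably a lock timeout on the remote run. The sketch below reproduces that masking with a stand-in `Timeout` class and a simplified `raise_exception`; both are hypothetical illustrations written for this note, not Aim's actual code.

```python
class Timeout(TimeoutError):
    """Stand-in for a lock-timeout exception whose constructor requires an argument."""

    def __init__(self, lock_file: str) -> None:
        super().__init__()
        self.lock_file = lock_file


def raise_exception(exception, args=None):
    # Simplified version of the last traceback frame: rebuild the exception
    # from its class and whatever arguments were transmitted.
    raise exception(*args) if args else exception()


try:
    # No arguments were transmitted, so Timeout() is called without lock_file.
    raise_exception(Timeout)
except TypeError as err:
    # The constructor failure replaces the lock timeout that was meant to be raised.
    masked_error = str(err)

print(masked_error)
```

If this reading is right, the question to chase is why the run's lock cannot be acquired on the remote server after fit, not the TypeError itself.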

ldxq2e6h #1

Hi, how did you solve this problem?
Did you have to change the teardown function in Lightning?

def _teardown(self) -> None:
    """This is the Trainer's internal teardown, unrelated to the `teardown` hooks in LightningModule and
    Callback; those are handled by :meth:`_call_teardown_hook`."""
    self.strategy.teardown()
    loop = self._active_loop
    # loop should never be `None` here but it can because we don't know the trainer stage with `ddp_spawn`
    if loop is not None:
        loop.teardown()
    self._logger_connector.teardown()
    self._signal_connector.teardown()

Disable the second-to-last line?


yb3bgrhw #2

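The patch notes seem to say this issue has been fixed (https://aimstack.readthedocs.io/en/latest/generated/CHANGELOG.html#feb-7-2024-fixes), but when I use aim, the code hangs on entering the test loop. No error is raised, but the run is marked as finished as soon as the training loop ends, and the code then hangs when it tries to log anything from the test loop.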


7uzetpgm #3

Hi, how did you solve this problem?
Did you have to change the teardown function in Lightning?

def _teardown(self) -> None:
    """This is the Trainer's internal teardown, unrelated to the `teardown` hooks in LightningModule and
    Callback; those are handled by :meth:`_call_teardown_hook`."""
    self.strategy.teardown()
    loop = self._active_loop
    # loop should never be `None` here but it can because we don't know the trainer stage with `ddp_spawn`
    if loop is not None:
        loop.teardown()
    self._logger_connector.teardown()
    self._signal_connector.teardown()

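Disable the second-to-last line?
Hi, yes, exactly. It is not a great solution, though, and it forces me to tear down the logger manually once the whole process has finished.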
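
A less invasive variant of the same workaround might be to leave the Trainer untouched and defer the logger's own `finalize()` instead. The sketch below is only an illustration: `DeferredFinalizeMixin`, `finalize_now`, and `StubLogger` are hypothetical names, and the stub stands in for a real logger so the example runs without Aim or Lightning installed. Combining the mixin with AimLogger, for example `class DeferredAimLogger(DeferredFinalizeMixin, AimLogger)`, is an untested assumption.

```python
class DeferredFinalizeMixin:
    """Turn finalize() into a no-op until finalize_now() is called explicitly."""

    def finalize(self, status: str = "") -> None:
        # Called when the trainer tears down after fit(): only remember the
        # status, and keep the underlying run open.
        self._deferred_status = status

    def finalize_now(self) -> None:
        # Forward to the wrapped logger's real finalize() once fit and test are done.
        super().finalize(getattr(self, "_deferred_status", ""))


class StubLogger:
    """Stand-in for a real logger; it only records whether it was finalized."""

    def __init__(self) -> None:
        self.closed = False
        self.final_status = None

    def finalize(self, status: str = "") -> None:
        self.closed = True
        self.final_status = status


class DeferredStubLogger(DeferredFinalizeMixin, StubLogger):
    pass


logger = DeferredStubLogger()
logger.finalize("success")           # what the teardown after fit would trigger
closed_after_fit = logger.closed     # still open, so a test stage could keep logging
logger.finalize_now()                # manual teardown after trainer.test(...)
closed_after_manual = logger.closed
```

With this pattern the manual step reduces to a single `finalize_now()` call after `trainer.test(...)`, ideally inside a `try`/`finally` so the run is still closed if testing fails.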


x6yk4ghg #4

It looks like #3134 will resolve this issue.
