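❓ Question
The fairseq library has built-in support for aim, but I'm struggling to get it working. I'm not sure whether I'm doing something wrong or whether the fairseq support is simply out of date, but the fairseq repo is fairly inactive, so I thought I'd ask here.
I run aim server locally and see: "Server is mounted on 0.0.0.0:53800".
Then I run my fairseq experiment, adding the following to my config.yaml file: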
common:
  aim_repo: aim://0.0.0.0:53800
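Before involving fairseq at all, it can help to confirm that the aim_repo value has the remote form aim://host:port and that something is actually listening there. The snippet below is a minimal sketch using only the Python standard library; parse_aim_repo and is_reachable are hypothetical helper names, not part of aim or fairseq. Note that 0.0.0.0 is the address the server binds to; as a client target it usually reaches the local machine on Linux, but 127.0.0.1 is likely the more portable choice.

```python
import socket
from urllib.parse import urlsplit


def parse_aim_repo(repo):
    """Split a remote repo string such as 'aim://host:port' into (host, port).

    Hypothetical helper for illustration; raises ValueError for anything
    that is not of the aim://host:port form.
    """
    parts = urlsplit(repo)
    if parts.scheme != "aim" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected aim://host:port, got {repo!r}")
    return parts.hostname, parts.port


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = parse_aim_repo("aim://0.0.0.0:53800")
    print(host, port, "reachable" if is_reachable(host, port) else "not reachable")
```

A successful TCP connection only shows that the port is open, not that the tracking server accepts the client, so this rules out the simplest failure mode and nothing more.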
Then I run my experiment. It seems to work at first: aim detects the experiment, and the log begins with:
[2023-11-15 14:31:07,453][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:07,480][aim.sdk.reporter][INFO] - creating RunStatusReporter for f6f19ecf0e2147b19e24d52f
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting from: {}
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x7f57117363e0>
[2023-11-15 14:31:08,471][fairseq.trainer][INFO] - begin training epoch 1
[2023-11-15 14:31:08,471][fairseq_cli.train][INFO] - Start iterating over samples
[2023-11-15 14:31:10,821][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
[2023-11-15 14:31:12,261][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
[2023-11-15 14:31:12,261][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2023-11-15 14:31:12,266][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:12,283][fairseq.logging.progress_bar][INFO] - Appending to run: f6f19ecf0e2147b19e24d52f
But then I get an error:
...
File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 64, in progress_bar
bar = AimProgressBarWrapper(
File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 365, in __init__
self.run = get_aim_run(aim_repo, aim_run_hash)
File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 333, in get_aim_run
return Run(run_hash=run_hash, repo=repo)
File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
_SafeModeConfig.exception_callback(e, func)
File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
raise e
File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
return func(*args, **kwargs)
File "/lib/python3.10/site-packages/aim/sdk/run.py", line 828, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
File "/lib/python3.10/site-packages/aim/sdk/run.py", line 276, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
File "/lib/python3.10/site-packages/aim/sdk/base_run.py", line 50, in __init__
self._lock.lock(force=force_resume)
File "/lib/python3.10/site-packages/aim/storage/lock_proxy.py", line 38, in lock
return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
return self._run_read_instructions(queue_id, resource, method, args)
File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
raise_exception(status_msg.header.exception)
File "lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
raise exception(*args) if args else exception()
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
File "lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
if self._try_exec_task(task_f, *args):
File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
task_f(*args)
File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 301, in _run_write_instructions
raise_exception(response.exception)
File "/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError: 3310c526-aa51-47ef-ba87-fbf75f80f610
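Does anyone know why this might be happening, or whether my approach is wrong? I've tried various aim versions (going back to when fairseq was more actively developed), but I still get the error.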
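A note on reading the first traceback above: the TypeError is probably not the root problem. The line `raise exception(*args) if args else exception()` re-creates an exception type reported by the server, and when that type requires a constructor argument and none was sent, the re-creation itself fails. The sketch below reproduces that pattern with a stand-in Timeout class (an assumption; the traceback does not say which module Timeout comes from, though a lock-timeout exception such as the one in the filelock package would fit). If that reading is right, the underlying event is a timeout while acquiring the run's lock, reached via `self._lock.lock(force=force_resume)`.

```python
class Timeout(Exception):
    """Stand-in for a lock-timeout exception whose constructor needs an argument."""

    def __init__(self, lock_file):
        super().__init__(lock_file)
        self.lock_file = lock_file


def raise_exception(exception, args=()):
    # Same shape as the line shown in the traceback: with no args,
    # the exception class is called without any arguments.
    raise exception(*args) if args else exception()


def describe(exception, args=()):
    """Return 'TypeName: message' for whatever raise_exception ends up raising."""
    try:
        raise_exception(exception, args)
    except Exception as err:
        return f"{type(err).__name__}: {err}"


if __name__ == "__main__":
    # No args transmitted: constructing Timeout() fails, so a TypeError
    # surfaces in place of the intended lock timeout.
    print(describe(Timeout))
    # With the argument present, the intended exception is raised.
    print(describe(Timeout, ("run.lock",)))
```

Under this interpretation, the useful question is why the lock on run f6f19ecf0e2147b19e24d52f could not be acquired when the validation progress bar tried to append to it.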
4 Answers
insrf1ej1#
Adding @tmynn to this thread, as he put the integration together.
pw9qyyiw2#
@SGevorg, @henrycharlesworth, it seems this line points to the real error:
@henrycharlesworth, are you using the latest version of Aim?
ep6jt1vc3#
I think so: I'm using version 3.17.5. I tried some earlier versions, but that didn't seem to help.
fcwjkofz4#
Is there a workaround for this issue? I keep getting this error whenever I try to retrieve an existing run by its hash.
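One possible direction, offered as an untested sketch rather than a confirmed fix: the log in the question shows the run being created during training and then reopened by hash for validation ("Appending to run: ..."), so the failure may come from opening a second handle on a run whose lock the same process already holds. If so, keeping a single handle per run hash and reusing it would avoid the second lock attempt. The code below illustrates the idea with a stand-in FakeRun class that simulates an exclusive lock; RunRegistry and FakeRun are hypothetical names, and nothing here calls aim itself.

```python
class FakeRun:
    """Stand-in for a tracked run that takes an exclusive lock on its hash."""

    _locked = set()
    _counter = 0

    def __init__(self, run_hash=None):
        if run_hash is None:
            # No hash given: create a fresh run with a new hash.
            FakeRun._counter += 1
            run_hash = f"run{FakeRun._counter}"
        if run_hash in FakeRun._locked:
            raise RuntimeError(f"run {run_hash} is already locked")
        FakeRun._locked.add(run_hash)
        self.hash = run_hash


class RunRegistry:
    """Hands out one shared handle per run hash instead of reopening it."""

    def __init__(self, open_run):
        # open_run is any callable taking an optional hash and returning an
        # object with a .hash attribute. With aim this might be something like
        # lambda h: Run(run_hash=h, repo=repo) (an assumption, not tested here).
        self._open_run = open_run
        self._by_hash = {}

    def get(self, run_hash=None):
        if run_hash is not None and run_hash in self._by_hash:
            return self._by_hash[run_hash]
        run = self._open_run(run_hash)
        self._by_hash[run.hash] = run
        return run


if __name__ == "__main__":
    registry = RunRegistry(FakeRun)
    train_run = registry.get()                 # first progress bar creates the run
    valid_run = registry.get(train_run.hash)   # second one reuses the same handle
    print(valid_run is train_run)
```

Whether this applies depends on how the installed fairseq version caches its run handle, so it is worth checking the get_aim_run function in fairseq/logging/progress_bar.py before patching anything.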