During Paddle distributed training, recording Loss/Acc with VisualDL / tensorboardX crashes with a log-directory error.

pgccezyw  posted 5 months ago in Other

Describe the Bug

Version:
paddlepaddle-gpu 2.5.1.post117
visualdl 2.4.2

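Description:
When running 4-GPU distributed training with the fleet API, I use visualdl / tensorboardx to record the training acc/loss.
When execution reaches the line that creates the LogWriter, it fails with FileExistsError: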

[2023-08-11 08:09:01,447] [ WARNING] fleet.py:290 - The dygraph parallel environment has been initialized.
[2023-08-11 08:09:01,448] [ WARNING] fleet.py:313 - The dygraph hybrid parallel environment has been initialized.
Traceback (most recent call last):
  File "main.py", line 23, in <module>
    logwriter = LogWriter(logdir='./runs/')
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/writer.py", line 120, in __init__
    self._get_file_writer()
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/writer.py", line 135, in _get_file_writer
    self._file_writer = RecordFileWriter(
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/record_writer.py", line 90, in __init__
    bfile.makedirs(logdir)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/io/bfile.py", line 695, in makedirs
    return default_file_factory.get_filesystem(path).makedirs(path)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/io/bfile.py", line 97, in makedirs
    os.makedirs(path)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './runs/'
LAUNCH INFO 2023-08-11 08:09:04,712 Exit code -15
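
The traceback ends in `os.makedirs(path)` called without `exist_ok`. A plausible reading (an assumption, not confirmed in this thread) is that the four launched processes all try to create `./runs/` at about the same time, so every process after the first finds the directory already there. The standard-library sketch below only reproduces that behaviour of `os.makedirs`; the helper `make_log_dir` is hypothetical and is not part of visualdl.

```python
import os
import tempfile


def make_log_dir(path, exist_ok=False):
    """Try to create `path`; return None on success or the error class name."""
    try:
        os.makedirs(path, exist_ok=exist_ok)
        return None
    except FileExistsError as err:
        return type(err).__name__


with tempfile.TemporaryDirectory() as root:
    logdir = os.path.join(root, "runs")
    first = make_log_dir(logdir)                    # the first process creates the directory
    second = make_log_dir(logdir)                   # a later process finds it already present
    tolerant = make_log_dir(logdir, exist_ok=True)  # exist_ok=True accepts an existing directory
    print(first, second, tolerant)                  # None FileExistsError None
```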

Additional Supplementary Information

No response

htrmnn0y  #1

Is nobody going to look into this issue?

cnh2zyt3  #2

What about deleting that directory? Or pointing the log at a different directory?

14ifxucb  #3

That doesn't help. I have tried it. Deleting the directory doesn't help either, and the default arguments raise the same error.

raogr8fs  #4

os.makedirs(path, exist_ok=True)

chhqkbe1  #5

os.makedirs(path, exist_ok=True)

OK, thanks.
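
The answers above do not say where to put the `exist_ok=True` call. One way to apply the idea from the training script, without editing the installed visualdl files, might be to create the log directory yourself before constructing the writer and to give each process its own subdirectory. This is a sketch under assumptions: `prepare_logdir` is a hypothetical helper, the `PADDLE_TRAINER_ID` environment variable is assumed to hold the process rank under Paddle's launcher, and whether pre-creating the directory is enough depends on how the installed visualdl version checks for it, which has not been verified here.

```python
import os
import tempfile


def prepare_logdir(base, per_rank=True):
    """Create the log directory idempotently and return (rank, logdir).

    Assumption: the launcher exports PADDLE_TRAINER_ID; fall back to rank 0.
    """
    rank = int(os.environ.get("PADDLE_TRAINER_ID", "0"))
    logdir = os.path.join(base, f"rank_{rank}") if per_rank else base
    os.makedirs(logdir, exist_ok=True)  # does not raise if the directory already exists
    return rank, logdir


# Demonstration in a temporary directory; a real script would pass "./runs".
with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "runs")
    rank, logdir = prepare_logdir(base)
    rank_again, logdir_again = prepare_logdir(base)  # a repeated call is harmless
    created = os.path.isdir(logdir)

# In main.py the result would then be passed on, for example:
#   from visualdl import LogWriter
#   rank, logdir = prepare_logdir("./runs")
#   logwriter = LogWriter(logdir=logdir)
```

With one subdirectory per rank, no two processes create or write to the same path. If only one set of curves is wanted, an alternative is to construct the writer on rank 0 only.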
