TensorFlow 2.0 Keras: saving a model to HDFS fails with "Can't decrement id ref count"

Posted by qvsjd97n on 2023-04-30 in HDFS

I have mounted an HDFS drive through hdfs fuse, so I can access HDFS via the path /hdfs/xxx.
After training a model with Keras, I want to save it to /hdfs/model.h5 with model.save("/hdfs/model.h5").
I get the following error:

2020-02-26T10:06:51.83869705Z   File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838791107Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838796288Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838802442Z Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
2020-02-26T10:06:51.838807122Z Traceback (most recent call last):
2020-02-26T10:06:51.838811833Z   File "h5py/_objects.pyx", line 193, in h5py._objects.ObjectID.__dealloc__
2020-02-26T10:06:51.838816793Z RuntimeError: Can't decrement id ref count (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838821942Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x7f20d000ddc8, total write size = 512, bytes this sub-write = 512, bytes actually written = 18446744073709551615, offset = 298264)
2020-02-26T10:06:51.838827917Z Traceback (most recent call last):
2020-02-26T10:06:51.838832755Z   File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 117, in save_model_to_hdf5
2020-02-26T10:06:51.838838098Z     f.flush()
2020-02-26T10:06:51.83885453Z   File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 452, in flush
2020-02-26T10:06:51.838859816Z     h5f.flush(self.id)
2020-02-26T10:06:51.838864401Z   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838869302Z   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2020-02-26T10:06:51.838874126Z   File "h5py/h5f.pyx", line 146, in h5py.h5f.flush
2020-02-26T10:06:51.838879016Z RuntimeError: Can't flush cache (file write failed: time = Wed Feb 26 10:06:51 2020
2020-02-26T10:06:51.838885827Z , filename = '/hdfs/model.h5', file descriptor = 3, errno = 95, error message = 'Operation not supported', buf = 0x4e5b018, total write size = 4, bytes this sub-write = 4, bytes actually written = 18446744073709551615, offset = 34552)

But I can write a plain file directly to the same path:

# Open in write mode; the default mode "r" would not allow f.write()
with open("/hdfs/a.txt", "w") as f:
    f.write("1")

I also came up with a hacky workaround, and it works:

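from shutil import move

model.save("temp.h5")               # write the HDF5 file to local disk first
move("temp.h5", "/hdfs/model.h5")   # then move the finished file onto the mount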
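The workaround above can be wrapped in a small helper so that the temporary file is cleaned up even if saving fails. This is only a sketch: `save_via_local_temp` and its parameters are hypothetical names, not part of Keras, and it assumes `save_fn` is any callable that writes a single file to the path it receives (such as `model.save` from the question).

```python
import os
import shutil
import tempfile


def save_via_local_temp(save_fn, dest_path, filename="model.h5"):
    """Write a file locally with save_fn, then move it to dest_path.

    save_fn: callable taking one argument, the local path to write to
             (assumption: it writes exactly one file at that path).
    dest_path: final location, e.g. a path on the FUSE mount.
    """
    # The temporary directory and anything left in it are removed on exit,
    # including when save_fn raises.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, filename)
        save_fn(tmp_path)
        # shutil.move falls back to copy-then-delete across filesystems,
        # so the destination only sees one sequential write.
        shutil.move(tmp_path, dest_path)
    return dest_path
```

With the model from the question, the call would look like `save_via_local_temp(model.save, "/hdfs/model.h5")`.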

So is the problem in the Keras API? It seems to be able to save a model only to a local path, not to an HDFS path.
Is there any way to fix this?

Answer 1 (lymnna71)

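I don't think TensorFlow promises to be able to save to hdfs-fuse. Your (final) error is "Can't flush cache", not "Can't decrement id ref count", and it basically means "can't save straight to hdfs-fuse". But honestly, this looks solved to me: your workaround is fine.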
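One way to check whether a mount can take this kind of write is a small probe. This is a sketch under an assumption the thread does not confirm: that the failure comes from HDF5 writing at non-sequential offsets (the log shows a failed write at offset 298264 followed by one at offset 34552), while a plain text file is written front to back. The function name `supports_overwrite_in_place` is hypothetical.

```python
import os
import uuid


def supports_overwrite_in_place(directory):
    """Return True if a file in `directory` accepts a write at an earlier offset."""
    path = os.path.join(directory, "write_probe_" + uuid.uuid4().hex)
    try:
        with open(path, "wb") as f:
            f.write(b"0" * 1024)   # sequential write, like a plain text file
            f.flush()
            f.seek(0)              # jump back to the start of the file
            f.write(b"1" * 4)      # overwrite bytes that were already written
            f.flush()
            os.fsync(f.fileno())   # force the data out to the filesystem
        return True
    except OSError:
        # Includes errno 95 ("Operation not supported") as seen in the log.
        return False
    finally:
        try:
            os.remove(path)        # clean up the probe file if it was created
        except OSError:
            pass
```

If `supports_overwrite_in_place("/hdfs")` returns False while a local directory returns True, saving locally and moving the file afterwards is the safer route.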

Answer 2 (f4t66c6m)

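In my particular case, this exact error came from a full drive. A background backup had filled the hard drive with temporary files.
On previous production systems, for long-running tasks that continuously generate files, I would check the amount of free drive space every minute. If it got low, the task raised an alert so the problem could be fixed, and if free space dropped to 1 GB, the task exited. It is better to exit with 1 GB free than with 0 GB free, which can really break everything else running on the system. It is a choice between sacrificing one task (which is doomed to exit anyway) and sacrificing the whole system.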
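The monitoring described above might look roughly like the following sketch. The function names and the 5 GiB warning threshold are hypothetical; only the 1 GB exit threshold comes from the answer.

```python
import shutil

GIB = 1024 ** 3


def classify_free_space(free_bytes, warn_bytes=5 * GIB, exit_bytes=1 * GIB):
    """Map a free-space figure to "ok", "warn", or "exit"."""
    if free_bytes <= exit_bytes:
        return "exit"   # stop the task before the drive is completely full
    if free_bytes <= warn_bytes:
        return "warn"   # raise an alert so someone can free up space
    return "ok"


def check_drive(path="/", warn_bytes=5 * GIB, exit_bytes=1 * GIB):
    """Read the free space of the drive holding `path` and classify it."""
    free_bytes = shutil.disk_usage(path).free
    return classify_free_space(free_bytes, warn_bytes, exit_bytes)
```

A long-running task could call `check_drive()` once a minute, send an alert on "warn", and call `sys.exit()` on "exit".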
