What's the best approach to extend memmapped NumPy or Dask arrays (bigger than available RAM)?

Posted by jrcvhitl on 2023-04-30 in Other

I have a NumPy array on disk that is bigger than my available RAM. I can load it as a memory map and use it without problems:

a = np.memmap(filename, mode='r', shape=shape, dtype=dtype)

Following the Dask documentation, I can load it as a Dask array in a similar way:

# Use a separate name for the result so the dask.array module alias `da` is not overwritten.
darr = da.from_array(np.memmap(filename, shape=shape, dtype=dtype, mode='r'))

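How can I add rows/columns to this array?

Ideally without creating a completely new copy.
If a completely new copy has to be created, how should I handle it, given that it will not fit in RAM?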

Something like

a2 = np.stack((a, new_a))

would load the entire array a into memory and raise an out-of-memory error.

numpy

Source: https://stackoverflow.com/questions/76097554/whats-the-best-approach-to-extend-memmaped-numpy-or-dask-arrays-bigger-than-a

1 Answer

bqucvtff1#

I have two ideas.
The first approach requires creating a completely new copy, but it sticks to basic NumPy usage.

import numpy as np

# First, create a file of the required size. This must be a new file.
out = np.memmap(out_file, mode="w+", dtype=dtype, shape=out_shape)

# Copy files here via memmap.
array1 = np.memmap(in_file1, mode="r", dtype=dtype, shape=shape1)
array2 = np.memmap(in_file2, mode="r", dtype=dtype, shape=shape2)
out[:shape1[0]] = array1
out[shape1[0]:] = array2

# Don't forget to flush.
out.flush()
del out
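The copy above can also be done one block of rows at a time, which may help keep the amount of data touched per step bounded. The following is only a minimal sketch under stated assumptions: the helper `copy_rows_in_blocks`, the `block_rows` parameter, and the small demo files are hypothetical and not part of the original answer, and the inputs are assumed to be headerless, C-ordered files with the same number of columns.

```python
import os
import tempfile

import numpy as np


def copy_rows_in_blocks(dst, src, dst_start, block_rows=1024):
    """Copy src into dst[dst_start:dst_start + len(src)], block_rows rows at a time.

    Returns the index of the first row in dst after the copied region.
    """
    n_rows = src.shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        dst[dst_start + start:dst_start + stop] = src[start:stop]
    return dst_start + n_rows


# Small demo files standing in for the real (much larger) inputs.
tmp_dir = tempfile.mkdtemp()
in_file1 = os.path.join(tmp_dir, "a.dat")
in_file2 = os.path.join(tmp_dir, "b.dat")
out_file = os.path.join(tmp_dir, "out.dat")
dtype = np.float64
shape1, shape2 = (5, 3), (4, 3)
np.arange(15, dtype=dtype).reshape(shape1).tofile(in_file1)
np.arange(15, 27, dtype=dtype).reshape(shape2).tofile(in_file2)

# The output holds the rows of both inputs stacked along the first axis.
out_shape = (shape1[0] + shape2[0], shape1[1])
out = np.memmap(out_file, mode="w+", dtype=dtype, shape=out_shape)
array1 = np.memmap(in_file1, mode="r", dtype=dtype, shape=shape1)
array2 = np.memmap(in_file2, mode="r", dtype=dtype, shape=shape2)

next_row = copy_rows_in_blocks(out, array1, 0, block_rows=2)
next_row = copy_rows_in_blocks(out, array2, next_row, block_rows=2)
out.flush()
```

For real data, `block_rows` would be chosen so that one block comfortably fits in memory.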

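The second approach is to simply copy the data as raw binary. With this approach you append to the first file, so at least the first file does not need to be copied.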

chunk_size = 8192

# The contents of file2 are appended to file1.
with open(file2, "rb") as f2, open(file1, "ab") as f1:
    while True:
        chunk = f2.read(chunk_size)
        if not chunk:
            break
        f1.write(chunk)
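After the append, the first file has to be re-opened with the larger shape. The sketch below illustrates one way to do that, assuming headerless, C-ordered files where the new data adds rows along the first axis and both files share the same dtype and number of columns. The demo files, `n_cols`, and the use of `shutil.copyfileobj` in place of the manual loop are illustrative choices, not part of the original answer.

```python
import os
import shutil
import tempfile

import numpy as np

# Small demo files standing in for the real inputs.
tmp_dir = tempfile.mkdtemp()
file1 = os.path.join(tmp_dir, "a.dat")
file2 = os.path.join(tmp_dir, "b.dat")
dtype = np.dtype(np.int32)
n_cols = 4
np.arange(12, dtype=dtype).reshape(3, n_cols).tofile(file1)
np.arange(12, 20, dtype=dtype).reshape(2, n_cols).tofile(file2)

# Append the raw bytes of file2 to file1, 1 MiB at a time.
with open(file2, "rb") as f2, open(file1, "ab") as f1:
    shutil.copyfileobj(f2, f1, 1024 * 1024)

# Derive the new row count from the file size and re-open as a memmap.
row_bytes = n_cols * dtype.itemsize
total_bytes = os.path.getsize(file1)
if total_bytes % row_bytes != 0:
    raise ValueError("file size is not a whole number of rows")
n_rows = total_bytes // row_bytes
extended = np.memmap(file1, mode="r", dtype=dtype, shape=(n_rows, n_cols))
```

Appending raw bytes like this only extends the first axis of a C-ordered array. Adding columns changes the layout of every row, so it would still require writing a new file.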

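Note that I am assuming your files do not contain headers, because you are reading them with np.memmap. If they do contain headers, the process becomes a bit more complicated.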
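If the files are in NumPy's `.npy` format, which does have a header, one possible variant of the first approach is to let NumPy handle the header through `np.load(..., mmap_mode="r")` and `numpy.lib.format.open_memmap`. This is a sketch under the assumption that the data is stored as `.npy`; the file names and demo arrays are hypothetical and not part of the original answer.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Small demo .npy files standing in for the real inputs.
tmp_dir = tempfile.mkdtemp()
npy1 = os.path.join(tmp_dir, "a.npy")
npy2 = os.path.join(tmp_dir, "b.npy")
npy_out = os.path.join(tmp_dir, "out.npy")
np.save(npy1, np.arange(6, dtype=np.float32).reshape(3, 2))
np.save(npy2, np.arange(6, 10, dtype=np.float32).reshape(2, 2))

# mmap_mode="r" reads shape and dtype from the header and maps the data lazily.
part1 = np.load(npy1, mmap_mode="r")
part2 = np.load(npy2, mmap_mode="r")

# Create a new .npy file of the combined size, with the header written for us.
out_shape = (part1.shape[0] + part2.shape[0],) + part1.shape[1:]
merged = open_memmap(npy_out, mode="w+", dtype=part1.dtype, shape=out_shape)
merged[:part1.shape[0]] = part1
merged[part1.shape[0]:] = part2
merged.flush()
del merged

# The result can be memory-mapped again without specifying shape or dtype.
result = np.load(npy_out, mmap_mode="r")
```

This still writes a full new copy, but the shape and dtype no longer need to be tracked separately from the file.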

Answered 2023-04-30
