numpy ArrayMemoryError with converting column to str

vktxenjb posted on 2023-08-05 in Other


I have a medium-sized CSV file, about 100 MB, containing 20,000 rows.
For some reason, I cannot do the following:

ubuntu@ip-172-31-42-52:~/sunshine(XXX)$ python
Python 3.10.5 (main, Oct  1 2022, 00:47:42) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv('/home/ubuntu/XXX_dataset__full.csv')
>>> df.memory_usage()
Index          128
url         160000
FOO         160000
BAR         160000
text        160000
dtype: int64
>>> docs = df['text'].values.astype(str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 256. GiB for an array with shape (20000,) and data type <U3430166
>>> quit()

Available RAM on my Ubuntu machine:

MemTotal:       129196996 kB

Why does numpy try to allocate so much memory?

numpy

Source: https://stackoverflow.com/questions/76599598/numpy-arraymemoryerror-with-converting-column-to-str

1 answer


ogq8wdun1#

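"and data type <U3430166" means that at least one row in the text column contains a string of 3430166 characters. A NumPy string array has to allocate the same amount of space for every row, at 4 bytes per character, so it would need 3430166*4 bytes for *every row* of the 20000-row array you are trying to create. That comes to about 256 GiB, or 274 GB, roughly twice your available RAM.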
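The arithmetic can be checked directly. The following is a minimal sketch: the row count and string length are taken from the traceback in the question, while the three-element demo array and all variable names are made up here purely for illustration.

```python
import numpy as np

# Figures taken from the traceback: shape (20000,), dtype <U3430166
n_rows = 20_000
max_chars = 3_430_166
bytes_per_char = 4  # NumPy's fixed-width unicode dtype uses 4 bytes per character

total_bytes = n_rows * max_chars * bytes_per_char
gib = total_bytes / 2**30   # binary gibibytes, the unit NumPy reports
gb = total_bytes / 10**9    # decimal gigabytes

print(f"{total_bytes} bytes = {gib:.1f} GiB = {gb:.1f} GB")

# Illustrative demo (hypothetical data): every element is padded to the
# width of the longest string, here 10 characters.
small = np.array(["a", "bb", "x" * 10], dtype=object).astype(str)
print(small.dtype, small.itemsize, small.nbytes)
```

In the small demo each element occupies 40 bytes even though two of the three strings are only one or two characters long, which is the same effect that inflates the 20000-row array.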
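Note that the text column of the original dataframe is not backed by a NumPy string array. It is backed by an object array holding references to ordinary Python string objects. By default, memory_usage does not report the memory consumed by those string objects.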
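To see the difference between the reported and the actual footprint, `memory_usage` accepts `deep=True`, which also counts the Python string objects. The sketch below uses a small made-up Series (not the asker's data) and forces `dtype=object` explicitly, since that is the storage the answer describes. The last line shows one possible way to get the strings without a fixed-width array; it is an assumption about what the asker needs, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the 'text' column: one short and one very long string.
s = pd.Series(["short", "x" * 1_000_000], dtype=object)

# Shallow usage counts only the array of object references.
shallow = s.memory_usage(index=False)

# Deep usage also counts the string objects those references point to.
deep = s.memory_usage(index=False, deep=True)

print(shallow, deep)

# Possible workaround (assumption): keep plain Python strings instead of
# converting to a fixed-width NumPy unicode array.
docs = s.tolist()
```

A list (or an object array) stores each string at its own length, so one very long document does not force every other row to reserve the same space.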

Answered 2023-08-05
