How can I avoid this Numpy ArrayMemoryError when using scikit-learn's DictVectorizer on my data?

Posted by mkshixfv 12 months ago in Other

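I get a numpy.core._exceptions._ArrayMemoryError when I try to use scikit-learn's DictVectorizer on my data.
I'm using Python 3.9 in PyCharm on Windows 10, and my system has 64 GB of RAM.
I'm pre-processing text data to train a Keras POS-tagger. The data starts out in this format, with a list of tokens for each sentence: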

sentences = [['Eorum', 'fines', 'Nervii', 'attingebant'], ['ait', 'enim'], ['scriptum', 'est', 'enim', 'in', 'lege', 'Mosi'], ...]

I then use the following function to extract useful features from the dataset:

def get_word_features(words, word_index):
    """Return a dictionary of important word features for an individual word in the context of its sentence"""
    word = words[word_index]
    return {
        'word': word,
        'sent_len': len(words),
        'word_len': len(word),
        'first_word': word_index == 0,
        'last_word': word_index == len(words) - 1,
        'start_letter': word[0],
        'start_letters-2': word[:2],
        'start_letters-3': word[:3],
        'end_letter': word[-1],
        'end_letters-2': word[-2:],
        'end_letters-3': word[-3:],
        'previous_word': '' if word_index == 0 else words[word_index - 1],
        'following_word': '' if word_index == len(words) - 1 else words[word_index + 1]
    }

word_dicts = list()
for sentence in sentences:
    for index, token in enumerate(sentence):
        word_dicts.append(get_word_features(sentence, index))

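The data in this format isn't very large; it seems to be only about 3.3 MB.
Next I set up the DictVectorizer, fit it to the data, and try to transform the data with it: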

from sklearn.feature_extraction import DictVectorizer

dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
X_train = dict_vectoriser.transform(word_dicts)

At this point I get this error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 499. GiB for an array with shape (334043, 200643) and data type float64

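This seems to suggest that the DictVectorizer has massively increased the size of the data, to nearly 500 GB. Is this normal? Should the output really take up this much memory, or am I doing something wrong?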
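The figure in the error message can be checked by hand. With sparse=False, DictVectorizer returns a single dense float64 array with one row per token and one column per feature. String-valued features such as 'word' or 'previous_word' are one-hot encoded, so every distinct feature=value pair becomes its own column. The following is a minimal sketch of the arithmetic, using only the shape and dtype reported in the error; the helper name dense_size_gib is hypothetical.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense array; float64 uses 8 bytes per value."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Shape taken from the error message: (334043, 200643), dtype float64.
size = dense_size_gib(334043, 200643)
print(f"{size:.1f} GiB")
```

The result is just over 499 GiB, which matches the allocation NumPy refused. So the size is expected for a dense one-hot matrix of this shape, even though almost every entry in it is zero.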
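I looked for solutions, and in this thread someone suggested allocating more virtual memory by going into the Windows settings and SystemPropertiesAdvanced, unchecking "Automatically manage paging file size for all drives", and then manually setting the paging file size to a sufficiently large amount. That would be fine if the task needed about 100 GB, but I don't have enough storage to allocate 500 GB to it.
Is there any solution to this? Or do I just need to go and buy a bigger drive so that I can have a large enough page file? That seems impractical, especially when the initial dataset isn't particularly large.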

numpy

Source: https://stackoverflow.com/questions/77741464/how-can-i-avoid-this-numpy-arraymemoryerror-when-using-scikit-learns-dictvector

1 Answer

j7dteeu81#

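I worked out a solution. In case it's useful to anyone, here it is. I was already using a data generator later in my workflow, just to feed the data to the GPU in batches.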

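import numpy as np
from tensorflow.keras.utils import Sequence  # Keras base class for batch generators

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, rounding up so the last partial batch is included
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out the idx-th batch of features and labels
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y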

Based on a comment I got here, I initially tried updating the output here to return batch_x.todense(), and changing the code above to dict_vectoriser = DictVectorizer(sparse=True).
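That first attempt can be sketched roughly as follows, on made-up toy data (toy_dicts is hypothetical and stands in for word_dicts). With sparse=True the vectoriser returns a SciPy sparse matrix, and only one slice of rows is turned into a dense array at a time. The sketch uses toarray() rather than todense() because toarray() returns a plain ndarray, whereas todense() returns a numpy.matrix.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy feature dicts standing in for word_dicts.
toy_dicts = [
    {'word': 'ait', 'word_len': 3, 'first_word': True},
    {'word': 'enim', 'word_len': 4, 'first_word': False},
    {'word': 'est', 'word_len': 3, 'first_word': False},
]

vec = DictVectorizer(sparse=True)
X_sparse = vec.fit_transform(toy_dicts)  # SciPy sparse matrix, one row per dict

# Densify only the rows needed for one batch.
batch = X_sparse[0:2].toarray()
print(X_sparse.shape, batch.shape)
```

Here the three distinct 'word' values each get their own column, next to one column each for 'word_len' and 'first_word'.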
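I have now changed the generator so that, once the dict_vectoriser has been created and fitted to the data, it is passed as an argument to the data generator and is not called to transform the data until the generator is used.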

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size, x_vec):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.x_vec = x_vec

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x_vec.transform(self.x[idx * self.batch_size:(idx + 1) * self.batch_size])
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

To call it, you need to set the batch_size and provide the labels, so below y_train is an encoded list of the labels that correspond to the x_train data.

dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
train_generator = DataGenerator(word_dicts, y_train, 200, dict_vectoriser)
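To see what the generator yields without setting up Keras, the same indexing logic can be tried on toy data. This is only a sketch under assumptions: BatchVectoriser is a hypothetical plain-Python stand-in for the Sequence subclass above, and toy_dicts and toy_labels are made up.

```python
import math

from sklearn.feature_extraction import DictVectorizer


class BatchVectoriser:
    """Plain-Python stand-in for the Keras Sequence (assumption: no Keras here)."""

    def __init__(self, x_set, y_set, batch_size, x_vec):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.x_vec = x_vec

    def __len__(self):
        # Number of batches, counting a final partial batch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        rows = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Only this batch of dicts is turned into a dense array
        return self.x_vec.transform(self.x[rows]), self.y[rows]


toy_words = ['ait', 'enim', 'scriptum', 'est', 'enim']
toy_dicts = [{'word': w, 'word_len': len(w)} for w in toy_words]
toy_labels = [0, 1, 2, 0, 1]  # hypothetical encoded tags

vec = DictVectorizer(sparse=False)
vec.fit(toy_dicts)  # fitting builds the vocabulary; no matrix is created yet

batches = BatchVectoriser(toy_dicts, toy_labels, 2, vec)
for i in range(len(batches)):
    batch_x, batch_y = batches[i]
    print(batch_x.shape, batch_y)
```

With the shape from the question, a dense batch of 200 rows by 200643 float64 columns takes 200 × 200643 × 8 bytes, which is about 306 MiB, instead of roughly 499 GiB for the whole matrix at once.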
