gensim: Ambiguous docvecs indexing, documentation missing

Posted by ezykj2lf 4 months ago in Other

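Description

The tagging scheme of the training corpus appears to cause ambiguous indexing on model.docvecs.

Steps/code/corpus to reproduce

This example is a slightly modified version of the code snippet in the main demo.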

import os
import smart_open
import gensim

# Locate the training data bundled with gensim
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')

def read_corpus(fname, offset=0):
    # Tag each document with its line number plus an optional offset
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i + offset])

def build_model(corpus):
    # Build the vocabulary and train a Doc2Vec model on the tagged corpus
    model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model

# Same documents, different tags: 0, 1, 2, ... versus 100, 101, 102, ...
corpus = list(read_corpus(lee_train_file))
offset_corpus = list(read_corpus(lee_train_file, offset=100))

model = build_model(corpus)
offset_model = build_model(offset_corpus)

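Expected results

The only difference between the two models is that their training corpora use different sets of tags.

(1) Both models are trained on the same number of documents, so I expect len(model.docvecs) == len(offset_model.docvecs).
(2) offset_model.docvecs[0] should raise an error, because 0 is not among its tags.
(3) Alternatively, if (2) does not hold, offset_model.docvecs[0] should correspond to the first document, in which case offset_model.docvecs[0] == offset_model.docvecs[100]. In particular, note that offset_model.docvecs[399] appears to return a valid result.

Actual results

None of (1), (2), or (3) holds.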

Versions

>>> import platform; print(platform.platform())
Darwin-16.7.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.14.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.4.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

Source: https://github.com/piskvorky/gensim/issues/2097

4 answers

lxkprmvk1#

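This is expected and by design. If you choose to use raw Python ints as document tags, no extra tag-to-index mapping is maintained, in order to save memory.
Instead, the integers you supply are treated as the actual indexes you are requesting for each vector, and a backing vector array large enough to hold your largest index is allocated, whether or not any indexes are missing. If you leave some indexes unused, such as 0-99 in your example, those slots are still allocated and are even initialized with random low-magnitude vectors.
If you want exactly the right number of vectors allocated, either (1) make sure all plain-int tags start at zero and increase through contiguous numbers; or (2) use only string tags, so that an extra dictionary is kept that maps the supplied tags to actual indexes.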
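
The behaviour described in this answer can be imitated with a small toy class. This is only a sketch under stated assumptions: ToyDocvecs, its attribute names, and the choice to place string-tagged slots after the int slots are hypothetical and written for illustration; it is not gensim's code and does not call gensim.

```python
import random


class ToyDocvecs:
    """Toy stand-in (hypothetical, NOT gensim code) for the tag handling described above."""

    def __init__(self, tags, vector_size=3, seed=0):
        rng = random.Random(seed)
        self.str_to_index = {}  # extra mapping, kept only for string tags
        self.max_int_tag = -1   # highest plain-int tag seen so far
        for tag in tags:
            if isinstance(tag, int):
                # Plain ints are taken as the requested index itself
                self.max_int_tag = max(self.max_int_tag, tag)
            elif tag not in self.str_to_index:
                self.str_to_index[tag] = len(self.str_to_index)
        # Allocate every slot up to the largest int tag, used or not,
        # followed by one slot per distinct string tag (a choice of this sketch)
        n_slots = self.max_int_tag + 1 + len(self.str_to_index)
        self.vectors = [
            [rng.uniform(-0.01, 0.01) for _ in range(vector_size)]
            for _ in range(n_slots)
        ]

    def __len__(self):
        return len(self.vectors)

    def __getitem__(self, tag):
        if isinstance(tag, int):
            return self.vectors[tag]
        return self.vectors[self.max_int_tag + 1 + self.str_to_index[tag]]


plain = ToyDocvecs(range(300))                                # tags 0..299
offset = ToyDocvecs(range(100, 400))                          # tags 100..399
named = ToyDocvecs("doc_%d" % i for i in range(100, 400))     # string tags

print(len(plain), len(offset), len(named))  # 300 400 300
```

In this toy, offset[0] returns one of the 100 randomly initialized but never-used slots instead of raising an error, which matches the surprise reported in the question, while the string-tagged variant allocates exactly 300 slots.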

4 months ago

6tdlim6h2#

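@zkurtz, where would you expect to see this information? Where did you first look for a solution or explanation?
I ask because the same problem comes up again and again (the int/str tag mismatch in doc2vec leaves users confused about the API), so we clearly need to improve the documentation somewhere.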

4 months ago

46qrfjad3#

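@gojomo I eventually inferred much of this (though "those slots are still allocated and are even initialized with random low-magnitude vectors" surprised me). I don't know the reasoning behind this design, but I suspect it would cause fewer problems after some simple documentation updates.
@piskvorky My original goal was simply to put the trained document vectors into a list, which I eventually convert into a pandas DataFrame to use as input features for a classifier. I imagine this is a common use case that deserves a place in the tutorial that I started with. The most "index-safe" solution I ended up with is this: [model.docvecs[c.tags[0]] for c in train_corpus]. I wonder whether there is a more idiomatic way that does not depend on the tags?
When I got confused, I actually went to look at the source code but could not identify the relevant comments. A mention in that tutorial and in the preamble of the docvecs class in the source code are the places where I would have found new documentation on this most helpful.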
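
The tag-based lookup quoted in this comment can be wrapped in a small helper. The sketch below is illustrative only: vectors_in_corpus_order and the TaggedDoc stand-in are hypothetical names, and a plain dict plays the role of model.docvecs so the snippet runs without gensim. It assumes each document's first tag identifies its vector.

```python
from collections import namedtuple


def vectors_in_corpus_order(docvecs, corpus):
    """Return one vector per document, looked up by each document's first tag.

    `docvecs` can be any object that supports lookup by tag with [].
    """
    return [docvecs[doc.tags[0]] for doc in corpus]


# Stand-ins for illustration only (not gensim objects)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])
train_corpus = [TaggedDoc(["a", "b"], [100]), TaggedDoc(["c"], [101])]
docvecs = {100: [0.1, 0.2], 101: [0.3, 0.4]}

features = vectors_in_corpus_order(docvecs, train_corpus)
print(features)  # [[0.1, 0.2], [0.3, 0.4]]
```

Because the lookup goes through the tags that were actually used in training, unused slots such as 0-99 in the offset example are never read. The resulting list of rows is the kind of input that could then be handed to pandas.DataFrame.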

4 months ago

8mmmxcuj4#

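Thanks for the report, zkurtz. We really do need to update the documentation for this case, including at least:
If you want exactly the right number of vectors allocated, either (1) make sure all plain-int tags start at zero and increase through contiguous numbers; or (2) use only string tags, so that an extra dictionary is kept that maps the supplied tags to actual indexes.
because this part is not trivial (users do not expect this behavior).
In addition, some code examples that demonstrate this behavior (including the missing indexes) need to be added.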
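
As a rough illustration of the two options listed in this answer, the sketch below prepares tags before they are handed to a corpus reader. The helper remap_to_contiguous and the "doc_%d" naming are hypothetical choices made for this example, not part of gensim.

```python
def remap_to_contiguous(tags):
    """Map arbitrary hashable tags to 0..n-1 in first-seen order (option 1)."""
    tag_to_index = {}
    for tag in tags:
        if tag not in tag_to_index:
            tag_to_index[tag] = len(tag_to_index)
    return tag_to_index


original_tags = [100, 101, 250, 101]

# Option (1): contiguous ints starting at zero, so no slot is left unused
tag_to_index = remap_to_contiguous(original_tags)
contiguous_tags = [tag_to_index[t] for t in original_tags]
print(contiguous_tags)  # [0, 1, 2, 1]

# Option (2): string tags, so the library keeps its own tag-to-index mapping
string_tags = ["doc_%d" % t for t in original_tags]
print(string_tags)  # ['doc_100', 'doc_101', 'doc_250', 'doc_101']
```

With option (1) the caller has to keep tag_to_index around to translate back to the original identifiers; with option (2) the original identifier stays readable in the tag itself, at the cost of the extra dictionary mentioned above.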

4 months ago
