Most efficient way to map a function on a numpy array using cell information and location

Posted by elcex8rz on 2023-03-23 in Other


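I have a corpus with m documents and n unique words.
Based on this corpus, I want to compute a co-occurrence matrix for the words and calculate their similarity.
To do this, I created a NumPy array occurrences (m x n) that indicates which words are present in each document.
Based on occurrences, I created cooccurrences as follows: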

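import numpy as np

# (n x m) @ (m x n) -> (n x n): entry (i, j) combines words i and j over all documents
cooccurrences = np.transpose(occurrences) @ occurrences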
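To make the shapes concrete, here is a minimal sketch on a toy corpus. The 3 x 4 array is invented for illustration and assumes occurrences holds 0/1 presence flags, which the question suggests but does not state outright.

```python
import numpy as np

# Hypothetical toy corpus: 3 documents (rows) x 4 words (columns).
# A 1 means the word appears in the document (assumed 0/1 encoding).
occurrences = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])

# Entry (i, j) counts the documents that contain both word i and word j.
cooccurrences = np.transpose(occurrences) @ occurrences
print(cooccurrences.shape)  # (4, 4), i.e. n x n
print(cooccurrences)
```

Note that the result is n x n, so the cost of any later per-cell work grows with the square of the vocabulary size.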

In addition, word_occurrences gives the total count of each word across the corpus:

word_occurrences = occurrences.sum(axis=0)
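As a small sanity check, and only under the assumption that occurrences contains 0/1 flags, the column sums should match the diagonal of the co-occurrence matrix. The toy array below is hypothetical.

```python
import numpy as np

# Hypothetical 0/1 occurrence matrix: 3 documents x 3 words.
occurrences = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

cooccurrences = np.transpose(occurrences) @ occurrences
word_occurrences = occurrences.sum(axis=0)

# For 0/1 data, a word "co-occurs with itself" once per document it appears in,
# so the diagonal equals the per-word document counts.
diagonal_matches = np.array_equal(np.diag(cooccurrences), word_occurrences)
print(word_occurrences, diagonal_matches)
```

If occurrences stored raw term counts instead of flags, this identity would generally not hold.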

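Now I want to compute similarity scores for the words in cooccurrences based on association strength.
I want to divide each cell i, j in cooccurrences by word_occurrences[i] * word_occurrences[j].
Currently, I do this by looping over cooccurrences.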

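def calculate_association_strength(cooc, i, j, word_occurrences):
    # Co-occurrence count normalized by the product of the two word totals
    return cooc / (word_occurrences[i] * word_occurrences[j])

# Work on a float copy: with an integer array the fractional scores would be truncated
cooccurrences = cooccurrences.astype(float)

for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(cooccurrences[i, j], i, j, word_occurrences)
        else:
            # Diagonal cells (a word paired with itself) are set to zero
            cooccurrences[i, j] = 0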

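However, for m > 30000 this is very time-consuming. Is there a faster way to do this?
A related discussion covers mapping a function over an np.array. However, it does not use multiple variables derived from the array.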
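One possible direction, offered as a sketch rather than a definitive answer, is to replace the double loop with broadcasting: the outer product of word_occurrences with itself gives every denominator at once. The helper name association_distance is hypothetical, and the code assumes non-negative counts so that any cell with a positive co-occurrence also has a positive denominator.

```python
import numpy as np

def association_distance(cooccurrences, word_occurrences):
    """Hypothetical helper: vectorized version of the loop in the question."""
    cooc = np.asarray(cooccurrences, dtype=float)
    totals = np.asarray(word_occurrences, dtype=float)

    # denominators[i, j] = word_occurrences[i] * word_occurrences[j]
    denominators = np.outer(totals, totals)

    # Only cells with a positive co-occurrence are divided; the rest stay 0.
    result = np.zeros_like(cooc)
    positive = cooc > 0
    np.divide(cooc, denominators, out=result, where=positive)
    result[positive] = 1.0 - result[positive]

    # The loop sets the diagonal to 0 regardless of its value.
    np.fill_diagonal(result, 0.0)
    return result

# Usage on a hypothetical toy corpus (3 documents x 4 words).
occurrences = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])
cooccurrences = occurrences.T @ occurrences
word_occurrences = occurrences.sum(axis=0)
scores = association_distance(cooccurrences, word_occurrences)
print(scores)
```

This allocates one extra n x n float array for the denominators, so memory rather than time may become the limit for a very large vocabulary.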

numpy

Source: https://stackoverflow.com/questions/70460986/most-efficient-way-to-map-function-on-numpy-array-using-cell-information-and-loc