scipy: PPMI matrix from a sparse matrix

fhg3lkii  posted on 2022-11-09  in Other

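I need to convert a scipy matrix in sparse CSR format into a PPMI-weighted matrix. I have a sparse square co-occurrence matrix in which each row and each column corresponds to a word, and each entry mat(i, j) is the number of times those two words were found together in the corpus.
Here is a minimal example of how to obtain such a matrix: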

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat is on the table",
             "I have seen a cat in the office",
             "You shall feed the cat before it gets dark",
             "I have many pets in my house, but my favourite is my cat",
             "Dogs are nice, but cats are far nicer in my opinion"]
count_model = CountVectorizer(ngram_range=(1,1))
X = count_model.fit_transform(sentences) # word-by-context matrix
Xc = (X.T * X) # word-by-word co-occurrence matrix in sparse csr format
Xc.setdiag(0)

What I need is to convert each matrix cell into the PPMI of that value:

PPMI(i, j) = max(log2[P(i, j) / (P(i) * P(j))], 0)
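To make the formula concrete, here is a small dense sketch on made-up data. The 3x3 count matrix is hypothetical, and the probabilities are normalised by the total count, which is an assumption (the function further down in the question divides by the number of nonzero cells instead).

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts (illustrative data only).
counts = np.array([[0, 2, 1],
                   [2, 0, 0],
                   [1, 0, 0]], dtype=float)

total = counts.sum()                       # assumption: normalise by the total count
p_ij = counts / total                      # joint probabilities P(i, j)
p_i = counts.sum(axis=1) / total           # row marginals P(i)
p_j = counts.sum(axis=0) / total           # column marginals P(j)

with np.errstate(divide="ignore"):         # log2(0) gives -inf for empty cells
    pmi = np.log2(p_ij / np.outer(p_i, p_j))

ppmi_dense = np.maximum(pmi, 0)            # clip negatives (and -inf) to 0
print(ppmi_dense)
```

For cell (0, 1), P(i, j) = 2/6, P(i) = 3/6 and P(j) = 2/6, so the ratio is 2 and the PPMI is log2(2) = 1. Cells with a zero count end up at 0 after clipping.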

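Right now I have a very slow function that computes the PPMI value for a single (i, j) pair, but I would like to know whether there is a more efficient solution, because this one does not scale to the whole matrix (the toy matrix I posted is 29x29, but my real matrix is 65,000x65,000).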

import numpy as np

def ppmi(matrix, idx1, idx2):
    tot = matrix.count_nonzero()
    p_a = sum(matrix[idx1, :].toarray()[0])/tot # probability of first element
    p_b = sum(matrix[idx2, :].toarray()[0])/tot # probability of second element
    p_ab = matrix[idx1, idx2]/tot # probability of co-occurrence
    ppmi = max([np.log2(p_ab/(p_a*p_b)), 0])
    return ppmi

Thank you.

2lpgd968  #1

from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.rand(100,50)
sarr = csr_matrix(arr)

Don't operate on the matrix by looping over its elements; this can be done without any element-wise loop.


# Calculate the probability vectors for p_i and p_j (inverted, so each is 1/p),
# then zero out any non-finite values caused by division by zero.
# This is easier than dividing across the sparse matrix and fixing it afterwards.
total = sarr.sum()
pr = total / sarr.sum(axis=1).A1
pc = total / sarr.sum(axis=0).A1

pr[~np.isfinite(pr)] = 0
pc[~np.isfinite(pc)] = 0

# Calculate the joint probability p_ij
sarr = sarr / total

# Calculate p_ij / (p_i * p_j)
sarr = sarr.multiply(pr[:, None]).multiply(pc[None, :])
sarr.eliminate_zeros()

# Calculate the metric: log2, then clip negative values to 0 (the "positive" in PPMI)
sarr.data = np.log2(sarr.data)
sarr.data[sarr.data < 0] = 0

There you go.
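The steps above can be collected into one function, sketched below. The name `sparse_ppmi` and the toy input are hypothetical, not part of the answer. It clips negative values to zero, as the PPMI definition in the question requires, and converts the result back to CSR, since `multiply` with a dense array may return a different sparse format.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_ppmi(counts):
    """PPMI of a sparse count matrix; only stored entries are touched."""
    counts = csr_matrix(counts, dtype=float)
    total = counts.sum()

    # Inverse marginals 1/p_i and 1/p_j; empty rows/columns give inf, reset to 0
    with np.errstate(divide="ignore"):
        inv_pr = total / counts.sum(axis=1).A1
        inv_pc = total / counts.sum(axis=0).A1
    inv_pr[~np.isfinite(inv_pr)] = 0
    inv_pc[~np.isfinite(inv_pc)] = 0

    # p_ij / (p_i * p_j), computed by broadcasting over rows and columns
    ratio = (counts / total).multiply(inv_pr[:, None]).multiply(inv_pc[None, :])
    ratio = ratio.tocsr()
    ratio.eliminate_zeros()

    ratio.data = np.log2(ratio.data)       # PMI on the stored entries
    ratio.data[ratio.data < 0] = 0         # keep only positive PMI
    ratio.eliminate_zeros()                # drop the entries that were clipped
    return ratio

# Hypothetical toy input: symmetric co-occurrence counts
toy = csr_matrix(np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]]))
print(sparse_ppmi(toy).toarray())
```

Cells that are zero in the input are never visited, so their PPMI stays at an implicit 0. That matches the definition, because log2(0) is -inf and gets clipped.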

tcomlyy6  #2

Let's try a whole-array version of ppmi:

def foo(matrix):
    tot = matrix.count_nonzero()
    p_a = matrix.sum(axis=1).A1/tot  # (n,) array
    pouter = p_a[:,None]*p_a
    p_ab = matrix/tot # probability of co-occurrence
    ppmi = np.log2(p_ab/(pouter))
    ppmi = np.maximum(ppmi, 0)
    return ppmi

Your sample matrix:

In [72]: Xc
Out[72]: 
<29x29 sparse matrix of type '<class 'numpy.int64'>'
    with 313 stored elements in Compressed Sparse Column format>

In [73]: M=foo(Xc)
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\2466548606.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = np.log2(p_ab/(pouter))

In [74]: M.shape
Out[74]: (29, 29)

By contrast, iterating over all indices:

In [75]: res = np.array([[ppmi(Xc,i,j) for j in range(29)] for i in range(29)])
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\132281081.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = max([np.log2(p_ab/(p_a*p_b)), 0])

In [76]: res.shape
Out[76]: (29, 29)

The values match (although M is an np.matrix):

In [77]: np.allclose(res,M)
Out[77]: True

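Without np.maximum, the result of foo would contain a lot of -inf values. I'm also not sure where the sparse-to-dense conversion happens, which could cause problems for larger cases.
Anyway, timings: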

In [78]: timeit res = np.array([[ppmi(Xc,i,j) for j in range(29)] for i in range(29)])
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\132281081.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = max([np.log2(p_ab/(p_a*p_b)), 0])
452 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: timeit M = foo(Xc)
C:\Users\paul\AppData\Local\Temp\ipykernel_2272\2466548606.py:6: RuntimeWarning: divide by zero encountered in log2
  ppmi = np.log2(p_ab/(pouter))
438 µs ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This needs more work, but it shows that many of the single-element calculations in ppmi can be performed on the whole array at once.
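One possible direction for that extra work is sketched below. In foo, `pouter` is a dense n x n float array, which at n = 65,000 would need roughly 65,000 x 65,000 x 8 bytes, about 34 GB. The sketch evaluates the same expression only at the stored nonzero entries, using their row and column indices. The function name `ppmi_nonzero_only` and the toy input are hypothetical, and the code assumes a symmetric matrix of non-negative counts, as in the question. It keeps the `count_nonzero` normaliser so the values stay comparable with ppmi and foo.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def ppmi_nonzero_only(matrix):
    """PPMI over stored nonzero entries only; returns a CSR matrix.

    Assumes a symmetric matrix of non-negative counts.
    """
    coo = matrix.tocoo()
    keep = coo.data != 0                   # skip explicitly stored zeros (e.g. after setdiag(0))
    row, col, data = coo.row[keep], coo.col[keep], coo.data[keep]

    tot = matrix.count_nonzero()           # same normaliser as ppmi() and foo()
    p_a = np.asarray(matrix.sum(axis=1)).ravel() / tot

    # One value per stored entry: log2(p_ab / (p_a * p_b)), clipped at 0
    vals = np.log2((data / tot) / (p_a[row] * p_a[col]))
    vals = np.maximum(vals, 0)

    out = coo_matrix((vals, (row, col)), shape=matrix.shape).tocsr()
    out.eliminate_zeros()                  # drop entries whose PMI was clipped to 0
    return out

# Hypothetical toy input: symmetric co-occurrence counts
toy = csr_matrix(np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]]))
result = ppmi_nonzero_only(toy)
print(result.toarray())
```

Memory use here grows with the number of stored entries rather than with n squared, and the result stays sparse. To normalise by the total count instead, `matrix.sum()` could be swapped in for `count_nonzero()`.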
