Python: how to work around the scipy csc_matrix "index pointer should start with 0" error in a for loop?

Posted by j9per5c4 on 2023-05-21 in Python

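We analyzed a large dataset and saved the data as a CSC sparse array in Julia. I then tried to load the array in Python using scipy's csc_matrix. However, my PC does not have enough RAM to handle the data (I got the message "Unable to allocate 93. GiB for an array with shape ...").
So I would like to process the data in small pieces inside a for loop.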

import h5py
import pandas as pd
from scipy.sparse import csc_matrix

def get_jld(filename):
    # find_orientation is the asker's own function, defined elsewhere
    orient_list=[]
    ID_num=999
    f = h5py.File(filename, 'r')
    data= f["sm"][()]

    column_ptr=f[data[2]][:]-1 ## correct indexing from julia (starts at 1)
    indices=f[data[3]][:]-1 ## correct indexing
    values =f[data[4]][:]
    for i in range(0, data[1], ID_num):
        out = pd.DataFrame(csc_matrix((values,indices,column_ptr[i:i+ID_num+1]), shape=(data[0],ID_num)).toarray())
        orient_angle=find_orientation(out)
        orient_list.append(orient_angle)
    f.close()
    return orient_list

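However, csc_matrix raises an error saying "index pointer should start with 0", so as soon as I reach the second iteration of the loop, csc_matrix stops (and the loop never covers the whole sparse array).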

ValueError                                Traceback (most recent call last)

Cell In[6], line 2
      1 filename=vid_list[0]
----> 2 test = get_jld(filename)

Cell In[3], line 13, in get_jld(filename)
     10 #indices[indices < time_start] = time_start
     11 #print([data[0],data[1]])
     12 for i in range(0, data[1], ID_num):
---> 13     out =pd.DataFrame(csc_matrix((values,indices,column_ptr[i:i+ID_num+1]), shape=(data[0],ID_num)).toarray())
     14     orient_angle=find_orientation(out)
     15     orient_list.append(orient_angle)

File ~\anaconda3\envs\qlm_analysis\lib\site-packages\scipy\sparse\_compressed.py:106, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
    103 if dtype is not None:
    104     self.data = self.data.astype(dtype, copy=False)
--> 106 self.check_format(full_check=False)

File ~\anaconda3\envs\qlm_analysis\lib\site-packages\scipy\sparse\_compressed.py:172, in _cs_matrix.check_format(self, full_check)
    169     raise ValueError("index pointer size ({}) should be ({})"
    170                      "".format(len(self.indptr), major_dim + 1))
    171 if (self.indptr[0] != 0):
--> 172     raise ValueError("index pointer should start with 0")
    174 # check index and data arrays
    175 if (len(self.indices) != len(self.data)):

ValueError: index pointer should start with 0

Does anyone have a suggestion for solving this problem?

Source: https://stackoverflow.com/questions/76290401/how-to-work-around-the-warning-of-scipy-csc-matrix-about-index-pointer-should-st

1 Answer

kuarbcqp1#

Let's make a csc matrix (10x10, with density 0.2):

In [68]: from scipy import sparse

In [69]: M = sparse.random(10,10,.2, 'csc'); M
Out[69]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Column format>

Its key attributes are:

In [71]: M.data, M.indices, M.indptr
Out[71]: 
(array([0.56697998, 0.53160297, 0.69832231, 0.54057255, 0.86921503,
        0.15293358, 0.51262743, 0.52701144, 0.52747434, 0.32865321,
        0.01983809, 0.59817796, 0.24101313, 0.17601203, 0.04915645,
        0.74543831, 0.62962088, 0.69383003, 0.45157457, 0.31947605]),
 array([9, 1, 2, 3, 7, 9, 6, 9, 6, 0, 4, 5, 6, 8, 2, 5, 9, 5, 0, 3]),
 array([ 0,  1,  6,  6,  8,  9, 14, 17, 18, 18, 20]))

From this I can see that the first column has one nonzero value, in row 9, and that it is the first element of data:

In [72]: M[9,0]
Out[72]: 0.5669799764654442

Similarly, the last data element is in column 9 and row 3 (the last value of indices):

In [73]: M[3,9]
Out[73]: 0.3194760497137821

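indptr starts at 0 and has 11 elements, one more than the number of columns. The indptr values tell us (and scipy) how to split the data and indices values into consecutive columns: column j owns the entries indptr[j]:indptr[j+1].
So if you build a matrix from the "raw" values, they have to match this pattern.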
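
To make the pattern concrete, here is a minimal sketch that walks the columns of a small hand-written CSC matrix. The matrix values and the helper name column_entries are made up for illustration; they are not from the question.

```python
import numpy as np
from scipy import sparse

# Small hand-written CSC matrix (3 rows x 4 columns); values are illustrative only.
data = np.array([10.0, 20.0, 30.0, 40.0])
indices = np.array([2, 0, 1, 2])     # row index of each stored value
indptr = np.array([0, 1, 1, 3, 4])   # column j owns entries indptr[j]:indptr[j+1]

M = sparse.csc_matrix((data, indices, indptr), shape=(3, 4))

def column_entries(M, j):
    """Return (row_indices, values) stored for column j of a CSC matrix."""
    start, stop = M.indptr[j], M.indptr[j + 1]
    return M.indices[start:stop], M.data[start:stop]

for j in range(M.shape[1]):
    rows, vals = column_entries(M, j)
    print(j, rows.tolist(), vals.tolist())
```

Column 1 has indptr[1] == indptr[2], so it holds no stored values at all.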
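It is usually easier to create a matrix from coo-style input, but if you are starting from another csc matrix, or if you know what you are doing, using this "raw" input is possible. It is faster, but also easier to get wrong.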
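
For comparison, a small sketch of the coo-style route, where you give (row, column, value) triples and let scipy work out indptr. The triples here are invented for illustration.

```python
import numpy as np
from scipy import sparse

# One (row, col, value) triple per stored element; values are illustrative only.
rows = np.array([2, 0, 1, 2])
cols = np.array([0, 2, 2, 3])
vals = np.array([10.0, 20.0, 30.0, 40.0])

# Build in COO format, then convert; scipy computes indptr during the conversion.
M = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 4)).tocsc()
print(M.indptr)
print(M.toarray())
```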
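I haven't looked at your images, but I gather that you have a large csc matrix and are trying to make smaller csc matrices from subsets of its data/indices/indptr. I haven't tried doing this myself, so I don't know exactly which adjustments you need to make to get it right.
(I usually work with the CSR format, so I had to mentally switch the description over to CSC.)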
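
One adjustment that seems to follow from the indptr pattern, offered as a sketch rather than a tested fix for the asker's file: for a block of columns, slice data and indices to the entries those columns own, and subtract the first pointer value so the sliced indptr starts at 0 again. The helper name csc_column_chunk and the chunk width step are hypothetical, and the random matrix only stands in for the real data.

```python
import numpy as np
from scipy import sparse

def csc_column_chunk(data, indices, indptr, n_rows, start, stop):
    """Build a CSC matrix holding columns start:stop of a larger CSC matrix.

    data, indices, indptr are the raw (0-based) arrays of the large matrix.
    """
    n_cols = len(indptr) - 1
    stop = min(stop, n_cols)                  # the last chunk may be narrower
    lo, hi = indptr[start], indptr[stop]      # stored entries owned by these columns
    sub_indptr = indptr[start:stop + 1] - lo  # shift so the pointer starts at 0
    return sparse.csc_matrix(
        (data[lo:hi], indices[lo:hi], sub_indptr),
        shape=(n_rows, stop - start),
    )

# Check against the full matrix on a small random example.
big = sparse.random(6, 10, density=0.3, format="csc", random_state=0)
step = 4  # hypothetical chunk width, plays the role of ID_num in the question
chunks = [
    csc_column_chunk(big.data, big.indices, big.indptr, big.shape[0], i, i + step)
    for i in range(0, big.shape[1], step)
]
```

Note that the shape uses the actual number of columns in the chunk, since the final chunk can be narrower than step. Because only data[lo:hi] and indices[lo:hi] are needed per chunk, the same slices could presumably be applied to the h5py datasets directly instead of loading them whole with [:], remembering the -1 offset for the Julia indices.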

2023-05-21
