How do I get the cumulative distribution function with NumPy?

Posted by wztqucjr on 2023-03-02 in Other


I want to create a CDF with NumPy. My code is as follows:

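import numpy as np

# width, height and data (a 2-D array of integers in [0, 4096)) are defined elsewhere
histo = np.zeros(4096, dtype=np.int32)
for x in range(0, width):
    for y in range(0, height):
        histo[data[x][y]] += 1

# running total of the histogram counts
q = 0
cdf = list()
for i in histo:
    q = q + i
    cdf.append(q)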

I am walking the array element by element, but the program takes a long time to run. Is there a built-in function that does this?

numpy

Source: https://stackoverflow.com/questions/10640759/how-to-get-the-cumulative-distribution-function-with-numpy

6 Answers


nhjlsmyf1#

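Using a histogram is one solution, but it involves binning the data. That is not necessary for plotting the CDF of empirical data. Let F(x) be the number of entries less than x; it goes up by one exactly where we see a measurement. So if we sort the samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we get the "exact" (i.e. un-binned) empirical CDF.
The following code sample demonstrates both methods: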

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size=N)

# Method 1: binned CDF from a histogram (density=True replaces the removed normed=True)
H, X1 = np.histogram(Z, bins=10, density=True)
dx = X1[1] - X1[0]
F1 = np.cumsum(H) * dx

# Method 2: exact empirical CDF from the sorted samples
X2 = np.sort(Z)
F2 = np.arange(N) / N

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

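Running it plots both estimates on the same axes: the binned histogram CDF and the un-binned empirical CDF.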


83qze16e2#

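I'm not quite sure what your code is doing, but if you have the hist and bin_edges arrays returned by numpy.histogram, you can use numpy.cumsum to generate a cumulative sum of the histogram contents. Note that a density-normalized histogram integrates to 1 rather than summing to 1, so the running sum has to be multiplied by the bin width to end at 1. In the session below the bin width is 0.9, which is why the last value is 1.11.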

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), density=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
        0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])
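For the integer-valued data in the question, the same cumulative-sum idea can be sketched without any Python loops. This is only a minimal sketch, assuming `data` is a 2-D array of non-negative integers below 4096; the random `data` array is a stand-in for illustration, not the asker's actual input.

```python
import numpy as np

# Stand-in for the question's input: 2-D array of integers in [0, 4096).
rng = np.random.default_rng(0)
data = rng.integers(0, 4096, size=(64, 48))

# Count occurrences of every value in one pass (replaces the nested loops).
histo = np.bincount(data.ravel(), minlength=4096)

# Running total of the counts; divide by the sample count for a CDF in [0, 1].
cdf_counts = np.cumsum(histo)
cdf = cdf_counts / data.size
```

Because the counts are not density-normalized here, no bin-width factor is needed and the last CDF value is exactly 1.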


p3rjfoxz3#

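Update for numpy version 1.9.0: user545424's answer does not work in 1.9.0. Use density=True and multiply the histogram by the bin widths before taking the cumulative sum: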

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist
array([ 0.1       ,  0.11111111,  0.11111111,  0.08888889,  0.08888889,
    0.15555556,  0.11111111,  0.13333333,  0.1       ,  0.11111111])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.09,  0.1 ,  0.1 ,  0.08,  0.08,  0.14,  0.1 ,  0.12,  0.09,  0.1 ])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.09,  0.19,  0.29,  0.37,  0.45,  0.59,  0.69,  0.81,  0.9 ,  1.  ])


wbrvyc0a4#

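To complement Dan's solution: if your sample contains several identical values, you can use numpy.unique. In a sorted sample, the index of each value's first occurrence equals the number of samples strictly below it: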

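import numpy as np
import matplotlib.pyplot as plt

# the sample must be sorted for return_index to give "count below" values
Z = np.sort(np.array([1,1,1,2,2,4,5,6,6,6,7,8,8]))
X, F = np.unique(Z, return_index=True)
F = F / Z.size  # normalize by the sample size, not the number of unique values

plt.plot(X, F)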


v09wglhw5#

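The existing answers either resort to a histogram or do not handle duplicate values well or correctly (they either ignore duplicates or produce a CDF with several y-values for the same x-value). Counting each unique value and taking the cumulative sum of the counts gives exactly one y-value per x: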

x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)
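As a worked illustration of the two lines above, here is a small sketch wrapped in a function. The name `ecdf` is a hypothetical helper chosen for this example, and the sample array is the one used in the earlier numpy.unique answer.

```python
import numpy as np

def ecdf(samples):
    """Return the sorted unique values and the fraction of samples <= each one."""
    values, counts = np.unique(samples, return_counts=True)
    return values, np.cumsum(counts) / counts.sum()

Z = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])
x, y = ecdf(Z)
# x holds the 7 distinct values; y rises from 3/13 at x=1 to 1.0 at x=8
```

Since an empirical CDF is a step function, drawing it with `plt.step(x, y, where="post")` is usually a closer match than a straight-line plot, though either works for a quick look.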


qvtsj1bj6#

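I'm not sure there is a ready-made answer. An exact way to do it is to define a function like this:

def _cdf(x, data):
    # count of entries in data strictly less than x (data is a NumPy array)
    return np.sum(x > data)

The comparison is vectorized, so a single evaluation is fast.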
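When the count has to be evaluated at many points, comparing against the whole array costs O(N) per query. One possible alternative, sketched here under the assumption that the data can be sorted once up front, is a binary search with numpy.searchsorted. The name `count_below` is a hypothetical helper, and the sample values are illustrative.

```python
import numpy as np

def count_below(sorted_data, x):
    """Number of samples strictly less than each query value in x."""
    # side="left" returns the insertion index before any equal entries,
    # which is exactly the count of elements smaller than the query.
    return np.searchsorted(sorted_data, x, side="left")

data = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])
sorted_data = np.sort(data)                 # sort once, reuse for every query
queries = np.array([0, 2, 6.5, 9])
counts = count_below(sorted_data, queries)  # strictly-below counts per query
fractions = counts / data.size              # normalized CDF values
```

Each query then costs O(log N) after the one-time sort, and the counts agree with the direct comparison approach.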

