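I am studying divergence measures, and I get two different numbers depending on whether I implement the calculation myself or rely on the built-in library functions. Right now I don't know whether I or the built-in functions are doing something wrong.
To keep the analysis simple, I came up with the following toy example in Python: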

import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#two arrays of "events"
p=np.array([1,2,2,3,5,4,2,3,2,3,4,2,1,1,1,2,2,3,5,4,2,3,2,3,4,2,1,1,1,2,2,3,5,4,2,3,1,1,2,3,4,2,1,1,1,2,2,2,2,1,2,2,3,5,4,2,2,1,2,2,3])
q=np.array([2,3,2,4,2,3,2,3,2,3,2,3,2,2,2,2,1,2,2,3,5,4,2,3,1,1,2,3,4,2,1,1,1,2,2,3,5,4,2,3,1,1,2,3,4,2,1,1,1,2,2,2,2,1,2,2,3,5,4])

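I was told that, since I will be comparing two PDFs and computing a divergence measure between them, the "sample space" of each should be the same. So I take all possible values of p and q and use them to compute the PDFs. Here is my simple function for computing the PDF of a data array over a sample space array: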

def create_prob_dist(data: np.array, sample_space: np.array):
  #number of all events
  sum_of_events = sample_space.size
  #get the counts of each event via pandas.crosstab()
  data_counts = pd.crosstab(index='counts', columns=data)
  
  #create probabilities for each event
  prob_dist=dict()

  for i in sample_space:
    if i in data_counts:
      prob_dist[i]=(data_counts[i]['counts'])/sum_of_events
    else: 
      prob_dist[i]=0

  return prob_dist

To compute the PDFs with this function, I perform the following steps:

#get all possible discrete events from p and q
px=np.array(list(set(p))) #we use set here to remove duplicates
qx=np.array(list(set(q))) #we use set here to remove duplicates

#create all possible discrete events of both p and q
mx=np.concatenate([px,qx]) #concatenate first
mx=np.array(list(set(mx))) #remove duplicates
mx.sort() #then sort

#create true PDFs of p and q using mx
p_pdf=create_prob_dist(p, mx)
q_pdf=create_prob_dist(q, mx)

#get the probability values only from the dictionary
p_pdf=np.array(list(p_pdf.values()))
q_pdf=np.array(list(q_pdf.values()))
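
As an aside, the shared sample space could also be built in a single call. This is a minimal sketch, assuming `np.union1d` is acceptable here (it returns the sorted, unique values found in either input); the short arrays are illustrative stand-ins, not the full data above:

```python
import numpy as np

# Illustrative stand-ins for the two event arrays (not the full data above)
p_small = np.array([1, 2, 2, 3, 5, 4])
q_small = np.array([2, 3, 2, 4, 2, 3])

# Sorted union of all distinct events seen in either array
mx_small = np.union1d(p_small, q_small)
print(mx_small)
```

This replaces the set, concatenate, and sort steps with one call, which may be easier to read for small discrete data.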

Then I can plot the PDFs, and the result matches my expectations:

plt.figure()
plt.plot(mx, q_pdf, 'g', label="Q")
plt.plot(mx, p_pdf, 'r', label="P")
plt.legend(loc="upper right")
plt.show()

[Figure: plot of the PDFs of P (red) and Q (green) over mx]
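So, once I had the PDFs and had looked at them, I had an expectation for the divergence calculation. In this case, we should expect a value closer to 0 than to 1.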

KL divergence

The equation I followed for KL divergence is:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
So I created this simple function:

def KL_divergence(P:np.array, Q: np.array):
  KL=0
  for i,x in enumerate(P):
    if ((Q[i] != 0) and (x != 0)): #avoid dividing with 0 and avoid having 0 in math.log()
        KL += x * math.log(x / Q[i])
  return KL

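Note that in my example, P and Q are already prepared (meaning they were computed over the same sample space).
I compared my calculation with the built-in functions: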

from scipy.special import kl_div,rel_entr
print("KL divergence of p and q       : {}".format(KL_divergence(p_pdf,q_pdf)))
kl_divergence=kl_div(p_pdf,q_pdf)
print("KL divergence (lib) of p and q : {}".format(sum(kl_divergence)))
print("KL divergence (lib2) of p and q: {}".format(sum(rel_entr(p_pdf, q_pdf))))

I get the following output:

KL divergence of p and q       : 0.4900499180923177
KL divergence (lib) of p and q : 0.09004991809231755
KL divergence (lib2) of p and q: 0.4900499180923177

rel_entr() gives the same value as my function, but kl_div() gives a completely different one.
What do you think? Which one is correct, and why?
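
For reference, the SciPy documentation defines the two functions differently element by element: `rel_entr(x, y)` is `x * log(x / y)`, while `kl_div(x, y)` is `x * log(x / y) - x + y`. Summed over all events, the extra terms contribute `sum(y) - sum(x)`, which vanishes only when both inputs sum to the same total (for example, when both are normalized to 1). The two library results above differ by exactly 0.4, which suggests the two input vectors do not sum to the same total. A minimal sketch with made-up vectors (`a` and `b` are illustrative and deliberately not normalized):

```python
import numpy as np
from scipy.special import kl_div, rel_entr

# Made-up, deliberately unnormalized "distributions" (sums are 1.5 and 1.2)
a = np.array([0.5, 0.6, 0.4])
b = np.array([0.3, 0.5, 0.4])

rel = rel_entr(a, b).sum()  # sum of a * log(a / b)
kl = kl_div(a, b).sum()     # sum of a * log(a / b) - a + b

# The two sums differ by sum(b) - sum(a)
print(kl - rel, b.sum() - a.sum())

# After normalizing both inputs to sum to 1, the two sums agree
a_n, b_n = a / a.sum(), b / b.sum()
print(np.isclose(kl_div(a_n, b_n).sum(), rel_entr(a_n, b_n).sum()))
```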

JS divergence

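Since JS divergence is a normalized, symmetrized version of KL divergence, I also computed it and compared it with the built-in function. I found two slightly different definitions. One does a two-way KL divergence comparison and takes the average. The other, from Wikipedia, uses the mixture distribution M and is described as

$$JSD(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M), \quad M = \frac{1}{2}(P + Q)$$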

So I have the following JS implementations (using my KL function):

#the simple JS
def JS_divergence(P:np.array, Q:np.array):
  KL_P_Q=KL_divergence(P, Q)
  KL_Q_P=KL_divergence(Q, P)

  JS=(KL_P_Q+KL_Q_P)/2
  return JS   

# Wikipedia version
def mod_JS_divergence(P:np.array, Q:np.array):
  #create M
  M=(P+Q)/2 #sum the two distributions then get average

  KL_P_Q=KL_divergence(P, M)
  KL_Q_P=KL_divergence(Q, M)

  JS=(KL_P_Q+KL_Q_P)/2
  return JS

Here is the code that produces the results, including the call to the built-in function:

from scipy.spatial.distance import jensenshannon
print("JS divergence of p and q       : {}".format(JS_divergence(p_pdf,q_pdf)))
print("mod JS divergence of p and q   : {}".format(mod_JS_divergence(p_pdf,q_pdf)))

js_divergence=jensenshannon(p_pdf,q_pdf)
print("JS divergence (lib) of p and q : {}".format(js_divergence))

Output:

JS divergence of p and q       : 0.08763662020764684
mod JS divergence of p and q   : 0.021872274274735898
JS divergence (lib) of p and q : 0.041044079757403054

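I am now more concerned about my JS divergence calculation, because neither of my functions returns the same result as the built-in one.
My question remains the same: what am I doing wrong, and how does the built-in function differ from my calculation? Any ideas?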

Thanks to Seon for the help. The issues were resolved as follows:

def create_prob_dist(data: np.array, sample_space: np.array):
  #number of all events
  sum_of_events = data.size #REPLACED sample_space.size to get normalized PDFs
  #get the counts of each event via pandas.crosstab()
  data_counts = pd.crosstab(index='counts', columns=data)
  
  #create probabilities for each event
  prob_dist=dict()

  for i in sample_space:
    if i in data_counts:
      prob_dist[i]=(data_counts[i]['counts'])/sum_of_events
    else: 
      prob_dist[i]=0

  return prob_dist

As a result, the PDF plot is now correct, with the y-axis values properly below 1.0.

[Figure: corrected plot of the PDFs of P and Q, with y-axis values below 1.0]
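
A quick way to confirm that a PDF is normalized is to check that its probabilities sum to 1. The sketch below is an alternative formulation, not the function above: it assumes `np.unique(..., return_counts=True)` for the counting, and `normalized_pdf` is a hypothetical helper name applied to illustrative data.

```python
import numpy as np

def normalized_pdf(data: np.ndarray, sample_space: np.ndarray) -> np.ndarray:
    """Relative frequency of each event in sample_space (hypothetical helper)."""
    values, counts = np.unique(data, return_counts=True)
    lookup = dict(zip(values.tolist(), counts.tolist()))
    # Events that never occur in data get probability 0
    freq = np.array([lookup.get(v, 0) for v in sample_space.tolist()], dtype=float)
    return freq / data.size  # divide by the number of observations

# Illustrative data, not the arrays from the question
data = np.array([1, 2, 2, 3, 5, 2])
space = np.array([1, 2, 3, 4, 5])
pdf = normalized_pdf(data, space)
print(pdf, pdf.sum())
```

Dividing by `data.size` rather than by the size of the sample space is what keeps the total at 1, provided every observed event appears in the sample space.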
The JS divergence function is corrected by applying np.sqrt() to the return value:

# Wikipedia version
def mod_JS_divergence(P:np.array, Q:np.array):
  #create M
  M=(P+Q)/2 #sum the two distributions then get average

  KL_P_Q=KL_divergence(P, M)
  KL_Q_P=KL_divergence(Q, M)

  JS=(KL_P_Q+KL_Q_P)/2
  return np.sqrt(JS)

The results are now fine, and they match:

from scipy.spatial.distance import jensenshannon
print("mod JS divergence of p and q   : {}".format(mod_JS_divergence(p_pdf,q_pdf)))
print("JS divergence (lib) of p and q : {}".format(jensenshannon(p_pdf,q_pdf)))

Output:

mod JS divergence of p and q   : 0.04104407975740311
JS divergence (lib) of p and q : 0.0410440797574027
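
The square root is needed because `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of the Jensen-Shannon divergence, so squaring the library result recovers the divergence. A minimal sketch with made-up normalized vectors (`u` and `v` are illustrative), using `rel_entr` with the natural logarithm, which is the library's default base:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

# Made-up distributions that each sum to 1
u = np.array([0.1, 0.4, 0.3, 0.2])
v = np.array([0.2, 0.3, 0.3, 0.2])

# JS divergence from the mixture-based definition
m = (u + v) / 2
js_div = 0.5 * rel_entr(u, m).sum() + 0.5 * rel_entr(v, m).sum()

# Library value: the JS distance, i.e. sqrt of the divergence
js_dist = jensenshannon(u, v)

print(js_div, js_dist ** 2)
```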
