numpy: calculating the intersection of two histograms as a percentage

jqjz2hbq  posted on 2023-06-23  in  Other

I have two sets of data:
Dataset 1:

-4.96600134256044 
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463

Dataset 2:

-4.96600134256044
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463

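I tried plotting them as histograms and then measuring the overlap between the histograms as a percentage of the total histogram area. I tried the method suggested in this post, but the result is greater than 1, which I don't think should be possible.
My code looks like this: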

rng = min(dataset1.min(),dataset2.min()),max(dataset1.max(),dataset2.max())
n1, bins1,_= plt.hist(dataset1,color = color1,alpha = 0.75,bins=7,weights =(np.ones_like(dataset1)/len(dataset1)),range=rng)
n1_area = sum(np.diff(bins1)*n1)
n2, bins2,_ = plt.hist(dataset2,color = color2,alpha = 0.75,bins = 7,weights =(np.ones_like(dataset2)/len(dataset2)),range=rng)
n2_area = sum(np.diff(bins2)*n2)
overlap = np.minimum(n1,n2)
overlap_area = overlap.sum()
overlap_percentage=overlap_area/(n1_area+n2_area)

Does anyone know why I'm getting a percentage greater than 1, and how to fix it so that I get the correct value?

sxissh06 #1

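It looks like you computed the true "area" of the histograms for n1 and n2 with n1_area = sum(np.diff(bins1)*n1), whereas overlap.sum() is only a sum of bin heights (weighted sample counts) with no bin width factored in. The two quantities are not comparable.
Use either bin heights for everything, i.e. overlap.sum(), or "areas" for everything, i.e. sum(np.diff(bins1)*n1). Just don't mix them.
To be clearer, the final percentage should be computed as overlap / (n1 + n2 - overlap), with n1, n2 and overlap each summed over the bins, because the total area covered by the two overlapping histograms is (n1 + n2 - overlap).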
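A minimal sketch of that advice, assuming bin heights are used for every term and both histograms share the same bin edges. The helper name `overlap_fraction` and the random sample data are hypothetical and not part of the original answer.

```python
import numpy as np

def overlap_fraction(a, b, bins=7):
    """Hypothetical helper: overlap / (n1 + n2 - overlap), using bin heights only."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared range so both histograms get identical bin edges
    rng = (min(a.min(), b.min()), max(a.max(), b.max()))
    # Same weighting as in the question: each histogram's heights sum to 1
    n1, edges = np.histogram(a, bins=bins, range=rng,
                             weights=np.ones_like(a) / len(a))
    n2, _ = np.histogram(b, bins=edges, weights=np.ones_like(b) / len(b))
    overlap = np.minimum(n1, n2).sum()
    union = n1.sum() + n2.sum() - overlap
    return overlap / union

# Illustrative data only (assumed, not the datasets from the question)
gen = np.random.default_rng(0)
x = gen.normal(0.0, 1.0, 200)
y = gen.normal(1.0, 1.0, 200)
print(overlap_fraction(x, x))  # identical inputs
print(overlap_fraction(x, y))  # shifted distribution
```

Because every term is a sum of bin heights, the ratio stays between 0 and 1, and two identical datasets give a ratio of 1.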

yacmzcpb #2

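  • Some of the code will be similar to How to plot the difference between two histograms, except that density will be used in np.histogram.
  • To calculate the overlap, the bin edges of the two histograms must be the same.
  • np.histogram
  • density: bool, optional. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
  • In this case the bin width is 0.5, so h1 and h2 need to be multiplied by 0.5.
  • Custom normally distributed samples are used, because the two datasets in the OP are exactly identical.
  • Tested in python 3.11.3, pandas 2.0.2, matplotlib 3.7.1, seaborn 0.12.2, numpy 1.24.3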
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# sample datasets
np.random.seed(2023)
dataset1 = np.random.normal(loc=9, scale=1.5, size=100)
dataset2 = np.random.normal(loc=8, scale=0.6, size=100)

# create a long form dataframe for use with seaborn
df = pd.DataFrame({'ds1': dataset1, 'ds2': dataset2}).melt()

# calculate the density hist for each dataset with specified matching bin edges
h1, be1 = np.histogram(dataset1, bins=np.arange(4, 13.1, 0.5), density=True)
h2, be2 = np.histogram(dataset2, bins=np.arange(4, 13.1, 0.5), density=True)
plt.figure(figsize=(12, 4))
ax = sns.histplot(data=df, x='value', stat='density', hue='variable', multiple='dodge', bins=np.arange(4, 13.1, 0.5))
ax.set_xticks(be2)

ax.margins(x=0)

for c in ax.containers:
    _ = ax.bar_label(c, fontsize=8)
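As a quick check of the density note above, the following sketch shows that with density=True the bin heights alone do not sum to 1, but the heights multiplied by the bin widths do. The sample data here is an assumption made for illustration only.

```python
import numpy as np

# Illustrative sample (assumed), using the same bin edges as above
gen = np.random.default_rng(2023)
data = gen.normal(loc=9, scale=1.5, size=100)
edges = np.arange(4, 13.1, 0.5)

h, be = np.histogram(data, bins=edges, density=True)
widths = np.diff(be)            # every bin is 0.5 wide

heights_sum = h.sum()           # sum of density values, not a probability
area = (h * widths).sum()       # integral of the density over the bin range

print(heights_sum, area)
```

With a constant bin width of 0.5, the heights sum to 1 / 0.5 and the area is 1, which is why h1 and h2 are multiplied by 0.5 before being treated as fractions.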

  • Calculate the overlap percentage by building a boolean mask with logical_and that marks the bins where h1 and h2 are both non-zero.
# create a mask for where each data set is non-zero
m1 = h1 != 0
m2 = h2 != 0

# use a logical and to create a combined map where both datasets are non-zero
ol = np.logical_and(m1, m2)

# calculate the overlapping density, where 0.5 is the bin width
ol_density = np.abs((h1 - h2) * 0.5)[ol]

# calculate the total overlap percent
ol_percent = ol_density.sum() * 100

ol_percent → 71.00000000000001
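  • sns.barplot plots the absolute values for the overlapping bins.
  • Compare the next plot with the previous one: it shows bars only for the bins where the two datasets overlap.
  • The bar annotations sum to ol_percent expressed as a fraction (ol_percent / 100).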
# calculate the absolute difference for each bin
y = np.abs(h1 - h2) * 0.5

# set non-overlapping bins to 0
y[~ol] = 0

plt.figure(figsize=(12, 4))
ax = sns.barplot(y=y, x=be1[:-1], width=1, ec='k', color='purple', alpha=0.75)
_ = ax.set_xticks(ticks=np.arange(0, 18, 1)-0.5, labels=be1[:-1])

ax.margins(x=0, y=0.1)

for c in ax.containers:
    _ = ax.bar_label(c, fontsize=8, padding=3)
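Note that the calculation above sums np.abs(h1 - h2) over the bins both datasets occupy. A common alternative definition of overlap, and the one the question uses, is the histogram intersection: the per-bin minimum of the two densities times the bin width. The sketch below is an assumed variant rather than part of the original answer, and it reuses the same sample setup.

```python
import numpy as np

# Same sample setup as above
np.random.seed(2023)
dataset1 = np.random.normal(loc=9, scale=1.5, size=100)
dataset2 = np.random.normal(loc=8, scale=0.6, size=100)

edges = np.arange(4, 13.1, 0.5)
h1, _ = np.histogram(dataset1, bins=edges, density=True)
h2, _ = np.histogram(dataset2, bins=edges, density=True)
width = np.diff(edges)

# Histogram intersection: area shared by both density histograms
intersection = (np.minimum(h1, h2) * width).sum()

# Total absolute difference over all bins, for comparison
abs_diff = (np.abs(h1 - h2) * width).sum()

print(f"intersection: {intersection * 100:.1f}%")
print(f"absolute difference: {abs_diff * 100:.1f}%")
```

Since each density histogram has unit area, the two quantities are related by intersection = 1 - abs_diff / 2, so a large absolute difference corresponds to a small shared area.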
