I have two sets of data:
Dataset 1:
-4.96600134256044
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463
Dataset 2:
-4.96600134256044
-4.78340374913002
-4.93136896680689
-4.80958108060998
-4.78688287192542
-4.9431452930913
-4.93676628405869
-4.87328189586985
-4.91867843591513
-4.72101863119006
-4.95749167305945
-4.79202404641664
-4.91265785779198
-4.94596580589554
-4.96595222256787
-4.7990191635208
-4.97194852291884
-4.78515347272161
-4.78340374913002
-4.8994168374135
-4.97206198058066
-4.95252689510477
-4.93963055552644
-4.95490836707013
-4.94133564424905
-4.78567577865158
-4.93963055552644
-4.93131563559386
-4.9710618452962
-4.90015209439797
-4.9665194453887
-4.93403567225855
-4.91165041153205
-4.85009602823937
-4.78340374913002
-4.77292978439906
-4.94782851444531
-4.64848347534667
-4.91165041153205
-4.82937702765807
-4.96202809430577
-4.7983814963622
-4.93198889539142
-4.97072129594592
-4.88775205449138
-4.96917754667146
-4.972240408012
-4.96062137229138
-4.84390165131993
-4.93630849353535
-4.92623245728544
-4.91859094033325
-4.89568644535618
-4.87243553740634
-4.76982873302833
-4.8953404941385
-4.94451830002783
-4.88104841757604
-4.80303414573805
-4.88705883246573
-4.96499558513462
-4.56610914869673
-4.96928985131163
-4.80780803677881
-4.9556234540787
-4.84808934356167
-4.72319662154655
-4.9575854510567
-4.96960730728536
-4.9056755790436
-4.94039653820335
-4.53920246550341
-4.97211181130125
-4.86213634700864
-4.96802952189005
-4.9717135485154
-4.82056508210921
-4.96777645971916
-4.94038569046493
-4.95173085290477
-4.83470303172871
-4.91551379314551
-4.93963055552644
-4.97211181086369
-4.807583383435
-4.97216236251657
-4.97232745985347
-4.91551379314551
-4.94522084426514
-4.89719997383376
-4.96071975048121
-4.93464863469402
-4.88775205449138
-4.91638381844513
-4.80256598250479
-4.79828215315771
-4.73688107699373
-4.88114134915641
-4.92310502488463
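I tried plotting them as histograms and then measuring the overlap between the histograms as a percentage of the total histogram area. I tried the method suggested in this post, but the result is greater than 1, which I don't think is possible.
My code looks like this: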
rng = min(dataset1.min(),dataset2.min()),max(dataset1.max(),dataset2.max())
n1, bins1,_= plt.hist(dataset1,color = color1,alpha = 0.75,bins=7,weights =(np.ones_like(dataset1)/len(dataset1)),range=rng)
n1_area = sum(np.diff(bins1)*n1)
n2, bins2,_ = plt.hist(dataset2,color = color2,alpha = 0.75,bins = 7,weights =(np.ones_like(dataset2)/len(dataset2)),range=rng)
n2_area = sum(np.diff(bins2)*n2)
overlap = np.minimum(n1,n2)
overlap_area = overlap.sum()
overlap_percentage=overlap_area/(n1_area+n2_area)
Does anyone know why I get a percentage greater than 1, and how to fix it so that I get the correct value?
2 Answers
sxissh061#
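It looks like you computed the true "area" of the histograms for n1 and n2 with n1_area = sum(np.diff(bins1)*n1), but overlap is only a sum of bin heights (sample counts). The two quantities are not comparable. Use counts for both, i.e. overlap.sum(), or use "area" for both, i.e. sum(np.diff(bins1)*n1), but don't mix them. To be clearer, the final percentage should be computed as overlap / (n1 + n2 - overlap), because the total area covered by the overlapping n1 and n2 is (n1 + n2 - overlap).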
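A minimal sketch of the formula above, using bin heights for both the intersection and the union so that the units match. The function name histogram_overlap and the synthetic sample data are assumptions made for illustration; they are not part of the original question or answer.

```python
import numpy as np

def histogram_overlap(a, b, bins=7):
    """Intersection over union of two normalized histograms on shared bins.

    Hypothetical helper: it only illustrates overlap / (n1 + n2 - overlap).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shared range so both histograms use identical bin edges.
    rng = (min(a.min(), b.min()), max(a.max(), b.max()))
    # Weights make each histogram's bin heights sum to 1.
    n1, _ = np.histogram(a, bins=bins, range=rng, weights=np.ones_like(a) / len(a))
    n2, _ = np.histogram(b, bins=bins, range=rng, weights=np.ones_like(b) / len(b))
    overlap = np.minimum(n1, n2).sum()      # height shared by both histograms
    union = n1.sum() + n2.sum() - overlap   # height covered by either histogram
    return overlap / union

# Synthetic data, assumed purely for demonstration.
gen = np.random.default_rng(0)
x = gen.normal(-4.9, 0.1, 100)
y = gen.normal(-4.8, 0.1, 100)

same = histogram_overlap(x, x)   # identical data overlap completely
mixed = histogram_overlap(x, y)  # always between 0 and 1
print(same, mixed)
```

Because all bins share one width, multiplying every term by that width (the "area" version) gives the same ratio. Dividing a sum of heights by a sum of areas does not, which is how a value above 1 can appear.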
yacmzcpb2#
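density will be used in np.histogram. From the np.histogram documentation: if density is False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
The bin width is 0.5, so h1 and h2 need to be multiplied by 0.5.
Tested in python 3.11.3, pandas 2.0.2, matplotlib 3.7.1, seaborn 0.12.2, numpy 1.24.3.
Plot the two datasets with sns.histplot to see where the bins overlap. See How to add value labels on a bar chart for a detailed explanation of .bar_label.
The bar values match h1 * 0.5 and h2 * 0.5.
Use a logical_and mask, selecting the bins where h1 and h2 are both nonzero, to compute the overlap percentage.
Plot the absolute values of the overlap area with sns.barplot.
The overlap percentage is ol_percent.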
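The steps described above can be sketched roughly as follows, with plotting left out. This is an illustration under assumptions: the synthetic data, the edges and both names, and the exact way ol_percent is computed are hypothetical, and only h1, h2, the 0.5 bin width and the use of logical_and come from the text.

```python
import numpy as np

# Synthetic data, assumed purely for demonstration.
gen = np.random.default_rng(1)
d1 = gen.normal(0.0, 1.0, 500)
d2 = gen.normal(1.0, 1.0, 500)

# Shared bin edges with a fixed width of 0.5.
bin_width = 0.5
lo = np.floor(min(d1.min(), d2.min()))
hi = np.ceil(max(d1.max(), d2.max()))
edges = np.arange(lo, hi + bin_width, bin_width)

# density=True returns density values: they integrate to 1, they do not sum to 1.
h1, _ = np.histogram(d1, bins=edges, density=True)
h2, _ = np.histogram(d2, bins=edges, density=True)

# Multiplying by the bin width turns densities into per-bin probabilities.
p1 = h1 * bin_width
p2 = h2 * bin_width

# Bins where both histograms are nonzero. The minimum is already 0 elsewhere,
# so the mask only makes the selection explicit.
both = np.logical_and(h1 != 0, h2 != 0)

# Assumed definition: shared probability mass, expressed as a percentage.
ol_percent = np.minimum(p1, p2)[both].sum() * 100
print(round(ol_percent, 1))
```

Since p1 and p2 each sum to 1, ol_percent here is the share of either distribution that falls inside the overlap, so it cannot exceed 100.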