Python中的Hosmer-Lemeshow拟合优度测试

8cdiaqws  于 2023-05-05  发布在  Python
关注(0)|答案(3)|浏览(327)

我在python中估计了一个glm。我该如何执行霍斯默-莱梅肖善良
在python中对这个模型进行拟合测试?

mitkmikd

mitkmikd1#

import matplotlib.pyplot as plt

import pandas as pd 
import numpy as np
from scipy.stats import chi2

# This could be made into a neat function of Hosmer-Lemeshow. 
def HosmerLemeshow (model,Y):
    pihat=model.predict()
    pihatcat=pd.cut(pihat, np.percentile(pihat,[0,25,50,75,100]),labels=False,include_lowest=True) #here we've chosen only 4 groups

    meanprobs =[0]*4 
    expevents =[0]*4
    obsevents =[0]*4 
    meanprobs2=[0]*4 
    expevents2=[0]*4
    obsevents2=[0]*4 

    for i in range(4):
       meanprobs[i]=np.mean(pihat[pihatcat==i])
       expevents[i]=np.sum(pihatcat==i)*np.array(meanprobs[i])
       obsevents[i]=np.sum(Y[pihatcat==i])
       meanprobs2[i]=np.mean(1-pihat[pihatcat==i])
       expevents2[i]=np.sum(pihatcat==i)*np.array(meanprobs2[i])
       obsevents2[i]=np.sum(1-Y[pihatcat==i]) 

    data1={'meanprobs':meanprobs,'meanprobs2':meanprobs2}
    data2={'expevents':expevents,'expevents2':expevents2}
    data3={'obsevents':obsevents,'obsevents2':obsevents2}
    m=pd.DataFrame(data1)
    e=pd.DataFrame(data2)
    o=pd.DataFrame(data3)
    
    # The statistic for the test, which follows, under the null hypothesis,
    # The chi-squared distribution with degrees of freedom equal to amount of groups - 2. Thus 4 - 2 = 2
    tt=sum(sum((np.array(o)-np.array(e))**2/np.array(e))) 
    pvalue=1-chi2.cdf(tt,2)

    return pd.DataFrame([[chi2.cdf(tt,2).round(2), pvalue.round(2)]],columns = ["Chi2", "p - value"])
    
HosmerLemeshow(glm_full,Y)
sf6xfgos

sf6xfgos2#

我找到了一种方法,代码不是最好的质量,但它的工作原理:

import pandas as pd
import numpy as np
from scipy.stats import chi2
pihat=model.predict()
pihatcat=pd.cut(pihat, np.percentile(pihat,[0,25,50,75,100]),labels=False,include_lowest=True) #here I've chosen only 4 groups

meanprobs =[0]*4 
expevents =[0]*4
obsevents =[0]*4 
meanprobs2=[0]*4 
expevents2=[0]*4
obsevents2=[0]*4 

for i in range(4):
   meanprobs[i]=np.mean(pihat[pihatcat==i])
   expevents[i]=np.sum(pihatcat==i)*np.array(meanprobs[i])
   obsevents[i]=np.sum(data.r[pihatcat==i])
   meanprobs2[i]=np.mean(1-pihat[pihatcat==i])
   expevents2[i]=np.sum(pihatcat==i)*np.array(meanprobs2[i])
   obsevents2[i]=np.sum(1-data.r[pihatcat==i]) 

data1={'meanprobs':meanprobs,'meanprobs2':meanprobs2}
data2={'expevents':expevents,'expevents2':expevents2}
data3={'obsevents':obsevents,'obsevents2':obsevents2}
m=pd.DataFrame(data1)
e=pd.DataFrame(data2)
o=pd.DataFrame(data3)

tt=sum(sum((np.array(o)-np.array(e))**2/np.array(e))) #the statistic for the test, which follows,under the null hypothesis, the chi-squared distribution with degrees of freedom equal to amount of groups - 2 
pvalue=1-chi2.cdf(tt,2)
pvalue
j5fpnvbx

j5fpnvbx3#

Hosmer-Lemeshow是一个仅在响应变量为二进制时应用的测试。我这样说是因为简单线性回归和泊松回归属于广义线性模型,但其中的响应变量不是二进制的。
我没有在Python中找到任何函数来应用Hosmer-Lemeshow测试,所以我将用传统的方法来做这个测试(计算测试统计量的值,然后计算p值)。
作为检验统计量,我们将使用此函数

其中:
O_{1g}是在第g个十分位组中观察到的Y = 1个事件
O_{0 g}是第g个十分位数组中观察到的Y = 0事件
E_{1g}是第g个十分位数组中的预期Y = 1事件
E_{0 g}是第g个十分位数组中的预期Y = 0事件
另外,我们应该记住,该检验统计量渐近地遵循X^2_{(G - 2)}
来源
为了做这个测试我需要这些导入

import numpy as np

import pandas as pd

from scipy.stats import chi2 #In order to be able to calculate the pvalue

让我们假设在你的代码中有一个因变量Y,它是一个Series对象,它保存了我们观察到的二进制值(0或1)。此外,假设代码中有一个变量X,它是一个 Dataframe ,包含模型的所有自变量。

#We will calculate the the predicted probabilities of success (Y = 1) for every data via the model
we have created

#Let's suppose that you have named your model log_reg

predictions = log_reg(X)

#Now we will create a dataframe with two columns. The first column will represent the predicted
#probabilies and the second column will represent the binary data we observed

hl_df = pd.DataFrame({

"P_i": predictions,

"Outcome": Y

})

为了进行Hosmer-Lemeshow检验,我们必须将包含预测概率(在我们的情况下为P_i)的变量离散化到基于样本分位数的相等大小的桶中。在Hosmer-Lemeshow测试中,我们通常创建10个十分位数,但您可以使用例如4个分位数。你决定吧
为此,我们将使用pd.qcut()函数。

hl_df["decile"] = pd.qcut(hl_df["P_i"],10)

现在,我们将计算$O_{0 g}$和$O_{1g}$

#We will calculate all the observed ones in every decile

obsevents_1 = hl_df["Outcome"].groupby(hl_df.decile).sum()

#We will find all the observed zeroes of every decile if we substract the obsevents_1 from the
#number of elements in every decile

obsevents_0 = hl_df["Outcome"].groupby(hl_df.decile).count() - obsevents_1

现在我们将计算$E_{0 g}$ $E_{1g}$
正如我们所知,变量Y是一个二进制变量,这意味着它遵循Bernoulli($p_i$)。因此,变量Y的期望值是$p_i$。此外,我们应该记住,Y的每个值都是相互独立的。因此,在第g个十分位数中Y = 1的预期结果数为p_1 + p_2 + p_3 +…其中n是属于第g个十分位数的p_i的数目。

expevents_1 = hl_df["P_i"].groupby(hl_df.decile).sum()

#We will find the expected number of events Y = 0 for every decile by substracting the
#expevents_1 from the number of elements in every decile

expevents_0 = hl_df["P_i"].groupby(hl_df.decile).count() - expevents_1

现在,我们准备计算检验统计量的值

hl = (((obsevents_0 - expevents_0)**2)/(expevents_0)).sum() + (((obsevents_1 - expevents_1)**2)/(expevents_1)).sum()

现在,我们将计算pvalue

pvalue = 1 - chi2.cdf(hl , 10 - 2)

#if you choose 4 quartiles instead of 10 deciles you should write pvalue = 1 - chi2.cdf(hl,4 - 2)

现在,根据p值和要应用的显著性水平,您可以决定是否拒绝零假设。
我想提一下,上面的代码我使用了this源代码。

相关问题