scipy: How to run a t-test on multiple pandas columns

ukxgm1gy posted 12 months ago in Other

I want to write a short piece of code (a few lines) that runs t-tests comparing products (Product) on Purchase_cost, Warranty_years, and service_cost at the same time.

# dataset 

import pandas as pd
from scipy.stats import ttest_ind

data = {'Product': ['laptop', 'printer','printer','printer','laptop','printer','laptop','laptop','printer','printer'],
        'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,175.89,124.12,113.12,143.33,375.65],
        'Warranty_years':[3,2,2,1,4,1,2,3,1,2],
        'service_cost': [5,5,10,4,7,10,4,6,12,3]
    
        }

df = pd.DataFrame(data)

print(df)

My attempt below handles Product vs. Purchase_cost. I also want to run the t-test for Product vs. Warranty_years and Product vs. service_cost.

#define samples
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']

#perform independent two sample t-test
ttest_ind(group1['Purchase_cost'], group2['Purchase_cost'])


wlp8pajw1#

ttest_ind works on 2D (ND) input. By default it tests along axis 0, so each column is tested separately:

cols = df.columns.difference(['Product'])
# or with an explicit list
# cols = ['Purchase_cost', 'Warranty_years', 'service_cost']

group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']
out = pd.DataFrame(ttest_ind(group1[cols], group2[cols]),
                   columns=cols, index=['statistic', 'pvalue'])

Alternatively, you can loop over the columns with a dictionary comprehension:

out = pd.DataFrame({c: ttest_ind(group1[c], group2[c]) for c in cols},
                    index=['statistic', 'pvalue'])


Output:

           Purchase_cost  Warranty_years  service_cost
statistic      -1.861113        3.513240     -0.919464
pvalue          0.099760        0.007924      0.384738
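
As a sanity check, here is a minimal self-contained sketch that reuses the question's data and confirms that the single 2D call and the per-column loop give the same statistics. The `.to_numpy()` conversion and the `np.allclose` comparison are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame({
    'Product': ['laptop', 'printer', 'printer', 'printer', 'laptop',
                'printer', 'laptop', 'laptop', 'printer', 'printer'],
    'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,
                      175.89, 124.12, 113.12, 143.33, 375.65],
    'Warranty_years': [3, 2, 2, 1, 4, 1, 2, 3, 1, 2],
    'service_cost': [5, 5, 10, 4, 7, 10, 4, 6, 12, 3],
})

cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
group1 = df[df['Product'] == 'laptop']
group2 = df[df['Product'] == 'printer']

# One call on 2D input: axis=0 is the default, so every column gets its own test.
res = ttest_ind(group1[cols].to_numpy(), group2[cols].to_numpy())
stat_2d = np.asarray(res.statistic)
p_2d = np.asarray(res.pvalue)

# The same tests, one column at a time.
per_column = [ttest_ind(group1[c], group2[c]) for c in cols]
stat_loop = np.array([r.statistic for r in per_column])
p_loop = np.array([r.pvalue for r in per_column])

# Both routes should agree up to floating-point noise.
same_result = np.allclose(stat_2d, stat_loop) and np.allclose(p_2d, p_loop)
print(same_result)
```

Note that both variants use the default `equal_var=True` (Student's t-test with pooled variance).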

Generalizing to more pairs

If you have more products than just laptop/printer and want to compare every pair, you can generalize like this:

from itertools import combinations

cols = df.columns.difference(['Product'])

g = df.groupby('Product')[cols]

out = pd.concat({(a,b): pd.DataFrame(ttest_ind(g.get_group(a), g.get_group(b)),
                                     columns=cols, index=['statistic', 'pvalue'])
                 for a, b in combinations(df['Product'].unique(), 2)
                }, names=['product1', 'product2'])


Example output with an extra category (phone):

                             Purchase_cost  Warranty_years  service_cost
product1 product2                                                       
laptop   printer  statistic      -1.861113        3.513240     -0.919464
                  pvalue          0.099760        0.007924      0.384738
         phone    statistic      -1.945836        2.988072      2.766417
                  pvalue          0.109251        0.030515      0.039533
printer  phone    statistic      -1.286968        0.423659      1.893370
                  pvalue          0.239026        0.684528      0.100178

  • Note: if you have many combinations, you should probably post-process the results to account for multiple testing.
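
One possible form of that post-processing is a Holm step-down adjustment of the p-values. The sketch below is only an assumption about what "accounting for multiple testing" could look like here, not the answer's own method. `holm_adjust` is a hypothetical helper written for this example (it is not a SciPy or pandas function), and the p-values are copied from the example output above.

```python
import numpy as np
import pandas as pd

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (hypothetical helper for illustration)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                      # smallest p-value first
    scaled = (m - np.arange(m)) * p[order]     # multiply by m, m-1, ..., 1
    # Enforce monotonicity and cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted          # restore the original order
    return adjusted

# Raw p-values taken from the pairwise example output (3 pairs x 3 columns).
raw = pd.Series({
    ('laptop', 'printer', 'Purchase_cost'): 0.099760,
    ('laptop', 'printer', 'Warranty_years'): 0.007924,
    ('laptop', 'printer', 'service_cost'): 0.384738,
    ('laptop', 'phone', 'Purchase_cost'): 0.109251,
    ('laptop', 'phone', 'Warranty_years'): 0.030515,
    ('laptop', 'phone', 'service_cost'): 0.039533,
    ('printer', 'phone', 'Purchase_cost'): 0.239026,
    ('printer', 'phone', 'Warranty_years'): 0.684528,
    ('printer', 'phone', 'service_cost'): 0.100178,
})

adjusted = pd.Series(holm_adjust(raw.to_numpy()), index=raw.index)
significant = adjusted < 0.05                  # 0.05 is an arbitrary example threshold
print(adjusted.round(4))
```

With these nine p-values the smallest adjusted value is about 0.071 (9 × 0.007924), so none of the comparisons stays below 0.05 after the correction.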
