Converting SAS PROC SQL to Python (pandas)

js4nwp54 posted on 2021-07-26 in Java

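I am rewriting some SAS code in Python using the pandas library.
I have the code below and I don't know how to handle it.
Could you help me? It is too complex for me to get right. I have changed the column names (to hide sensitive data).
Here is the SAS code: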

proc sql;
   create table &work_lib..opk_do_inf_4 as 
   select distinct         
            *,
            min(kat_opk) as opk_do_inf,
            count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;

Here is my attempt in pandas:

df = self.opk_do_inf_3()  # create the DataFrame using another function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplcates()
return df
kmynzznz1#

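First, using SELECT * in an aggregate GROUP BY query is not valid standard SQL. SAS may allow it, but the results can be unpredictable. Normally the non-aggregated SELECT columns should be limited to the columns in the GROUP BY clause.
That said, aggregate SQL queries can generally be translated to pandas with groupby.agg(), with WHERE (filtering before aggregation) or HAVING (filtering after aggregation) conditions handled with .loc or query.
SQL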

SELECT col1, col2, col3, 
       MIN(col1) AS min_col1,
       AVG(col2) AS mean_col2, 
       MAX(col3) AS max_col3, 
       COUNT(*)  AS count_obs

FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = min(col1)

Pandas
General

agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
                  .agg(min_col1 = ("col1", "min"),
                       mean_col2 = ("col2", "mean"),
                       max_col3 = ("col3", "max"),
                       count_obs = ("col1", "count"))
                  .query("col1 == min_col1")
           )
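
To make the WHERE versus HAVING split concrete, here is a minimal sketch on toy data. The frame contents and the column names (region, year, sales) are made up for illustration and are not from the question; the SQL being mimicked is roughly SELECT region, SUM(sales), COUNT(*) FROM mydata WHERE year = 2021 GROUP BY region HAVING COUNT(*) > 1.

```python
import pandas as pd

# Toy data; the column names are hypothetical and only serve this example.
mydata = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "year":   [2020, 2021, 2020, 2021, 2021],
    "sales":  [10, 20, 5, 7, 9],
})

# WHERE year = 2021: filter rows BEFORE aggregating.
filtered = mydata.loc[mydata["year"] == 2021]

# GROUP BY region with named aggregations.
agg_data = (filtered.groupby("region", as_index=False)
                    .agg(total_sales=("sales", "sum"),
                         count_obs=("sales", "count")))

# HAVING COUNT(*) > 1: filter groups AFTER aggregating.
result = agg_data.query("count_obs > 1").reset_index(drop=True)
print(result)
```

The row filter runs on the raw frame and the group filter runs on the aggregated frame, which mirrors the order in which SQL evaluates WHERE and HAVING.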

Specific

opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr", 
                                "nazwa_zabiegu_icd_9", "nazwa_zabiegu"], 
                                as_index=False)
                      .agg(opk_do_inf = ("kat_opk", "min"),
                           ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
                      .query("kat_opk == opk_do_inf")
               )
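
If the goal is to keep every original column, as the SELECT DISTINCT * in the question does, a possible alternative is groupby.transform, which returns group-level values aligned to the original rows. This is only a sketch: it assumes SAS remerges the summary statistics onto the detail rows before applying HAVING and DISTINCT, and it uses a small hypothetical frame with just two of the six grouping keys. It also sidesteps the alignment problem in the question's attempt, where an aggregated Series indexed by the group keys is assigned back to a frame with a different index.

```python
import pandas as pd

# Hypothetical stand-in for opk_do_inf_3; only two grouping keys to keep it short.
opk_do_inf_3 = pd.DataFrame({
    "kod_ow":  ["A", "A", "A", "B", "B"],
    "nr_ks":   [1, 1, 1, 2, 2],
    "kat_opk": [3, 1, 1, 2, 5],
})
keys = ["kod_ow", "nr_ks"]  # the real query groups by six columns

grouped = opk_do_inf_3.groupby(keys)

opk_do_inf_4 = (
    opk_do_inf_3.assign(
        # group minimum, repeated on every row of the group
        opk_do_inf=grouped["kat_opk"].transform("min"),
        # count of non-missing nr_ks per group, repeated on every row
        ilsc_opk_do_kosztu_infr=grouped["nr_ks"].transform("count"),
    )
    .query("kat_opk == opk_do_inf")   # HAVING kat_opk = opk_do_inf
    .drop_duplicates()                # SELECT DISTINCT
    .reset_index(drop=True)
)
print(opk_do_inf_4)
```

Unlike the groupby.agg() version above, this keeps columns that are not grouping keys and keeps only the rows whose kat_opk equals the group minimum.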
8gsdolmq2#

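You can run SQL queries against a DataFrame with the sqldf function from the pandasql package. For example:

from pandasql import sqldf

# pandasql runs queries on SQLite, which uses LIMIT rather than TOP
query = "SELECT * FROM df LIMIT 10"
newdf = sqldf(query, locals())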
