将PartitionBy查询上的SQL滞后转换为Pandas形

nukf8bse  于 2022-12-21  发布在  其他
关注(0)|答案(2)|浏览(126)

我有一个包含一些顾客购买数据的表。我想知道顾客在商店的进入时间和离开时间&我已经为此编写了下面的SQL查询。我如何将其转换为python panda?

SELECT MyTable.*, 
                   LAG(EventTypeID, 1, 0) 
                      OVER ( PARTITION BY ID,Name
                         ORDER BY Time) AS LastEvent , 
                   LEAD(EventTypeID, 1, 0) 
                      OVER ( PARTITION BY ID,Name
                         ORDER BY Time) AS NextEvent 
                FROM DL.dbo.DataTable MyTable

输入:

+-------------+--------+--------+-------+
| EventTypeID |   ID   |  Name  | Time  |
+-------------+--------+--------+-------+
|           1 | QWERTY | Joseph | 10.20 |
|           1 | QWERTY | Joseph | 10.25 |
+-------------+--------+--------+-------+

预期结果:

+-------------+--------+--------+-------+-----------+-----------+
| EventTypeID |   ID   |  Name  | Time  | LastEvent | NextEvent |
+-------------+--------+--------+-------+-----------+-----------+
|      1      | QWERTY | Joseph | 10.20 |         0 |         1 |
|      1      | QWERTY | Joseph | 10.25 |         1 |         0 |
+-------------+--------+--------+-------+-----------+-----------+
ifmq2ha2

ifmq2ha21#

df['LastEvent'] = df.sort_values(by=['Time'], ascending=True)\
                       .groupby(['ID','Name'])['EventTypeID'].shift(1)

df['NextEvent'] = df.sort_values(by=['Time'], ascending=True)\
                       .groupby(['ID','Name'])['EventTypeID'].shift(-1)

感谢Lev Gelman提供的指导。上面的代码就可以做到这一点!

2ul0zpep

2ul0zpep2#

我只是在一个更复杂的情况下才发现这一点。我们必须在这里使用df.join(),因为shift()方法会移动 Dataframe 的索引,而.join()会对索引起作用。出于某种原因,我还没有弄清楚,在.join()中使用'on'操作符确实会把事情搞砸。

#### Experiment - create a dataframe with two groups, shift by 1 to lag, join to get lags
obsgrp1 = ['Ring','Ring','Ring','Ring','Ring','Ring','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo']
obsgrp2 = ['Door','Door','Side','Door','Side','Side','Roof','Window','Window','Window','Roof','Roof','Roof','Window','Roof']
obsdate = ['2022-10-03','2022-10-05','2022-10-06','2022-09-30','2022-10-12','2022-10-11','2022-10-10'
    ,'2022-10-02','2022-10-06','2022-10-07','2022-10-08','2022-10-13','2022-10-14','2022-10-15','2022-10-18','2022-10-20','2022-10-21','2022-10-22','2022-10-25','2022-10-30']
obsval = [37,41,16,22,33,31,40,93,88,89,99,92,97,94,91,92,93,90,88,87]

_grp = ['obsGroup1','obsGroup2']
_dat = ['obsDate','obsVal']

df = pd.DataFrame(zip(obsgrp1,obsgrp2,obsdate,obsval), columns=['obsGroup1','obsGroup2','obsDate','obsVal'])\
    .sort_values(by=['obsGroup1','obsGroup2','obsDate'], ascending=[True,False,False])
print(df.head(15))

# Shift 
sdf = df.groupby(_grp)[_dat].shift(-1)
print(sdf.head(15))

# Check
print(df.index)
print(sdf.index)

# Join
jdf = df.join(sdf, rsuffix = '_prev', how='left')
jdf.head(15)

相关问题