python - Is there a way to get the date difference over a column

mwngjboj  posted on 2021-07-13  in  Spark

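I want to calculate, for each unique name, the time/date difference between the order status and the arrived status.
The input DataFrame looks like this: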

+------------+----+-------+------------+
| Date       | id | name  | staus      |
+------------+----+-------+------------+
| 1986/10/15 | A  | john  | order      |
| 1986/10/16 | A  | john  | dispatched |
| 1986/10/18 | A  | john  | arrived    |
| 1986/10/15 | B  | peter | order      |
| 1986/10/16 | B  | peter | dispatched |
| 1986/10/17 | B  | peter | arrived    |
| 1986/10/16 | C  | raul  | order      |
| 1986/10/17 | C  | raul  | dispatched |
| 1986/10/18 | C  | raul  | arrived    |
+------------+----+-------+------------+

The expected output dataset should look like this:

+----+-------+-----------------------------------------+
| id | name  | time_difference_from_order_to_delivered |
+----+-------+-----------------------------------------+
| A  | john  | 3days                                   |
| B  | peter | 2days                                   |
| C  | raul  | 2days                                   |
+----+-------+-----------------------------------------+
I am stuck on what logic to implement.

oknrviil1#

You can group by id and name and use conditional aggregation to compute the date difference:

import pyspark.sql.functions as F

df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)

df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
|  A| john|        3|
|  C| raul|        2|
|  B|peter|        2|
+---+-----+---------+
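As a cross-check of what the conditional aggregation computes, here is a minimal plain-Python sketch that needs no Spark session. The `rows` list is copied from the question's table, and the helper name `days_from_order_to_arrival` is made up for this illustration. It is not the PySpark implementation, only the same logic spelled out step by step.

```python
from datetime import datetime

# Sample rows copied from the question: (Date, id, name, staus).
rows = [
    ("1986/10/15", "A", "john", "order"),
    ("1986/10/16", "A", "john", "dispatched"),
    ("1986/10/18", "A", "john", "arrived"),
    ("1986/10/15", "B", "peter", "order"),
    ("1986/10/16", "B", "peter", "dispatched"),
    ("1986/10/17", "B", "peter", "arrived"),
    ("1986/10/16", "C", "raul", "order"),
    ("1986/10/17", "C", "raul", "dispatched"),
    ("1986/10/18", "C", "raul", "arrived"),
]


def days_from_order_to_arrival(rows):
    """Return {(id, name): days} from the earliest 'order' to the latest 'arrived'."""
    order_dates = {}
    arrived_dates = {}
    for date_str, id_, name, status in rows:
        day = datetime.strptime(date_str, "%Y/%m/%d").date()
        key = (id_, name)
        if status == "order":
            # Mirrors min(when(staus == 'order', Date)).
            order_dates[key] = min(day, order_dates.get(key, day))
        elif status == "arrived":
            # Mirrors max(when(staus == 'arrived', Date)).
            arrived_dates[key] = max(day, arrived_dates.get(key, day))
    # Groups missing either status are skipped in this sketch.
    return {
        key: (arrived_dates[key] - order_dates[key]).days
        for key in order_dates
        if key in arrived_dates
    }


result = days_from_order_to_arrival(rows)
print(result)
```

The values agree with the `time_diff` column shown above: 3 for A, 2 for B and 2 for C.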

You can also subtract the dates directly, which returns an interval-type column:

import pyspark.sql.functions as F

df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)

df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
|  A| john|   3 days|
|  C| raul|   2 days|
|  B|peter|   2 days|
+---+-----+---------+
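The difference between an integer day count and an interval value has a rough analogue in plain Python, sketched below with the dates for id A from the question. This is only an analogy to illustrate the two result shapes, not Spark behavior; the variable names are made up for the example.

```python
from datetime import date

order_day = date(1986, 10, 15)
arrived_day = date(1986, 10, 18)

# Subtracting two dates gives a timedelta, loosely comparable to an interval.
interval = arrived_day - order_day

# The .days attribute gives a plain integer, comparable to a day count.
whole_days = interval.days

# Build a label in the "3days" style used in the question's expected output.
label = f"{whole_days}days"

print(interval)    # 3 days, 0:00:00
print(whole_days)  # 3
print(label)       # 3days
```

If downstream code needs to do arithmetic or comparisons on the result, the integer form is usually the easier one to work with.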

rkue9o1l2#

Assuming the order date is the earliest date and the delivery date is the latest one, just use aggregation and datediff():

select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;

To be more precise, you can use conditional aggregation:

select id, name,
       datediff(max(case when status = 'arrived' then date end),
                min(case when status = 'order' then date end)
               ) as num_days
from t
group by id, name;
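The two queries give the same answer only while the order row is the earliest and the arrived row is the latest in each group. The plain-Python sketch below shows how they can diverge. The `returned` row is hypothetical and does not appear in the question's data; it is added here only to illustrate the point.

```python
from datetime import date

# One shipment's events. The 'returned' row is a hypothetical addition.
events = [
    (date(1986, 10, 15), "order"),
    (date(1986, 10, 16), "dispatched"),
    (date(1986, 10, 18), "arrived"),
    (date(1986, 10, 25), "returned"),
]

# Shortcut: latest date minus earliest date, ignoring the status.
all_dates = [day for day, _ in events]
simple_days = (max(all_dates) - min(all_dates)).days

# Conditional aggregation: only 'order' and 'arrived' rows are considered.
order_day = min(day for day, status in events if status == "order")
arrived_day = max(day for day, status in events if status == "arrived")
conditional_days = (arrived_day - order_day).days

print(simple_days)       # 10
print(conditional_days)  # 3
```

With the extra row the shortcut reports 10 days while the conditional version still reports 3, which is why the conditional form is the safer choice when other statuses may appear.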
