PySpark: Find the time difference between the two timestamps in each pair of a timestamp array column in a PySpark DataFrame

iezvtpos, posted on 2023-04-05 in Spark

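I have a PySpark DataFrame with a column named timestamp. Each value is an array of [start, end] timestamp pairs, and the outer array does not have a fixed length. For example: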

+------------------------------------------------------------------------------------------------------------------------------------+
|timestamp                                                                                                                           |
+------------------------------------------------------------------------------------------------------------------------------------+
|[[2022-01-01 12:00:00, 2022-01-02 15:30:00]]                                                                                        |
|[[2022-01-01 12:00:00, 2022-01-02 14:30:00], [2022-01-02 12:00:00, 2022-01-03 19:30:00], [2022-01-02 12:00:00, 2022-01-03 15:30:00]]|
|[[2022-01-01 12:00:00, 2022-01-02 16:30:00], [2022-01-03 12:00:00, 2022-01-04 17:30:00]]                                            |
|[]                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------+

I am trying to get the time difference (in seconds) for each inner array of timestamps.
So the output would be:

+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|timestamp                                                                                                                           |    time_diff                 |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|[[2022-01-01 12:00:00, 2022-01-02 15:30:00]]                                                                                        |[99000.0]                     |
|[[2022-01-01 12:00:00, 2022-01-02 14:30:00], [2022-01-02 12:00:00, 2022-01-03 19:30:00], [2022-01-02 12:00:00, 2022-01-03 15:30:00]]|[95400.0,113400.0,99000.0]    |
|[[2022-01-01 12:00:00, 2022-01-02 16:30:00], [2022-01-03 12:00:00, 2022-01-04 17:30:00]]                                            |[102600.0,106200.0]           |
|[]                                                                                                                                  |[]                            |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
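
As a sanity check on the arithmetic, the per-pair difference can be sketched in plain Python without Spark. This is only an illustration: the helper name `pair_diffs_seconds` and the format constant `FMT` are hypothetical, and the timestamps are assumed to be strings in the format shown in the table.

```python
from datetime import datetime

# Assumed timestamp format, matching the values shown in the table
FMT = "%Y-%m-%d %H:%M:%S"

def pair_diffs_seconds(pairs):
    """Return (end - start) in seconds for each [start, end] pair of timestamp strings."""
    return [
        (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
        for start, end in pairs
    ]

first_row = [["2022-01-01 12:00:00", "2022-01-02 15:30:00"]]
print(pair_diffs_seconds(first_row))  # [99000.0]  (1 day + 3.5 hours)
print(pair_diffs_seconds([]))         # []         (an empty array stays empty)
```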

So each value in the resulting list is the time difference for the corresponding pair.
In short, I want to do this:

+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|timestamp                                                                                                                           |    time_diff                 |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|[[t1,t2]]                                                                                                                           |[(t2-t1)]                     |
|[[t3,t4], [t5,t6], [t7,t8]]                                                                                                         |[(t4-t3),(t6-t5),(t8-t7)]     |
|[[t9,t10], [t11,t12]]                                                                                                               |[(t10-t9),(t12-t11)]          |
|[]                                                                                                                                  |[]                            |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+

Here t1, t2, ..., tn are the timestamps in the arrays.
Note: I am using Spark 3.x.
Thanks in advance.

3duebb1j #1

You can use transform to map each element of the outer array (an inner [start, end] pair) to the time difference of that pair, as follows:

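from pyspark.sql import functions as F
# Cast each timestamp to epoch seconds, then subtract (end - start) for every inner pair
result = df.withColumn("time_diff",
    F.transform(F.col("timestamp"), lambda x: x.getItem(1).cast("long") - x.getItem(0).cast("long")))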
