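I have a PySpark DataFrame with a column named timestamp.
Each value in that column is an array of [start, end] timestamp pairs, and the outer array has no fixed length. For example: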
+------------------------------------------------------------------------------------------------------------------------------------+
|timestamp |
+------------------------------------------------------------------------------------------------------------------------------------+
|[[2022-01-01 12:00:00, 2022-01-02 15:30:00]] |
|[[2022-01-01 12:00:00, 2022-01-02 14:30:00], [2022-01-02 12:00:00, 2022-01-03 19:30:00], [2022-01-02 12:00:00, 2022-01-03 15:30:00]]|
|[[2022-01-01 12:00:00, 2022-01-02 16:30:00], [2022-01-03 12:00:00, 2022-01-04 17:30:00]] |
|[] |
+------------------------------------------------------------------------------------------------------------------------------------+
I am trying to get the time difference, in seconds, for each inner array of timestamps.
So the output would be:
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|timestamp | time_diff |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|[[2022-01-01 12:00:00, 2022-01-02 15:30:00]] |[99000.0] |
|[[2022-01-01 12:00:00, 2022-01-02 14:30:00], [2022-01-02 12:00:00, 2022-01-03 19:30:00], [2022-01-02 12:00:00, 2022-01-03 15:30:00]]|[95400.0,113400.0,99000.0] |
|[[2022-01-01 12:00:00, 2022-01-02 16:30:00], [2022-01-03 12:00:00, 2022-01-04 17:30:00]] |[102600.0,106200.0] |
|[] |[] |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
So each value in the time_diff list is the time difference of the corresponding pair.
In short, I want to do this:
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|timestamp | time_diff |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
|[[t1,t2]] |[(t2-t1)] |
|[[t3,t4], [t5,t6], [t7,t8]] |[(t4-t3),(t6-t5),(t8-t7)] |
|[[t9,t10], [t11,t12]] |[(t10-t9),(t12-t11)] |
|[] |[] |
+------------------------------------------------------------------------------------------------------------------------------------+------------------------------+
Here t1, t2, ..., tn are the timestamps in the arrays.
Note: I am using Spark 3.x.
Thanks in advance.
1 Answer
You can use transform to map each element of the outer array to the time difference of its inner pair, as follows: