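I'd like help collapsing time intervals that overlap within each group. Specifically, this is what I have:

id  time_start  time_end
1   8:00        9:00
1   8:30        9:30
1   9:45        10:00
2   8:00        9:00
2   8:30        8:40

And this is what I want:

id  time_start  time_end
1   8:00        9:30
1   9:45        10:00
2   8:00        9:00

The data is very large and needs to be processed as a Spark DataFrame. Thanks for your help!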
jvlzgdj91#
You can add a grouping column, as follows:
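```python
from pyspark.sql import functions as F, Window

# Running max of time_end over all earlier rows of the same id
# (ordered by time_start, excluding the current row).
w_prev = (
    Window.partitionBy('id')
    .orderBy('time_start')
    .rowsBetween(Window.unboundedPreceding, -1)
)
# Cumulative window used to turn the overlap flags into group numbers.
w_cum = Window.partitionBy('id').orderBy('time_start')

df2 = df.withColumn(
    # Zero-pad '8:00' to '08:00' so string comparison matches time order.
    'time_start', F.lpad('time_start', 5, '0')
).withColumn(
    'time_end', F.lpad('time_end', 5, '0')
).withColumn(
    # 0 if the row overlaps an earlier interval, 1 if it starts a new one.
    # The first row of each id has a null running max, so it falls to 1.
    'overlap',
    F.when(F.max('time_end').over(w_prev) >= F.col('time_start'), 0).otherwise(1)
).withColumn(
    # The cumulative sum of the flags numbers the merged intervals.
    'group', F.sum('overlap').over(w_cum)
).groupBy('id', 'group').agg(
    F.min('time_start').alias('time_start'),
    F.max('time_end').alias('time_end')
).drop('group')

df2.show()
```

```
+---+----------+--------+
| id|time_start|time_end|
+---+----------+--------+
|  1|     08:00|   09:30|
|  1|     09:45|   10:00|
|  2|     08:00|   09:00|
+---+----------+--------+
```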
Behind the scenes, before the groupBy:
```
+---+----------+--------+-------+-----+
| id|time_start|time_end|overlap|group|
+---+----------+--------+-------+-----+
|  1|     08:00|   09:00|      1|    1|
|  1|     08:30|   09:30|      0|    1|
|  1|     09:45|   10:00|      1|    2|
|  2|     08:00|   09:00|      1|    1|
|  2|     08:30|   08:40|      0|    1|
+---+----------+--------+-------+-----+
```
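To check the logic on a small sample without a Spark session, the same idea (sort by start time, compare each start against the running maximum of earlier end times, open a new group when there is no overlap) can be sketched in plain Python. This is only an illustration of the technique, not part of the answer's code: the names `pad` and `merge_intervals` are made up here, and it is not meant as a replacement for the DataFrame version on large data.

```python
from itertools import groupby
from operator import itemgetter


def pad(t):
    """Zero-pad 'H:MM' to 'HH:MM' so string comparison matches time order."""
    return t.rjust(5, "0")


def merge_intervals(rows):
    """Merge overlapping (id, time_start, time_end) rows within each id."""
    rows = sorted((i, pad(s), pad(e)) for i, s, e in rows)
    merged = []
    for key, grp in groupby(rows, key=itemgetter(0)):
        running_max = None  # max time_end over earlier rows of this id
        for _, start, end in grp:
            if running_max is not None and running_max >= start:
                # Corresponds to overlap = 0: extend the current group.
                k, s, e = merged[-1]
                merged[-1] = (k, s, max(e, end))
            else:
                # Corresponds to overlap = 1: start a new group.
                merged.append((key, start, end))
            running_max = end if running_max is None else max(running_max, end)
    return merged


rows = [
    (1, "8:00", "9:00"),
    (1, "8:30", "9:30"),
    (1, "9:45", "10:00"),
    (2, "8:00", "9:00"),
    (2, "8:30", "8:40"),
]
result = merge_intervals(rows)
print(result)
```

As in the Spark version, the comparison uses `>=`, so intervals that merely touch (one ends at 09:00, the next starts at 09:00) are merged as well.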