I have a PySpark DataFrame with a single row:
spark_session_tbl_df.printSchema()
spark_session_tbl_df.show()
root
|-- strm: string (nullable = true)
|-- acad_career: string (nullable = true)
|-- session_code: string (nullable = true)
|-- sess_begin_dt: timestamp (nullable = true)
|-- sess_end_dt: timestamp (nullable = true)
|-- census_dt: timestamp (nullable = true)
+----+-----------+------------+-------------------+-------------------+-------------------+
|strm|acad_career|session_code| sess_begin_dt| sess_end_dt| census_dt|
+----+-----------+------------+-------------------+-------------------+-------------------+
|2228| UGRD| 1|2022-08-20 00:00:00|2022-12-03 00:00:00|2022-09-19 00:00:00|
+----+-----------+------------+-------------------+-------------------+-------------------+
I am trying to produce output like the following, where each row is a 7-day range/sequence:
+-------------+-----------+
|sess_begin_dt|sess_end_dt|
+-------------+-----------+
|2022-08-20   |2022-08-27 |
|2022-08-28   |2022-09-04 |
|2022-09-05   |2022-09-12 |
|2022-09-13   |2022-09-20 |
|2022-09-21   |2022-09-28 |
.....
|2022-11-26   |2022-12-03 |
+-------------+-----------+
I tried the approach below, but I am not sure whether it can reference the PySpark DataFrame, or whether I need a different approach to get the desired output above.
from pyspark.sql.functions import sequence, to_date, explode, col
date_range_df = spark.sql("SELECT sequence(to_date('sess_begin_dt'), to_date('sess_end_dt'), interval 7 day) as date").withColumn("date", explode(col("date")))
date_range_df.show()
1 Answer
omhiaaxx #1
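One way to work with time series is to convert the dates to timestamps, solve the problem numerically, and convert the result back to dates at the end.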
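A minimal plain-Python sketch of that idea, using the session dates from the question. The function name `weekly_ranges` and the parameters `span_days` and `gap_days` are made up for illustration. It assumes each window ends 7 days after it starts, the next window begins one day later, and the last window is clipped to `sess_end_dt`; the question does not say how a shorter final window should be handled, so the clipping is a guess.

```python
from datetime import datetime, timezone

DAY = 86_400  # seconds in one day


def weekly_ranges(begin, end, span_days=7, gap_days=1):
    """Return (start_date, end_date) pairs between begin and end.

    Hypothetical helper: span_days and gap_days are assumptions
    read off the first rows of the expected output in the question.
    """
    # Step 1: dates -> numeric epoch seconds (treated as UTC so that
    # daylight-saving changes cannot shift a boundary to another day).
    start_ts = int(begin.replace(tzinfo=timezone.utc).timestamp())
    end_ts = int(end.replace(tzinfo=timezone.utc).timestamp())

    # Step 2: solve the problem with plain integer arithmetic.
    step = (span_days + gap_days) * DAY
    ranges = []
    for lo in range(start_ts, end_ts + 1, step):
        # Clip the final window so it never runs past the session end.
        hi = min(lo + span_days * DAY, end_ts)

        # Step 3: numeric values -> dates again.
        ranges.append((
            datetime.fromtimestamp(lo, tz=timezone.utc).date(),
            datetime.fromtimestamp(hi, tz=timezone.utc).date(),
        ))
    return ranges


rows = weekly_ranges(datetime(2022, 8, 20), datetime(2022, 12, 3))
for start, stop in rows[:3]:
    print(start, stop)
```

The first windows come out as 2022-08-20 to 2022-08-27 and 2022-08-28 to 2022-09-04, as in the question. In PySpark the same arithmetic could probably be expressed with `unix_timestamp`, `sequence` and `explode` on the DataFrame columns, but that variant is not shown or tested here.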