Converting a datetime string with milliseconds in Spark 3.3.2 requires a mandatory dot

i34xakig  posted on 2023-06-24  in  Apache

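My dataset contains this datetime string: '2023061218154258', and I want to convert it to a timestamp with the code below. However, the format I expect to work, yyyyMMddHHmmssSS, does not. This code reproduces the problem: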

from pyspark.sql.functions import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
# If the config is set to CORRECTED then the conversion will return null instead of throwing an exception.

df=spark.createDataFrame(
         data=[ ("1",  "2023061218154258")
                , ("2", "20230612181542.58")]
        ,schema=["id","input_timestamp"])
df.printSchema()

# Timestamp string to TimestampType
df.withColumn("timestamp",to_timestamp("input_timestamp", format = 'yyyyMMddHHmmssSS')).show(truncate=False)
df.withColumn("timestamp",to_timestamp("input_timestamp", format = 'yyyyMMddHHmmss.SS')).show(truncate=False)

Output:

+---+-----------------+---------+
|id |input_timestamp  |timestamp|
+---+-----------------+---------+
|1  |2023061218154258 |null     |
|2  |20230612181542.58|null     |
+---+-----------------+---------+

+---+-----------------+----------------------+
|id |input_timestamp  |timestamp             |
+---+-----------------+----------------------+
|1  |2023061218154258 |null                  |
|2  |20230612181542.58|2023-06-12 18:15:42.58|
+---+-----------------+----------------------+

I tried to_timestamp with the format yyyyMMddHHmmssSS and expected it to convert the string 2023061218154258 to the timestamp 2023-06-12 18:15:42.58.


2g32fytz  #1

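The problem comes from how to_timestamp parses patterns in Spark 3. Since Spark 3.0 the function no longer follows Java SimpleDateFormat (that behaviour is only kept under spark.sql.legacy.timeParserPolicy=LEGACY); it uses Spark's own datetime patterns, which are built on java.time.DateTimeFormatter.
In your example, the format 'yyyyMMddHHmmssSS' has no separator between any of its fields, and the parser apparently cannot tell where the seconds end and the fraction begins, so parsing fails and, under the CORRECTED policy, returns null. This appears to be related to a known JDK limitation with fractional seconds in adjacent-value parsing (JDK-8031085). With a literal dot in front of the fraction ('yyyyMMddHHmmss.SS') the boundary is explicit, which is why your second format works. To work around the limitation, you can split the string manually and convert it to a timestamp with other PySpark functions.
Here is an example that extracts the individual components from the string with substring and then joins them with concat into a string that to_timestamp can parse: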

from pyspark.sql.functions import *

df.withColumn("year", substring("input_timestamp", 1, 4)) \
  .withColumn("month", substring("input_timestamp", 5, 2)) \
  .withColumn("day", substring("input_timestamp", 7, 2)) \
  .withColumn("hour", substring("input_timestamp", 9, 2)) \
  .withColumn("minute", substring("input_timestamp", 11, 2)) \
  .withColumn("second", substring("input_timestamp", 13, 2)) \
  .withColumn("subsecond", substring("input_timestamp", 15, 2)) \
  .withColumn("timestamp", concat(col("year"), lit("-"), col("month"), lit("-"), col("day"),
                                  lit(" "), col("hour"), lit(":"), col("minute"), lit(":"),
                                  col("second"), lit("."), col("subsecond"))) \
  .withColumn("timestamp", to_timestamp("timestamp")) \
  .show(truncate=False)

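This code extracts the individual components (year, month, day, hour, minute, second, subsecond) from the input_timestamp column and joins them with the appropriate separators into a timestamp string. to_timestamp is then applied to convert that string to a timestamp.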
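With default settings (ANSI mode off), the output should look like this:

+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|id |input_timestamp  |year|month|day|hour|minute|second|subsecond|timestamp             |
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|1  |2023061218154258 |2023|06   |12 |18  |15    |42    |58       |2023-06-12 18:15:42.58|
|2  |20230612181542.58|2023|06   |12 |18  |15    |42    |.5       |null                  |
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+

The conversion succeeds for row 1, and the timestamp has the expected format. Row 2 already contains a dot, so the fixed substring offsets are shifted by one character (subsecond becomes '.5') and the result is null; that variant can be parsed directly with the format 'yyyyMMddHHmmss.SS'.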
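
A more compact alternative is to normalize the string with a regular expression so that both input variants end up in the dotted layout, and then parse with the format that is known to work. The following is only a sketch: with_parsed_timestamp, insert_fraction_dot and the constants are hypothetical names introduced for illustration, and it assumes the CORRECTED parser policy from the question, so that non-matching input yields null.

```python
import re

# Java-style regex passed to Spark: 14 date-time digits, an optional dot, 2 fraction digits.
SPARK_PATTERN = r"^(\d{14})\.?(\d{2})$"
SPARK_REPLACEMENT = "$1.$2"  # Java group references, as used by regexp_replace
DOTTED_FORMAT = "yyyyMMddHHmmss.SS"  # the format that works in the question


def insert_fraction_dot(value: str) -> str:
    """Pure-Python mirror of the Spark expression, handy for quick local checks."""
    # Python uses \1 and \2 where Java uses $1 and $2; non-matching input is returned unchanged.
    return re.sub(SPARK_PATTERN, r"\1.\2", value)


def with_parsed_timestamp(df, source="input_timestamp", target="timestamp"):
    """Return df with `target` parsed from `source` (hypothetical helper)."""
    # Imported lazily so the regex above can be checked without a Spark installation.
    from pyspark.sql import functions as F

    dotted = F.regexp_replace(F.col(source), SPARK_PATTERN, SPARK_REPLACEMENT)
    return df.withColumn(target, F.to_timestamp(dotted, DOTTED_FORMAT))
```

Calling with_parsed_timestamp(df).show(truncate=False) on the DataFrame from the question should then give a non-null timestamp for both rows, because each string is rewritten to 20230612181542.58 before parsing.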

Update

from pyspark.sql.functions import *

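df.withColumn("timestamp",
    when(col("input_timestamp").contains("."), to_timestamp("input_timestamp", "yyyyMMddHHmmss.SS"))
    # No dot in the input: insert one after the 14 date-time digits, because the
    # separator-free format 'yyyyMMddHHmmssSS' returns null (see the question's output).
    .otherwise(to_timestamp(
        concat(substring("input_timestamp", 1, 14), lit("."), substring("input_timestamp", 15, 2)),
        "yyyyMMddHHmmss.SS"))
).show(truncate=False)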
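
To cross-check what the Spark expressions should return, the same strings can be parsed in plain Python. This is a minimal sketch, not part of the original answer: parse_compact_timestamp is a hypothetical helper, and it assumes the layout from the question (14 date-time digits, an optional dot, then 1 to 6 fraction digits).

```python
import re
from datetime import datetime
from typing import Optional

# Fixed-width groups avoid any ambiguity between adjacent fields.
_COMPACT = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})\.?(\d{1,6})")


def parse_compact_timestamp(value: str) -> Optional[datetime]:
    """Parse 'yyyyMMddHHmmss' plus an optional dot and fraction; None if invalid."""
    match = _COMPACT.fullmatch(value)
    if match is None:
        return None
    *parts, fraction = match.groups()
    year, month, day, hour, minute, second = map(int, parts)
    # Right-pad the fraction to microseconds: "58" means 0.58 s, i.e. 580000 microseconds.
    microsecond = int(fraction.ljust(6, "0"))
    try:
        return datetime(year, month, day, hour, minute, second, microsecond)
    except ValueError:
        return None  # out-of-range field such as month 13; mirrors Spark's null
```

Both '2023061218154258' and '20230612181542.58' produce the same datetime, 2023-06-12 18:15:42.580000. The function could also be wrapped in a Python UDF as a fallback, but the built-in column functions are generally preferable because they avoid UDF serialization overhead.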
