Converting a datetime string with milliseconds using Spark 3.3.2 requires a mandatory dot

Posted by i34xakig on 2023-06-24 in Apache


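My dataset contains this datetime string: '2023061218154258', and I want to convert it to a timestamp with the code below. However, the format I expected to work, yyyyMMddHHmmssSS, does not. This code reproduces the problem: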

from pyspark.sql.functions import *
spark.conf.set("spark.sql.legacy.timeParserPolicy","CORRECTED")
# If the config is set to CORRECTED then the conversion will return null instead of throwing an exception.

df=spark.createDataFrame(
         data=[ ("1",  "2023061218154258")
                , ("2", "20230612181542.58")]
        ,schema=["id","input_timestamp"])
df.printSchema()

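# Timestamp string to TimestampType
# Pattern with no separator before the fraction digits: both rows come back null.
df.withColumn("timestamp",to_timestamp("input_timestamp", format = 'yyyyMMddHHmmssSS')).show(truncate=False)
# Pattern with a literal dot before the fraction digits: only row 2 parses.
df.withColumn("timestamp",to_timestamp("input_timestamp", format = 'yyyyMMddHHmmss.SS')).show(truncate=False)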

Output:

+---+-----------------+---------+
|id |input_timestamp  |timestamp|
+---+-----------------+---------+
|1  |2023061218154258 |null     |
|2  |20230612181542.58|null     |
+---+-----------------+---------+

+---+-----------------+----------------------+
|id |input_timestamp  |timestamp             |
+---+-----------------+----------------------+
|1  |2023061218154258 |null                  |
|2  |20230612181542.58|2023-06-12 18:15:42.58|
+---+-----------------+----------------------+

I tried to_timestamp with the format yyyyMMddHHmmssSS and expected it to convert the string 2023061218154258 to the timestamp 2023-06-12 18:15:42.58.

apache-spark

Source: https://stackoverflow.com/questions/76506611/converting-datetime-string-with-milliseconds-using-spark-3-3-2-requires-a-mandat

1 Answer


2g32fytz1#

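The problem comes from how the to_timestamp function interprets the pattern. With spark.sql.legacy.timeParserPolicy set to CORRECTED, Spark 3.x parses with its own datetime patterns, which are built on java.time rather than on the legacy Java SimpleDateFormat parser.
As the output in your question shows, the format 'yyyyMMddHHmmssSS' returns null when the fraction digits follow the seconds directly, while the same format with a literal dot before SS parses correctly. To work around this, you can parse the string manually and build the timestamp with other PySpark functions.
Here is an example that extracts the individual components from the string, joins them with concat, and converts the result with to_timestamp: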

from pyspark.sql.functions import *

# Drop the optional dot first so both input shapes share the same character positions.
# Without this, substring(..., 15, 2) on '20230612181542.58' would return '.5'.
digits = regexp_replace(col("input_timestamp"), r"\.", "")

df.withColumn("year", substring(digits, 1, 4)) \
  .withColumn("month", substring(digits, 5, 2)) \
  .withColumn("day", substring(digits, 7, 2)) \
  .withColumn("hour", substring(digits, 9, 2)) \
  .withColumn("minute", substring(digits, 11, 2)) \
  .withColumn("second", substring(digits, 13, 2)) \
  .withColumn("subsecond", substring(digits, 15, 2)) \
  .withColumn("timestamp", concat(col("year"), lit("-"), col("month"), lit("-"), col("day"),
                                  lit(" "), col("hour"), lit(":"), col("minute"), lit(":"),
                                  col("second"), lit("."), col("subsecond"))) \
  .withColumn("timestamp", to_timestamp("timestamp")) \
  .show(truncate=False)

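This code extracts the individual components (year, month, day, hour, minute, second, subsecond) from the input_timestamp column and joins them with the appropriate separators to form a timestamp string. The to_timestamp function then converts that string to a timestamp.
The output should look like this: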

+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|id |input_timestamp  |year|month|day|hour|minute|second|subsecond|timestamp             |
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+
|1  |2023061218154258 |2023|06   |12 |18  |15    |42    |58       |2023-06-12 18:15:42.58|
|2  |20230612181542.58|2023|06   |12 |18  |15    |42    |58       |2023-06-12 18:15:42.58|
+---+-----------------+----+-----+---+----+------+------+---------+----------------------+

As you can see, the conversion succeeds and the timestamp is in the expected format.
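
A lighter variant of the same idea, sketched under assumptions: since the dotted pattern 'yyyyMMddHHmmss.SS' did parse in the question, the compact strings can be normalized by inserting the dot instead of splitting out every field. The names COMPACT, insert_fraction_dot, and add_timestamp are hypothetical, and the input is assumed to be 14 digits followed by one or two fraction digits. The pure-Python helper only illustrates the regex; the Spark helper applies the same regex with regexp_replace and should be checked against your own Spark version and parser policy.

```python
import re

# Assumed input shape: 14 digits (yyyyMMddHHmmss) followed by 1-2 fraction digits.
COMPACT = r"^(\d{14})(\d{1,2})$"


def insert_fraction_dot(raw: str) -> str:
    """Insert a dot before the fraction digits; other strings pass through unchanged."""
    return re.sub(COMPACT, r"\1.\2", raw)


def add_timestamp(df, column: str = "input_timestamp"):
    """Hypothetical helper: the same normalization as a Spark column expression."""
    # Imported lazily so the pure-Python helper above works without a Spark install.
    from pyspark.sql import functions as F

    # Java regex replacement syntax refers to capture groups as $1 and $2.
    normalized = F.regexp_replace(F.col(column), COMPACT, "$1.$2")
    return df.withColumn("timestamp", F.to_timestamp(normalized, "yyyyMMddHHmmss.SS"))


for raw in ("2023061218154258", "20230612181542.58"):
    print(raw, "->", insert_fraction_dot(raw))
# 2023061218154258 -> 20230612181542.58
# 20230612181542.58 -> 20230612181542.58
```

With the DataFrame from the question, add_timestamp(df).show(truncate=False) would be the call to try. Strings that already contain a dot do not match the regex, so they reach to_timestamp unchanged.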

Update

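from pyspark.sql.functions import *

# The separator-free pattern 'yyyyMMddHHmmssSS' returned null in the question, so rows
# without a dot get one inserted after the 14 date-time digits before parsing.
with_dot = concat(substring("input_timestamp", 1, 14), lit("."), substring("input_timestamp", 15, 2))

df.withColumn("timestamp",
    when(col("input_timestamp").contains("."), to_timestamp("input_timestamp", "yyyyMMddHHmmss.SS"))
    .otherwise(to_timestamp(with_dot, "yyyyMMddHHmmss.SS"))
).show(truncate=False)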
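
For a reference value to compare the Spark result against, the same strings can be parsed in plain Python. This is only a sketch: parse_compact is a hypothetical name, and it assumes the digits after the seconds are a decimal fraction of a second, so 58 means 0.58 s as in the expected timestamp from the question.

```python
from datetime import datetime


def parse_compact(raw: str) -> datetime:
    """Parse yyyyMMddHHmmss plus fraction digits, with or without a dot."""
    digits = raw.replace(".", "")
    # %f accepts 1 to 6 digits and pads on the right with zeros, so "58" becomes 580000 microseconds.
    return datetime.strptime(digits, "%Y%m%d%H%M%S%f")


print(parse_compact("2023061218154258"))
# 2023-06-12 18:15:42.580000
print(parse_compact("20230612181542.58"))
# 2023-06-12 18:15:42.580000
```

The function could also be wrapped in a Python UDF, though the built-in column functions above avoid the per-row Python overhead that a UDF adds.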

Answered on 2023-06-24
