Spark: wrong timestamp parsing

4zcjmb1e posted on 2021-05-27 in Spark

I am creating the following DataFrame:

syncs.select($"event.timestamp",to_date($"event.timestamp".cast(TimestampType))).show

which contains rows like the following:

+-------------+---------------------------------------------+
|    timestamp|to_date(CAST(`event.timestamp` AS TIMESTAMP))|
+-------------+---------------------------------------------+
|1589509800768|                                  52339-07-25|
|1589509802730|                                  52339-07-25|
|1589509809092|                                  52339-07-25|
|1589509810402|                                  52339-07-25|
|1589509812112|                                  52339-07-25|
|1589509817489|                                  52339-07-25|
|1589509818065|                                  52339-07-25|
|1589509818902|                                  52339-07-25|
|1589509819020|                                  52339-07-25|
|1589509819425|                                  52339-07-25|
|1589509819830|                                  52339-07-25|
+-------------+---------------------------------------------+

Based on this, 1589509800768 should be Friday, May 15, 2020 02:30:00 (UTC).
I don't understand why I am getting these far-future dates. Does converting a timestamp to a date also require some kind of date format?


ql3eal8s1#

First, convert the milliseconds to seconds, and then cast to a timestamp or date:

import org.apache.spark.sql.SparkSession

object ToTimestamp extends App {
  val spark = SparkSession
    .builder()
    .appName("ToTimestamp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
    .config("spark.app.id","ToTimestamp") // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  import org.apache.spark.sql.functions._
  import spark.implicits._

  val data = sc.parallelize(List(1589509800768L,1589509802730L,1589509809092L,1589509810402L)).toDF("millis")

  val toTimestamp = data.withColumn("timestamp", from_unixtime(col("millis") / 1000))
  toTimestamp.show(truncate = false)
/*
+-------------+-------------------+
|millis       |timestamp          |
+-------------+-------------------+
|1589509800768|2020-05-15 04:30:00|
|1589509802730|2020-05-15 04:30:02|
|1589509809092|2020-05-15 04:30:09|
|1589509810402|2020-05-15 04:30:10|
+-------------+-------------------+

*/

  val toDate = toTimestamp.selectExpr("millis", "timestamp").withColumn("date", to_date(col("timestamp")))
  toDate.show(truncate = false)

/*
+-------------+-------------------+----------+
|millis       |timestamp          |date      |
+-------------+-------------------+----------+
|1589509800768|2020-05-15 04:30:00|2020-05-15|
|1589509802730|2020-05-15 04:30:02|2020-05-15|
|1589509809092|2020-05-15 04:30:09|2020-05-15|
|1589509810402|2020-05-15 04:30:10|2020-05-15|
+-------------+-------------------+----------+

*/

}
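
Note that from_unixtime formats the epoch seconds in the session time zone, which is why the output above shows 04:30 rather than the 02:30 UTC the asker expected. If you want UTC rendering, one option (a sketch, not part of the original answer) is to pin the session time zone before formatting:

// spark.sql.session.timeZone is a standard Spark SQL config; setting it to UTC
// makes from_unixtime and show() render timestamps in UTC.
spark.conf.set("spark.sql.session.timeZone", "UTC")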

lrpiutwd2#

Spark expects epoch time in seconds, not milliseconds, so divide the value by 1000.
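
As a quick sanity check (my own arithmetic, not from the original answer), treating the millisecond value as seconds lands roughly 50,000 years in the future, which matches the bogus dates in the question:

// Interpreting 1589509800768 as *seconds* since the epoch;
// one Gregorian year averages 31,556,952 seconds.
val years = 1589509800768L / 31556952L  // 50369
println(1970 + years)                   // 52339 -- the year in the question's output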

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType

scala> val values = List(1589509800768L)
values: List[Long] = List(1589509800768)

scala> val df = values.toDF()
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.show(false)
+-------------+
|value        |
+-------------+
|1589509800768|
+-------------+

scala> df.select((col("value") / 1000 ).cast(TimestampType).as("current_time")).show(false)
+-----------------------+
|current_time           |
+-----------------------+
|2020-05-14 19:30:00.768|
+-----------------------+

scala> df.select((col("value") / 1000 ).cast(TimestampType).as("current_time")).withColumn("time_utc",
     |   expr("""to_utc_timestamp(current_time, "PST")""")
     | ).show(false)
+-----------------------+-----------------------+
|current_time           |time_utc               |
+-----------------------+-----------------------+
|2020-05-14 19:30:00.768|2020-05-15 02:30:00.768|
+-----------------------+-----------------------+
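
As a side note (my addition, not from the original answer): on Spark 3.1 and later, the built-in timestamp_millis SQL function accepts epoch milliseconds directly, so the manual division can be skipped:

// Spark 3.1+ only: timestamp_millis takes epoch milliseconds directly
// (sketch, reusing the df defined above).
df.selectExpr("timestamp_millis(value) AS current_time").show(false)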
