How can I reliably get correct timestamps as far back as 0001-01-01 from Hive 3 and Spark 3?

Posted by 3gtaxfhh on 2023-02-19 in Hive

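I have a very basic CSV of New Year dates, running from 1970-01-01 00:00:00 back to 0000-01-01 00:00:00, which I have exposed to Hive as the external table test.ny(dt string).
When I create a Parquet table from it in Hive 2: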

create table test.ny2 stored as parquet
as
select 
dt, 
unix_timestamp(dt||' 00:00:00') dt2,
cast(dt as timestamp) dt3
from test.ny --this is my csv

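If I set spark.sql.legacy.parquet.int96RebaseModeInRead=LEGACY, I can read it through spark-sql, and every dt3 value comes back correctly as YYYY-01-01 00:00:00.
However, when I read the same table through Hive 3, the dt3 values go wrong starting at 1900: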

dt                  dt2         dt3
1901-01-01 00:00:00 -2177461817 1901-01-01 00:00:00.000
1900-01-01 00:00:00 -2208999600 1899-12-31 23:30:17.000

This can be explained by Hive 2 applying the tzdb incorrectly, and there is another discrepancy at the end of the range:

dt                  dt2             dt3
0003-01-01 00:00:00 -62072708400    0002-12-29 23:30:17.000
0002-01-01 00:00:00 -62104244400    0001-12-29 23:30:17.000
0001-01-01 00:00:00 -62135780400    0001-12-29 23:30:17.000
0000-01-01 00:00:00 -62167402800    0002-12-29 23:30:17.000
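
A quick arithmetic check of the 1900 and 1901 dt2 values above, as a minimal Python sketch. It only compares the reported epochs with UTC midnight on the proleptic Gregorian calendar. The interpretation in the comments (a +02:30:17 local-mean-time offset versus a fixed +03:00 offset, which would fit a zone such as Europe/Moscow) is an assumption, since the session time zone is not stated in this post.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def utc_midnight_epoch(year: int) -> int:
    """Epoch seconds of YEAR-01-01 00:00:00 UTC (proleptic Gregorian)."""
    return int((datetime(year, 1, 1) - EPOCH).total_seconds())


# dt2 values copied from the Hive 2 table as read through Hive 3.
hive2_dt2 = {1901: -2177461817, 1900: -2208999600}

# Zone offset implied by each dt2: how far local midnight is ahead of UTC.
offsets = {year: utc_midnight_epoch(year) - dt2 for year, dt2 in hive2_dt2.items()}

for year, seconds in offsets.items():
    # Assumed reading: 9017 s is a local-mean-time offset, 10800 s a fixed +03:00.
    print(year, seconds, timedelta(seconds=seconds))

# The gap between the two offsets is how far the 1900 dt3 lands before midnight
# (1899-12-31 23:30:17 is 29 min 43 s short of 1900-01-01 00:00:00).
gap = timedelta(seconds=offsets[1900] - offsets[1901])
print(gap)  # 0:29:43
```

The same 23:30:17 time of day appears in every dt3 row for years 0000 to 0003, which suggests the same pair of offsets is involved there as well.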

That is not all. When I recreate the same table from scratch in Hive 3.1.3:

create table test.ny3 stored as parquet
as
select 
dt, 
unix_timestamp(dt||' 00:00:00') dt2,
cast(dt as timestamp) dt3
from test.ny --this is my csv

When I select from it in Hive, I get a second error!

dt                  dt2             dt3
0003-01-01 00:00:00 -62072697600    0003-01-01 00:00:00.000
0002-01-01 00:00:00 -62104233600    0002-01-01 00:00:00.000
0001-01-01 00:00:00 -62135769600    0002-01-01 00:00:00.000
0000-01-01 00:00:00 -62167392000    0002-01-01 00:00:00.000
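
The dt2 column can be checked with simple arithmetic. The sketch below compares the Hive 3 values above with proleptic Gregorian epochs computed by Python's datetime (which cannot represent year 0000, so that row is only used for a year-length check), and with the Hive 2 values from earlier in this post. Reading the constant two-day gap as a Julian versus Gregorian calendar difference is an assumption, not something confirmed here.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def gregorian_epoch(year: int) -> int:
    """Epoch seconds of YEAR-01-01 00:00:00 UTC (proleptic Gregorian)."""
    return int((datetime(year, 1, 1) - EPOCH).total_seconds())


# dt2 values copied from the two tables in this post.
hive3_dt2 = {3: -62072697600, 2: -62104233600, 1: -62135769600}
hive2_dt2 = {3: -62072708400, 2: -62104244400, 1: -62135780400}

# Gap between the proleptic Gregorian epoch and what Hive 3 reports.
calendar_gap = {y: gregorian_epoch(y) - hive3_dt2[y] for y in hive3_dt2}
# Gap between what Hive 3 and Hive 2 report for the same string.
zone_gap = {y: hive3_dt2[y] - hive2_dt2[y] for y in hive3_dt2}

for y in sorted(hive3_dt2, reverse=True):
    print(y, calendar_gap[y] / 86400, "days;", zone_gap[y] / 3600, "hours")

# Length of year 0000 implied by the Hive 3 dt2 values for 0000 and 0001.
year0_days = (hive3_dt2[1] - (-62167392000)) // 86400
print(year0_days)
```

Every row shows a gap of exactly two days against the proleptic Gregorian value and exactly three hours against Hive 2, and year 0000 comes out 366 days long, so dt2 itself looks internally consistent even where dt3 does not.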

Nor can I select the data I want through spark-sql, whichever rebase mode I use. With LEGACY (which is understandable):

dt                       dt2             dt3
0003-01-01 00:00:00      -62072697600    0003-01-03 00:29:43
0002-01-01 00:00:00      -62104233600    0002-01-03 00:29:43
0001-01-01 00:00:00      -62135769600    0001-01-03 00:29:43
0000-01-01 00:00:00      -62167392000    0001-01-03 00:29:43
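
The LEGACY shift can be decomposed with plain timedelta arithmetic. This sketch assumes the shift is the sum of a two-day calendar rebase and the difference between a +03:00 and a +02:30:17 zone offset (the two offsets implied by the Hive 2 dt2 values for 1900 and 1901 earlier in this post). That decomposition is an interpretation, not documented behavior.

```python
from datetime import datetime, timedelta

# Value written by Hive 3 and value shown by spark-sql in LEGACY mode (year 0003 row).
written = datetime(3, 1, 1, 0, 0, 0)
shown_legacy = datetime(3, 1, 3, 0, 29, 43)
shift = shown_legacy - written

# Assumed components of the shift.
calendar_rebase = timedelta(days=2)  # Julian vs proleptic Gregorian in year 3
zone_mismatch = timedelta(hours=3) - timedelta(hours=2, minutes=30, seconds=17)

print(shift)                                      # 2 days, 0:29:43
print(shift == calendar_rebase + zone_mismatch)   # True
```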

or with CORRECTED (which is almost right):

dt                       dt2             dt3
0003-01-01 00:00:00      -62072697600    0003-01-01 00:00:00
0002-01-01 00:00:00      -62104233600    0002-01-01 00:00:00
0001-01-01 00:00:00      -62135769600    0001-01-01 00:00:00
0000-01-01 00:00:00      -62167392000    0001-01-01 00:00:00 --notice the year!

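Question 1: Why can't Hive 3 handle timestamps in years 0000 and 0001 correctly?
Question 2: How can I read both the old table (written by Hive 2) and the new table (written by Hive 3) in the same Spark session?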