Spark: case-sensitive partitionBy column

Posted by toe95027 on 2021-06-28 in Hive


I am trying to write out a DataFrame in ORC format through a HiveContext, using a partition key:

df.write().partitionBy("event_type").mode(SaveMode.Overwrite).orc("/path");

However, the column I am partitioning by contains values that differ only in case, and this raises an error on write:

Caused by: java.io.IOException: File already exists: file:/path/_temporary/0/_temporary/attempt_201607262359_0001_m_000000_0/event_type=searchFired/part-r-00000-57167cfc-a9db-41c6-91d8-708c4f7c572c.orc
The `event_type` column has both `searchFired` and `SearchFired` as values. If I remove either one from the DataFrame, the write succeeds. How can I solve this?
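To see which values are affected before writing, one option is to group the distinct values of the column by their lower-cased form. The sketch below is plain Scala with no Spark dependency; `CaseCollisionCheck` and `collidingValues` are hypothetical names, and the sample values are the two from the question plus an invented `click`. It assumes the clash comes from a file system that treats `event_type=searchFired` and `event_type=SearchFired` as the same directory, which the error message suggests but does not prove.

```scala
import java.util.Locale

object CaseCollisionCheck {
  // Group distinct values by their lower-cased form and keep only the
  // groups that contain more than one spelling.
  def collidingValues(values: Seq[String]): Map[String, Seq[String]] =
    values.distinct
      .groupBy(_.toLowerCase(Locale.ROOT))
      .filter { case (_, spellings) => spellings.size > 1 }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: in practice these would be the distinct values
    // of the partition column.
    val eventTypes = Seq("searchFired", "SearchFired", "click")
    collidingValues(eventTypes).foreach { case (folded, spellings) =>
      println(s"$folded <- ${spellings.mkString(", ")}")
    }
  }
}
```

Any key reported here would map two or more partition directories onto one path on a case-insensitive file system.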

Hive apache-spark apache-spark-sql spark-dataframe

Source: https://stackoverflow.com/questions/38597401/spark-case-sensitive-partitionby-column

1 Answer


yzxexxkh1#

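Relying on case differences in the file system is generally not a good idea.
The solution is to put values that differ only in case into the same partition (using the Scala DSL):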

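import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.expr

// Partition by a lower-cased copy so that case variants share one directory
df
  .withColumn("par_event_type", expr("lower(event_type)"))
  .write
  .partitionBy("par_event_type")
  .mode(SaveMode.Overwrite)
  .orc("/path")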

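This adds an extra column used only for partitioning. The original `event_type` column, with its case intact, is still stored in the data files. If the extra column causes problems, you can remove it with `drop` when reading the data back.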
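A minimal sketch of the full round trip, including the `drop` on read, might look like the following. It assumes Spark 2.3 or later with a local `SparkSession` and built-in ORC support, rather than the HiveContext from the question; the object name, the helper `writeAndReadBack`, the sample rows, the column `n`, and the temporary output directory are all made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lower}

object CaseFoldedPartitions {
  // Write df partitioned by a lower-cased helper column, then read it back
  // and drop the helper so only the original columns remain.
  def writeAndReadBack(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.withColumn("par_event_type", lower(col("event_type")))
      .write
      .partitionBy("par_event_type")
      .mode(SaveMode.Overwrite)
      .orc(path)
    spark.read.orc(path).drop("par_event_type")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("case-folded-partitions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with two spellings of the same event type.
    val df = Seq(("searchFired", 1), ("SearchFired", 2), ("click", 3))
      .toDF("event_type", "n")

    val path = Files.createTempDirectory("events-orc").toString
    val restored = writeAndReadBack(spark, df, path)
    restored.printSchema() // event_type and n only; par_event_type is gone
    restored.show()
    spark.stop()
  }
}
```

Both spellings land in the same `par_event_type=searchfired` directory, while the `event_type` values read back keep their original case.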

Answered on 2021-06-28
