PySpark unable to overwrite specific partitioned data when writing data into HDFS

Posted by t2a7ltrp on 2021-06-24 in Hive


I have a table, `customer_table`, that is partitioned on the three columns below. In HDFS it looks like this:

date=1901 > cus_id=A > cus_type=online > file
          > cus_id=B > cus_type=online > file
          > cus_id=C > cus_type=online > file
date=1902 > cus_id=A > cus_type=online > file
          > cus_id=B > cus_type=online > file
          > cus_id=C > cus_type=online > file
date=1903 > cus_id=A > cus_type=online > file
          > cus_id=B > cus_type=online > file
          > cus_id=C > cus_type=online > file

Now I have filtered the input data to keep only `cus_id = A`:

```
df_filtered = df_input.filter(df_input.cus_id == "A")
# df_filtered holds data from 1901, 1902 and 1903
```

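I have finished the DataFrame operations, and the newly computed DataFrame `df_filter_updated` must be written back over the customer table.
So only the `cus_id=A` folder data should be replaced inside every date=**** HDFS folder.
We are currently doing the following: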

df_filter_updated.write.option("compression", "snappy").mode("overwrite") \
    .partitionBy("date", "cus_id", "cus_type").parquet(hdfs_path)

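However, this overwrites the whole table instead of only the specific partition folders. How can we achieve this kind of overwrite?
In fact, the reason I am doing this is to compute a new column for all the old data in `customer_table`.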
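
One approach that may fit here, offered as a sketch and not as a confirmed answer from the thread, is Spark's dynamic partition overwrite mode. It assumes Spark 2.3 or later and an already created SparkSession passed in as `spark`. The helper names `partition_dirs` and `overwrite_matching_partitions` are hypothetical and only for illustration. With `spark.sql.sources.partitionOverwriteMode` set to `dynamic`, an `overwrite` write should replace only the partition directories that actually appear in the DataFrame being written, so the `cus_id=B` and `cus_id=C` folders would be left alone.

```python
# Sketch only: assumes Spark 2.3+ and an existing SparkSession. `spark`, `df`
# and `path` stand in for the question's session, df_filter_updated, hdfs_path.


def partition_dirs(rows, partition_cols=("date", "cus_id", "cus_type")):
    """Return the partition directories that the given rows map to.

    In dynamic mode these are the only directories Spark is expected to replace.
    """
    return sorted({
        "/".join(f"{col}={row[col]}" for col in partition_cols)
        for row in rows
    })


def overwrite_matching_partitions(spark, df, path):
    """Overwrite only the partitions present in `df` (hypothetical helper)."""
    # The default mode is "static", which clears everything under `path` first.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write
        .option("compression", "snappy")
        .mode("overwrite")
        .partitionBy("date", "cus_id", "cus_type")
        .parquet(path)
    )


# Plain-Python illustration of which folders a cus_id=A DataFrame would touch.
sample_rows = [
    {"date": 1901, "cus_id": "A", "cus_type": "online"},
    {"date": 1902, "cus_id": "A", "cus_type": "online"},
    {"date": 1903, "cus_id": "A", "cus_type": "online"},
]
print(partition_dirs(sample_rows))
```

In the question's terms the call would be `overwrite_matching_partitions(spark, df_filter_updated, hdfs_path)`. It is probably worth trying this on a copy of the table first, since the DataFrame must still contain all three partition columns for the write to land in the right folders.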

Hive python DataFrame pyspark pyspark-dataframes

Source: https://stackoverflow.com/questions/63355068/pyspark-unable-to-overwrite-specific-partitioned-data-when-writing-data-into-hdf
