Spark Streaming creating many small files

Posted by polhcujo on 2021-05-29 in Hadoop


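I have implemented a Spark Streaming job that streams the events received over the past 6 months into HDFS.
It creates many small files in HDFS, and I would like each file to be 128 MB, the HDFS block size.
If I use append mode, all the data is written to a single Parquet file instead.
How do I configure Spark to create a new Parquet file in HDFS for every 128 MB of data?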

hadoop apache-spark pyspark

Source: https://stackoverflow.com/questions/51682016/spark-streaming-creating-many-small-files

1 Answer


q3qa4bjr1#

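Spark writes one file per partition of the object being written, which can be really inefficient. To reduce the total number of part files, try the following: it estimates the total size of the object in bytes and repartitions it into (size / 128 MB) + 1 partitions.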

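import org.apache.spark.util.SizeEstimator

// Estimate the in-memory (JVM heap) size in bytes; this is not the on-disk Parquet size
val estimatedBytes: Long = SizeEstimator.estimate(inputDF.rdd)
// Find an appropriate number of partitions: one per 128 MB (134217728 bytes), plus one
val numPartitions: Long = (estimatedBytes / 134217728L) + 1
// Repartition so that a subsequent write produces that many part files
val outputDF = inputDF.repartition(numPartitions.toInt)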
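
To make the arithmetic in the snippet above concrete, here is a minimal sketch of the same partition-count calculation as a standalone function that runs without Spark. The names `PartitionSizing`, `TargetBytes` and `partitionsFor` are made up for illustration and are not part of Spark; the 128 MB target is taken from the question.

```scala
object PartitionSizing {
  // 128 MB (134217728 bytes), the HDFS block size mentioned in the question.
  val TargetBytes: Long = 128L * 1024 * 1024

  // Hypothetical helper: same formula as the answer, (bytes / target) + 1.
  def partitionsFor(estimatedBytes: Long, targetBytes: Long = TargetBytes): Int = {
    require(estimatedBytes >= 0, "estimatedBytes must be non-negative")
    require(targetBytes > 0, "targetBytes must be positive")
    // Integer division rounds down; the +1 guarantees at least one partition.
    // The result is capped so the conversion to Int cannot overflow.
    math.min(estimatedBytes / targetBytes + 1, Int.MaxValue.toLong).toInt
  }

  def main(args: Array[String]): Unit = {
    // Sample sizes: empty input, 100 MB, exactly 128 MB, and 1 GB.
    val samples = Seq(0L, 100L * 1024 * 1024, 128L * 1024 * 1024, 1024L * 1024 * 1024)
    samples.foreach { bytes =>
      println(s"$bytes bytes -> ${partitionsFor(bytes)} partition(s)")
    }
  }
}
```

Because of the `+ 1`, an input of exactly 128 MB gets 2 partitions, so each part file stays at or below the target rather than slightly above it. In a Spark job the result would be passed to `repartition` before the write, for example `inputDF.repartition(n).write.mode("append").parquet(outputPath)`, where `outputPath` is a placeholder. Since the size fed into the formula is an in-memory estimate and Parquet compresses on disk, the files actually written will probably be smaller than 128 MB.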

2021-05-29
