What determines the size of Parquet files

5rgfhyps  posted on 2021-06-01 in Hadoop

I wrote a DataFrame to HDFS from spark-shell and got the output below. What I want to understand is what determines the size of the Parquet files being written. My dfs.block.size is set to:

scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728

That is 128 MB, so why are my files in the range of 20,000,000 bytes?

-rw-r--r--   1 hadoop supergroup          0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r--   1 hadoop supergroup   23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r--   1 hadoop supergroup   15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet

fivyi3re  #1

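The Parquet writer does not depend on the HDFS block size; you can just as well save Parquet to a local disk. What determines the number and size of the individual part-*.parquet files is the number of partitions in your DataFrame (64 in this case). If you run df.coalesce(1).write.parquet(...), you will get a single large part file.
If you want each part file to be about 128 MB, the coalesce argument should be 20 * 64 / 128 = 10 (roughly 20 MB per file, times 64 files, divided by the 128 MB target). However, part file size does not depend strictly linearly on the number of coalesced partitions: the fewer part files there are, the more efficiently encoding and compression work.
See the description of the coalesce method for details.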
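
The arithmetic in this answer can be written as a small helper. This is only a sketch: targetPartitions is a hypothetical name, the 20 MB per-file figure is the rough average taken from the listing in the question, and the output path in the comment is made up for illustration. The Spark calls are left as comments so the snippet also runs in a plain Scala REPL.

```scala
// Hypothetical helper (not part of Spark): estimate how many partitions
// to coalesce to so that each part file lands near a target size.
def targetPartitions(totalBytes: Long, targetFileBytes: Long): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)

val mb = 1024L * 1024L

// Assumption: 64 part files of roughly 20 MB each, as in the listing above.
val totalBytes = 64L * 20L * mb

// Target of 128 MB per part file.
val n = targetPartitions(totalBytes, 128L * mb)

// In spark-shell, with df being the DataFrame from the question:
//   df.rdd.getNumPartitions   // current partition count (64 in the question)
//   df.coalesce(n).write.parquet("/new_sample_parquet_test_coalesced")
```

Because compression and encoding tend to improve as the files get larger, the resulting part files may come out somewhat smaller than the target, so treat the computed value as a starting point rather than an exact answer.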
