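I am currently writing Parquet files with MapReduce. I configured the row group size as 256 MB, and the HDFS block size is also 256 MB. Each output file is about 1 GB.
So each generated file should contain 4 row groups. But when I run: parquet-tools meta path/to/my/file | grep "row group"
it reports 63 row groups with varying sizes and row counts: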
row group 1: RC:69816 TS:244168913
row group 2: RC:35111 TS:117407826
row group 3: RC:18488 TS:60107388
row group 4: RC:10357 TS:33260415
row group 5: RC:7905 TS:24956045
row group 6: RC:4754 TS:15149122
row group 7: RC:3862 TS:12476651
row group 8: RC:2738 TS:9001631
row group 9: RC:2104 TS:7120040
row group 10: RC:1910 TS:6398391
row group 11: RC:1508 TS:5219072
row group 12: RC:1386 TS:4676154
row group 13: RC:1124 TS:3950635
row group 14: RC:999 TS:3518545
row group 15: RC:865 TS:3121657
row group 16: RC:774 TS:2801614
row group 17: RC:678 TS:2490904
row group 18: RC:511 TS:1996167
row group 19: RC:69808 TS:243894989
row group 20: RC:30176 TS:99585195
row group 21: RC:20678 TS:67779524
row group 22: RC:10743 TS:34547874
row group 23: RC:8258 TS:26080110
row group 24: RC:5227 TS:16456577
row group 25: RC:4136 TS:13321721
row group 26: RC:3207 TS:10272043
row group 27: RC:2437 TS:8107932
row group 28: RC:1945 TS:6563867
row group 29: RC:1561 TS:5320028
row group 30: RC:1389 TS:4809485
row group 31: RC:1206 TS:4251584
row group 32: RC:996 TS:3581746
row group 33: RC:895 TS:3203224
row group 34: RC:757 TS:2869939
row group 35: RC:653 TS:2550716
row group 36: RC:531 TS:2008746
row group 37: RC:69706 TS:244420245
row group 38: RC:32703 TS:109391929
row group 39: RC:18640 TS:60918458
row group 40: RC:10737 TS:34272225
row group 41: RC:7812 TS:24814707
row group 42: RC:5176 TS:16206655
row group 43: RC:4123 TS:13224377
row group 44: RC:3391 TS:10946649
row group 45: RC:2138 TS:7248145
row group 46: RC:1960 TS:6566944
row group 47: RC:1538 TS:5294523
row group 48: RC:1355 TS:4744634
row group 49: RC:1225 TS:4194625
row group 50: RC:1026 TS:3587484
row group 51: RC:877 TS:3134267
row group 52: RC:785 TS:2846718
row group 53: RC:675 TS:2546836
row group 54: RC:538 TS:2016450
row group 55: RC:69762 TS:244915809
row group 56: RC:32390 TS:108310300
row group 57: RC:18095 TS:58754777
row group 58: RC:10759 TS:34405301
row group 59: RC:8195 TS:26029310
row group 60: RC:5286 TS:16597963
row group 61: RC:4231 TS:13415076
row group 62: RC:3538 TS:11465640
row group 63: RC:135 TS:688850
The row groups follow a repeating pattern. Does anyone know why Parquet does not honor the row group size I configured (256 MB)?
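The repeating pattern can be made explicit with a small script. The following is a minimal sketch, not part of parquet-tools: the helper names `parse_row_groups` and `split_cycles` are hypothetical, and the heuristic (start a new cycle whenever TS is larger than in the previous row group) is an assumption based on the listing above, where it would start new cycles at row groups 19, 37, and 55.

```python
import re

# Matches lines such as "row group 19: RC:69808 TS:243894989"
ROW_GROUP_RE = re.compile(r"row group (\d+):\s*RC:(\d+)\s+TS:(\d+)")


def parse_row_groups(text):
    """Return a list of (index, row_count, total_size) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in ROW_GROUP_RE.finditer(text)]


def split_cycles(groups):
    """Split row groups into cycles; a new cycle starts when TS grows again."""
    cycles = []
    for group in groups:
        if not cycles or group[2] > cycles[-1][-1][2]:
            cycles.append([])
        cycles[-1].append(group)
    return cycles


# A few lines copied from the listing above, around the first size jump.
sample = """\
row group 17: RC:678 TS:2490904
row group 18: RC:511 TS:1996167
row group 19: RC:69808 TS:243894989
row group 20: RC:30176 TS:99585195
"""

cycles = split_cycles(parse_row_groups(sample))
for cycle in cycles:
    total = sum(g[2] for g in cycle)
    print(f"row groups {cycle[0][0]}-{cycle[-1][0]}: {len(cycle)} groups, TS sum {total}")
```

Feeding it the full output of the parquet-tools command instead of `sample` shows how many row groups fall into each cycle and how much data each cycle holds.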
1 Answer
#1 yeotifhr
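This is a known issue, still unresolved, when writing Parquet files with parquet-mr. The row group sizing algorithm does not take compression into account, so it creates more row groups than expected.
You can find more information here: https://issues.apache.org/jira/browse/parquet-1337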
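To illustrate how ignoring compression could produce the shrinking sizes seen in the question, here is a deliberately simplified model. It is not the parquet-mr code. It assumes that each row group is capped by the space left in the current HDFS block, that the cap is compared against the uncompressed size while only the compressed bytes consume block space, that the compression ratio is constant, and that the writer skips to the next block once the remaining space drops to a padding limit. The function `simulate_row_groups` and all of its parameters are hypothetical, and sizes are in MB.

```python
def simulate_row_groups(block_size, row_group_size, compression_ratio,
                        max_padding, total_uncompressed):
    """Return the uncompressed size of each row group in this toy model."""
    sizes = []
    remaining_in_block = block_size
    left = total_uncompressed
    while left > 0:
        # Assumption: when little space is left, pad and start a new block.
        if remaining_in_block <= max_padding:
            remaining_in_block = block_size
        # The cap is applied to the uncompressed size...
        target = min(row_group_size, remaining_in_block, left)
        sizes.append(target)
        left -= target
        # ...but only the compressed bytes take up space in the block.
        remaining_in_block -= int(target * compression_ratio)
    return sizes


# Data that compresses to half its size: sizes halve, then the cycle restarts.
with_compression = simulate_row_groups(256, 256, 0.5, 8, 1000)
# No compression: row groups line up with the blocks.
without_compression = simulate_row_groups(256, 256, 1.0, 8, 1000)
print(with_compression)
print(without_compression)
```

In this model, uncompressed data yields the expected handful of full-size row groups, while compressible data yields a run of ever smaller row groups in each block, which resembles the pattern in the question. The real behaviour depends on details described in the linked issue.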