Spark ORC fine-tuning (file size, stripes)

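My question has two parts:

How can I set (fine-tune) advanced ORC parameters using Spark?

Various posts suggest there might be issues: "Spark Small ORC Stripes" and "How to set ORC stripe size in Spark". I am currently using Spark 2.2 on an HDP 2.6.4 platform, so according to https://community.cloudera.com/t5/support-questions/spark-orc-stripe-size/td-p/189844 this should already be fixed. However, it is unclear to me how to set these parameters when executing: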

df.write.orc("/path/to/file")

Maybe it is simply:

df.write.options(Map("key"-> "value")).orc("/path/to/file")

However, I am not sure which keys I would need here.
Note: the native 1.4 version of ORC is used.

.set("spark.sql.orc.impl", "native")
  .set("spark.sql.hive.convertMetastoreOrc", "true")

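Choosing the right parameters:

My datasets are repartitioned and sorted using df.repartition(number, c1, c2, ...).sortWithinPartitions("c1", "c2", "c3", ...), i.e. a secondary sort. The order of the sort columns is chosen by the cardinality of the costly (long string) columns, with the highest cardinality first.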

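File size

I want to write gzipped ORC files to HDFS. I am aware of the small-file problem, which obviously should be prevented, but what about the other direction? For example, one of my datasets generates 800 MB gzipped ORC files (a single file inside a partition) if repartitioned accordingly. Are these 800 MB already considered too large? Should I try to size the files at roughly 300 MB? Or 400 MB? Keep in mind that they are already compressed.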

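Stripe size

Currently, I observe from:

java -jar orc-tools meta foo.orc

that for this file Spark seems to create stripes of roughly 16 MB in size, 49 of them in this particular case.
Here is an example of the output for the first stripe: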

Stripe 1:
    Column 0: count: 3845120 hasNull: false
    Column 1: count: 3845120 hasNull: false min: a max: b sum: 246087680
    Column 2: count: 3845120 hasNull: false min: aa max: bb sum: 30288860
    Column 3: count: 3845120 hasNull: false min: aaa max: bbb sum: 89174415
    Column 4: count: 3845120 hasNull: false
    Column 5: count: 3845120 hasNull: false min: 2019-09-24 00:00:00.0 max: 2019-09-24 23:45:00.0 min UTC: 2019-09-24 02:00:00.0 max UTC: 2019-09-25 01:45:00.0
    Column 6: count: 3845120 hasNull: false min: 2019-09-24 00:15:00.0 max: 2019-09-25 00:00:00.0 min UTC: 2019-09-24 02:15:00.0 max UTC: 2019-09-25 02:00:00.0
    Column 7: count: 3845120 hasNull: false min: 1 max: 36680 sum: 36262602

And in the detailed output after all stripes are listed (again for the first stripe):

Stripes:
  Stripe: offset: 3 data: 17106250 rows: 3845120 tail: 185 index: 51578
    Stream: column 0 section ROW_INDEX start: 3 length 55
    Stream: column 1 section ROW_INDEX start: 58 length 21324
    Stream: column 2 section ROW_INDEX start: 21382 length 3944
    Stream: column 3 section ROW_INDEX start: 25326 length 12157
    Stream: column 4 section ROW_INDEX start: 37483 length 55
    Stream: column 5 section ROW_INDEX start: 37538 length 4581
    Stream: column 6 section ROW_INDEX start: 42119 length 4581
    Stream: column 7 section ROW_INDEX start: 46700 length 4881
    Stream: column 1 section DATA start: 51581 length 57693
    Stream: column 1 section LENGTH start: 109274 length 16
    Stream: column 1 section DICTIONARY_DATA start: 109290 length 623365
    Stream: column 2 section DATA start: 732655 length 447898
    Stream: column 2 section LENGTH start: 1180553 length 148
    Stream: column 2 section DICTIONARY_DATA start: 1180701 length 968
    Stream: column 3 section DATA start: 1181669 length 2449521
    Stream: column 3 section LENGTH start: 3631190 length 6138
    Stream: column 3 section DICTIONARY_DATA start: 3637328 length 303255
    Stream: column 5 section DATA start: 3940583 length 5329298
    Stream: column 5 section SECONDARY start: 9269881 length 172
    Stream: column 6 section DATA start: 9270053 length 5334123
    Stream: column 6 section SECONDARY start: 14604176 length 172
    Stream: column 7 section DATA start: 14604348 length 2553483
    Encoding column 0: DIRECT
    Encoding column 1: DICTIONARY_V2[16914]
    Encoding column 2: DICTIONARY_V2[214]
    Encoding column 3: DICTIONARY_V2[72863]
    Encoding column 4: DIRECT
    Encoding column 5: DIRECT_V2
    Encoding column 6: DIRECT_V2
    Encoding column 7: DIRECT_V2
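
As a rough sanity check on these numbers, the following sketch redoes the arithmetic in plain Scala. It only uses figures copied from the output above; the object name `StripeMath` is made up for illustration, and the file-size estimate assumes that all 49 stripes are about as large as the first one, which the output shown here does not confirm.

```scala
object StripeMath {
  // Figures copied from the first stripe of the orc-tools output above.
  val dataBytes: Long  = 17106250L
  val indexBytes: Long = 51578L
  val tailBytes: Long  = 185L
  val rows: Long       = 3845120L
  val stripes: Int     = 49 // stripe count reported for this file

  val bytesPerMiB: Double = 1024.0 * 1024.0

  // Total on-disk size of the first stripe (index + data + tail).
  val stripeBytes: Long = dataBytes + indexBytes + tailBytes
  val stripeMiB: Double = stripeBytes / bytesPerMiB

  // Average compressed data bytes per row in this stripe.
  val bytesPerRow: Double = dataBytes.toDouble / rows

  // Assumption: every stripe is about as large as the first one.
  val estimatedFileMiB: Double = stripes * stripeMiB

  def main(args: Array[String]): Unit = {
    println(f"stripe size:         $stripeMiB%.2f MiB")
    println(f"data bytes per row:  $bytesPerRow%.2f")
    println(f"estimated file size: $estimatedFileMiB%.0f MiB")
  }
}
```

Under that assumption, 49 stripes of roughly 16 MiB each are consistent with the 800 MB file size mentioned earlier.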

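What is recommended here? The Hive defaults seem to mention 256 MB, but this seems to be a completely different value range from what Spark computes. What is the rationale here?
And why does: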

spark.conf.get("orc.dictionary.key.threshold")
java.util.NoSuchElementException: orc.dictionary.key.threshold

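fail, even though one can clearly see that dictionaries are being set somehow? Looking at Spark's code base, I cannot identify this property being set anywhere: https://github.com/apache/spark/search?q=orc.dictionary.key.threshold&unscoped_q=orc.dictionary.key.threshold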

ORC goodies

Recent versions of ORC introduced bloom filters and indices. Can these be used from Spark as well?

Other tuning tips

Please share any other tuning tips with me.

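Learnings (incomplete)

A fair part of the question is still open. Please improve the answer.

How to set tuning parameters in Spark

For advanced ORC settings, see:
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html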

usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .save("users_with_options.orc")

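In fact, these settings can simply be passed as .option to the writer. If you want to set them when starting Spark using --conf, make sure to prefix them with spark. (for example spark.orc.bloom.filter.columns), as otherwise they will be ignored.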
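
To make the naming difference concrete, here is a small plain-Scala sketch that turns writer option keys into the prefixed form described above. `OrcConfPrefix` and `toConfArgs` are hypothetical names introduced for illustration, and the prefix rule itself is taken from the observation in this answer rather than verified against Spark's source.

```scala
object OrcConfPrefix {
  // ORC writer options in the form passed to DataFrameWriter.option(...).
  val writerOptions: Map[String, String] = Map(
    "orc.bloom.filter.columns"     -> "favorite_color",
    "orc.dictionary.key.threshold" -> "1.0"
  )

  // Hypothetical helper: renders each option as a --conf argument,
  // adding the "spark." prefix that this answer reports as necessary.
  def toConfArgs(options: Map[String, String]): Seq[String] =
    options.toSeq.sortBy(_._1).map { case (key, value) =>
      s"--conf spark.$key=$value"
    }

  def main(args: Array[String]): Unit =
    toConfArgs(writerOptions).foreach(println)
}
```

Passing the options on the writer keeps them local to a single write, whereas --conf applies them to the whole application.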

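Choosing parameters

File size and stripe size

Choosing the right file size seems to be important. The bigger the better. In fact, I could observe a difference of about 1 GB between 5 files and 10 files (5 files required less storage).
https://community.cloudera.com/t5/community-articles/orc-creation-best-practices/ta-p/248963 states: ORC files are splittable at the stripe level. Stripe size is configurable and should depend on the average length (size) of the records and on how many unique values the sorted fields can have. If the search-by field is unique (or almost unique), decrease the stripe size; if it is heavily repeated, increase it. While the default is 64 MB, keep the stripe size between ¼ of the block size and 4 block sizes (the default ORC block size is 256 MB).
This means that larger stripes are better, but more time-consuming to create during the loading process (a tradeoff).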
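
The sizing guidance above can be turned into simple arithmetic. The sketch below is plain Scala with hypothetical names (`OrcSizing`, `filesFor`, `stripeRange`). It assumes the compressed output size is already known or estimated, which in practice usually requires writing a sample first.

```scala
object OrcSizing {
  val MiB: Long = 1024L * 1024L

  // Number of files (partitions) needed so that each file is at most
  // targetFileBytes, rounding up and never returning fewer than 1.
  def filesFor(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0 && targetFileBytes > 0, "sizes must be positive")
    math.max(1L, (totalBytes + targetFileBytes - 1) / targetFileBytes).toInt
  }

  // Stripe-size range from the quoted guideline:
  // between 1/4 of the block size and 4 block sizes.
  def stripeRange(blockBytes: Long): (Long, Long) =
    (blockBytes / 4, blockBytes * 4)

  def main(args: Array[String]): Unit = {
    println(filesFor(5L * 1024 * MiB, 1024 * MiB))
    val (lo, hi) = stripeRange(256 * MiB)
    println(s"${lo / MiB} MiB to ${hi / MiB} MiB")
  }
}
```

The result of `filesFor` could then be used as the partition count in `df.repartition(...)`. Note that for a 256 MB block size the lower bound of the range is the 64 MB default mentioned in the quote.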

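ORC goodies

Indexes were removed from the Hive side in Hive 3.0, as their functionality is implemented directly in the ORC file (min-max statistics are very effective for range predicates when the data is sorted, and bloom filters are used for equi-join conditions). https://cwiki.apache.org/confluence/display/hive/languagemanual+indexing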
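
To illustrate why min-max statistics work so much better on sorted data, here is a simplified plain-Scala model. It is not ORC's actual reader logic; `StripeStats`, `statsFor` and `stripesToRead` are hypothetical names, and each "stripe" holds just four integer values.

```scala
object StripePruning {
  // Min/max statistics of one column in one stripe (simplified to Int).
  final case class StripeStats(id: Int, min: Int, max: Int)

  // Builds per-stripe statistics for values written in the given order.
  def statsFor(values: Seq[Int], rowsPerStripe: Int): Seq[StripeStats] =
    values.grouped(rowsPerStripe).zipWithIndex.map { case (chunk, i) =>
      StripeStats(i, chunk.min, chunk.max)
    }.toList

  // A stripe has to be read only if its [min, max] range overlaps
  // the queried range [lo, hi]; all other stripes can be skipped.
  def stripesToRead(stats: Seq[StripeStats], lo: Int, hi: Int): Seq[Int] =
    stats.filter(s => s.max >= lo && s.min <= hi).map(_.id)

  def main(args: Array[String]): Unit = {
    val unsorted = Seq(7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10)
    // Unsorted data: every stripe spans a wide range, so none can be skipped.
    println(stripesToRead(statsFor(unsorted, 4), 4, 5))
    // Sorted data: ranges are disjoint, so only one stripe overlaps [4, 5].
    println(stripesToRead(statsFor(unsorted.sorted, 4), 4, 5))
  }
}
```

With the unsorted input all three stripes overlap the queried range, while after sorting only the middle stripe does.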
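Additionally, creating a bloom filter can make sense, but there is a tradeoff in storage and time. Once bloom filters have been created, they can be inspected with orc-tools as outlined before:
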
Stripes:
Stripe: offset: 3 data: 20833464 rows: 3475000 tail: 256 index: 3981255
Stream: column 0 section ROW_INDEX start: 3 length 52
Stream: column 0 section BLOOM_FILTER start: 55 length 17940
Stream: column 1 section ROW_INDEX start: 17995 length 31010
Stream: column 1 section BLOOM_FILTER start: 49005 length 610564
Stream: column 2 section ROW_INDEX start: 659569 length 4085
Stream: column 2 section BLOOM_FILTER start: 663654 length 378695
Stream: column 3 section ROW_INDEX start: 1042349 length 11183
Stream: column 3 section BLOOM_FILTER start: 1053532 length 1936342

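Sorting is crucial (https://community.cloudera.com/t5/community-articles/orc-creation-best-practices/ta-p/248963) and should be performed as a secondary sort (as outlined in the question).

Parameters

These seem to be useful and do not require super time-intensive fine-tuning: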

orc.dictionary.key.threshold=0.95 # force dict (almost) always (seems useful for almost all (non streaming) use cases)
orc.bloom.filter.columns "*" # do not use star, but select desired columns to save space
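
Putting the secondary sort and these options together, a write could look roughly like the sketch below. It needs Spark on the classpath. The helper name `writeSortedOrc`, the column names `c1`, `c2`, `c3`, and the choice of `c1` as the bloom filter column are placeholders, and `compression` set to `zlib` is assumed here to correspond to the "gzipped" ORC files mentioned in the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object OrcWrite {
  // Hypothetical helper: column names, partition count and path are placeholders.
  def writeSortedOrc(df: DataFrame, numPartitions: Int, path: String): Unit =
    df.repartition(numPartitions, col("c1"), col("c2")) // controls the number of output files
      .sortWithinPartitions("c1", "c2", "c3")           // secondary sort inside each file
      .write
      .option("compression", "zlib")                    // assumption: zlib codec for "gzipped" ORC
      .option("orc.dictionary.key.threshold", "0.95")
      .option("orc.bloom.filter.columns", "c1")         // selected columns instead of "*"
      .orc(path)
}
```

Sorting inside each partition rather than globally avoids an extra shuffle while still keeping the per-stripe min-max ranges narrow.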

Additionally, orc.column.encoding.direct might make sense (see https://orc.apache.org/specification/orcv1/ and search for the different encodings). Spark suggests the following (https://spark.apache.org/docs/latest/cloud-integration.html):

spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true

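Additional good read

https://www.slideshare.net/benjaminleonhardi/hive-loading-data

What is still open

More details about choosing the right parameters.

Why is the stripe size in Spark so much smaller (16-20 MB compared to the recommended 64 MB)? Maybe I need to try tweaking the stride.

Why do the stripes stay this small (even when trying to increase them)? Keep in mind: 2.2.x, HDP 2.6.4 and native ORC support should already contain the fix.

When should bloom filters be used, and when are they overkill? See https://www.slideshare.net/benjaminleonhardi/hive-loading-data

![](https://i.stack.imgur.com/0YcV7.png) ![](https://i.stack.imgur.com/UsgVw.png)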
