spark物理计划中的步骤未指定给dag步骤

rdlzhqv9 于 2021-06-25 发布在 Hive

关注(0)|答案(1)|浏览(376)

我试图调试sparksql中返回错误数据的简单查询。
在本例中，查询是两个配置单元表之间的简单联接。。这个问题似乎与spark生成的物理计划（使用catalyst引擎）中的某个步骤看起来已损坏有关，其中物理计划中的某些步骤尚未分配订单id，因此spark查询中连接右侧的所有计算都未完成
下面是示例查询

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')

joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count() ### shows incorrect value ### 
joined_df.explain(True)

下面是spark返回的物理计划的一个示例

== Physical Plan ==
 SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(encntr_id#0, 200)
:     +- *(1) Filter isnotnull(encntr_id#0)
:        +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader@a6df563
+-  Sort [encntr_id#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(encntr_id#12, 200)
      +-  Filter isnotnull(encntr_id#12)
         +- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader@60dd22d9

请注意，联接右侧的所有数据源扫描、筛选器交换和排序都没有分配订单id。
有人能帮我解释一下这个问题吗。。为什么看起来正确的物理计划没有分配评估订单id？

Hive apache-spark cloudera hortonworks-data-platform catalyst-optimizer

来源：https://stackoverflow.com/questions/60155823/steps-in-spark-physical-plan-not-assigned-to-dag-step

1条答案

按热度按时间

nlejzf6q1#

我从内部就知道了。
结果表明，spark优化例程可能会受到配置设置的影响
spark.sql.codegen.maxfields
这对spark如何优化“fat”表的读取有一定的影响。
在我的例子中，设置被设置为低，这意味着从连接的右侧（“fat”表）读取的dag阶段被执行，而没有被分配给whitestage codegen。
重要的是要注意，在这两种情况下，对配置单元数据的读取都返回相同的结果，只是对物理计划应用了不同的优化

赞(0）回复(0）举报 2021-06-25

我来回答

spark物理计划中的步骤未指定给dag步骤

1条答案

相关问题

热门标签

最新问答