hive 配置单元HQL查询的执行时间较长,对此查询进行任何优化都将有所帮助

fhity93d  于 2023-01-13  发布在  Hive
关注(0)|答案(1)|浏览(353)

概述

我有一个大小为80 TB的clouddb.transaction表。
这将与input_parameters表连接,该表在任何时候都有1到10条记录。
当我运行下面的hive查询时,在Spark中执行它花了23分钟,并创建了137843个任务。
请求论坛建议,如果有任何更好的优化,我可以做这个查询。

Select DISTINCT
b.batch_id,
b.sid,
b.cust_id,
b.acc_no,
a.debit_11,
a.credit_13,
a.debit_15,
a.amount,
a.conversion_amount,
a.curr_cd,
a.apv_deny_cd,
a.approval,
a.sed10,
a.sed_pkey,
a.time_of_day_in,
a.auth_date,
a.atm_terminal_id,
a.atm_location_addr,
a.atm_street_addr,
a.atm_city_nm,
a.atm_state_cd,
a.atm_country_cd,
a.atm_zip_cd,
a.atm_country,
a.trx_1,
a.trx_2,
a.trx_3,
a.trx_4,
a.trx_5,
a.trx_6,
a.trx_7,
a.trx8
FROM
(Select cas.debit_11 as debit_11,
cas.credit_13 as credit_13,
cas.debit_15 as debit_15,
cas.amount as amount,
cas.conversion_amount as conversion_amount,
cas.curr_cd as curr_cd,
CASE WHEN cas.appr_deny_cd in ('0','1','6') THEN 'Approved' WHEN cas.appr_deny_cd = '2' THEN 'System Denied' WHEN cas.appr_deny_cd = '3' THEN 'Authorizer Denied' WHEN cas.appr_deny_cd = '4' THEN 'System Pending' WHEN cas.appr_deny_cd = '5' THEN 'Auth Pending' WHEN cas.appr_deny_cd = '7' THEN 'Denied' WHEN cas.appr_deny_cd = '8' THEN 'Pending' WHEN cas.appr_deny_cd = '9' THEN 'Timeout - Reject' ELSE cas.appr_deny_cd END as approval_deny_cd,
CASE WHEN appr_deny_cd in ('0','1','6') then 'approved' ELSE 'declined' END as approval,
cas.sed10 as sed10,
cas.sed_pkey as sed_pkey,
cas.time_of_day_in as time_of_day_in,
cas.trans_dt as Auth_date,
cas.atm_terminal_id,
cas.atm_location_addr,
cas.atm_street_addr,
cas.atm_city_nm,
cas.atm_state_cd,
cas.atm_country_cd,
cas.atm_zip_cd,
cas.atm_country,
cas.trx_1,
cas.trx_2,
cas.trx_3,
cas.trx_4,
cas.trx_5,
cas.trx_6,
cas.trx_7,
cas.trx_8 from clouddb.transaction cas
where cas.trans_dt between DATE_SUB(current_date(),180) and current_date()) a
JOIN 
(select in_tb.batch_id,in_tb.sid,in_tb.cust_id,in_tb.acc_no,in_tb.credit_13,in_tb.from_date_filter,in_tb.to_date_filter from input_parameters in_tb) b
ON b.credit_13 = a.credit_13 where a.auth_date between NVL(b.from_date_filter,DATE_SUB(current_date(), 730)) and NVL(b.to_date_filter,current_date());
yhuiod9q

yhuiod9q1#

编写spark/hive查询时,请按照以下步骤提高性能。
1.使用分区
1.在上面的例子中,正如你提到的input_parameters表在任何时候都只有1 - 10条记录。尝试在所有执行器之间广播这个表。也就是说,如果还没有启用广播连接,请启用它。
1.避免数据扭曲,如果需要的话使用盐处理。同时尝试并行化数据。
1.尝试使用矢量化。
1.在配置单元中启用动态分区

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
spark.sql.autoBroadcastJoinThreshold=true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

1.访问此链接了解更多信息https://spark.apache.org/docs/latest/tuning.html

相关问题