如何在重叠的时段上使用多个计数(不同的…)优化配置单元查询？

8ljdwjyq 于 2021-06-02 发布在 Hadoop

关注(0)|答案(2)|浏览(364)

我有一个巨大的段落Hive查询代码如下：

select 
count(distinct case when click_day between ${hiveconf:dt_180}  and ${hiveconf:dt_end} and recommend_flag=1  then productid else null end) as unique_hk_products_cnt_180d,
count(distinct case when click_day between ${hiveconf:dt_90}  and ${hiveconf:dt_end}  and recommend_flag=1 then productid else null end) as unique_hk_products_cnt_90d,
count(distinct case when click_day between ${hiveconf:dt_30}  and ${hiveconf:dt_end}  and recommend_flag=1 then productid else null end) as unique_hk_products_cnt_30d,
count(distinct case when click_day between ${hiveconf:dt_15}  and ${hiveconf:dt_end}  and recommend_flag=1 then productid else null end) as unique_hk_products_cnt_15d,
count(distinct case when click_day between ${hiveconf:dt_7}  and ${hiveconf:dt_end}  and recommend_flag=1 then productid else null end) as unique_hk_products_cnt_7d
from mytable ;

这些字段之间的唯一区别是天数，它表示时间窗口的长度。这使得我的问题非常大，很难出错。
dt\u 15只是之前定义的字符串变量：

set dt_15 = CONCAT(SUBSTRING(date_sub(current_date,15), 1, 4), SUBSTRING(date_sub(current_date,15), 6, 2), SUBSTRING(date_sub(current_date,15), 9, 2));

有人能帮我把它重建得更简单吗？比如在新表中使用循环生成字段？
谢谢。

sql hadoop Hive

来源：https://stackoverflow.com/questions/42291030/how-to-optimize-a-hive-query-with-multiple-countdistinct-on-overlapping-per

2条答案

按热度按时间

yyyllmsg1#

试试这个：使用内置日期函数

set dt_15 = from_unixtime(unix_timestamp(date_sub(current_date,15),'yyyy-mm-dd'),'yyyymmdd')

因为这将删除concat和substring操作。

select 
count(case when click_day between ${hiveconf:dt_180}  and ${hiveconf:dt_end}   then productid else null end) as unique_hk_products_cnt_180d,
count(case when click_day between ${hiveconf:dt_90}  and ${hiveconf:dt_end}  then productid else null end) as unique_hk_products_cnt_90d,
count(case when click_day between ${hiveconf:dt_30}  and ${hiveconf:dt_end}  then productid else null end) as unique_hk_products_cnt_30d,
count(case when click_day between ${hiveconf:dt_15}  and ${hiveconf:dt_end}   then productid else null end) as unique_hk_products_cnt_15d,
count(case when click_day between ${hiveconf:dt_7}  and ${hiveconf:dt_end}   then productid else null end) as unique_hk_products_cnt_7d
from (select distinct click_day,productid where recommend_flag = 1 ) tmp ;

这将减少输入量。你也可以把 click_day < dt_end 如果对所有列都相同，则删除中间的。

赞(0）回复(0）举报 2021-06-02

qnzebej02#

试试这个

select  count (case when click_day between ${hiveconf:dt_180} and ${hiveconf:dt_end} then productid end) as unique_hk_products_cnt_180d
       ,count (case when click_day between ${hiveconf:dt_90}  and ${hiveconf:dt_end} then productid end) as unique_hk_products_cnt_90d
       ,count (case when click_day between ${hiveconf:dt_30}  and ${hiveconf:dt_end} then productid end) as unique_hk_products_cnt_30d
       ,count (case when click_day between ${hiveconf:dt_15}  and ${hiveconf:dt_end} then productid end) as unique_hk_products_cnt_15d
       ,count (case when click_day between ${hiveconf:dt_7}   and ${hiveconf:dt_end} then productid end) as unique_hk_products_cnt_7d

from   (select  click_day,recommend_flag,productid
               ,row_number() over 
                (
                    partition by    productid
                    order by        click_day desc    
                )   as rn

        from    mytable

        where   click_day between ${hiveconf:dt_180} and ${hiveconf:dt_end} 
            and recommend_flag=1
        ) t

where   rn = 1

附笔
你为什么要用非标准格式存储日期？

赞(0）回复(0）举报 2021-06-02

我来回答

如何在重叠的时段上使用多个计数(不同的…)优化配置单元查询？

2条答案

相关问题

热门标签

最新问答