如何计算hiveql中每个城市最流行的设备、操作系统和浏览器?

hkmswyz6  于 2021-06-27  发布在  Hive
关注(0)|答案(2)|浏览(228)

我有一个带有用户代理字符串的表(我将其解析为 browser , os ,和 device 列)和城市 id 的。我想计算最流行的 browser , os 以及 device 对于每个 city .
以下是我的尝试:

select device os, browser, name, MAX(hits) as pop from 
(select uap.device, uap.os, uap.browser, name, COUNT(*) as hits 
from (select * from browserdata join citydata on cityid=id) t 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser 
GROUP BY uap.device, uap.os, uap.browser, name) t2 
GROUP BY name;

所以,最里面的子查询,别名 t 只需将我的表连接到另一个Map id 去城市的路怎么走 name s、 所以我可以看到真实的 name s、 而不是城市 id 在输出中。
然后,名为 t2 统计复合键的数量( device , browser , os , city ). 外部查询将所有内容分组 name 并提取具有最大用户数的行。
我得到的错误是:
失败:semanticexception[错误10025]:行1:7表达式不在按键“device”分组的组中
我明白这意味着什么。上面说我需要包括 device 进入 group by ,但如果我这么做了,那就不是在计算我想要什么。如何修复查询?
另外,我注意到我的一些配置单元查询在mapreduce上运行,但在tez上不运行。为什么?

56lgkhnf

56lgkhnf1#

WITH t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
  (SELECT t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, COUNT(*) as count FROM t2 GROUP BY t2.cityname, t2.os, t2.device, t2.browser),

t4 as
    (select cityname, MAX(count) as maximum from t3 group by cityname)

select t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
from t4 join t3 on t4.cityname=t3.cityname and t4.maximum=t3.count;

这是可行的,但我想知道是否有一种方法来优化它。。。

iqxoj9l9

iqxoj9l92#

使用分析函数可以消除不必要的连接:

WITH 
t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
(select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2 group by t2.cityname, t2.os, t2.device, t2.browser)

select cityname, maximum,  device, os, browser
 from
     (select cityname, device, browser, os, 
             max(count) over(partition by cityname)                         as maximum,
             dense_rank() over (partition by cityname order by count desc ) as rnk      
      from t3
     ) s  where rnk =1 
;

相关问题