hive:指定组上的和(hiveql)

9cbw7uwe  于 2021-06-04  发布在  Hadoop
关注(0)|答案(6)|浏览(373)

我有一张table:

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

我想找到所有产品的总和为每组“键”,并附加到每一行。例如,对于key=1,找到所有产品的成本总和(20+10+5=35),然后将结果附加到key=1对应的所有行。所以最终结果是:

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

我更喜欢不使用子连接来执行此操作,因为这样效率会很低。我最好的办法是使用 over 功能与 sum 但我不能让它工作。我的最佳尝试:

SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;

我看了一下文件,但有这么神秘我不知道该怎么做。im使用HiveV0.12.0、HDPv2.0.6和HortonWorksHadoop发行版。

w8ntj3qf

w8ntj3qf1#

分析函数sum给出累计和。例如,如果您这样做了:

select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test

然后你会得到:

key    product_code    cost     total_costs
1      UK              20       20
1      US              10       30
1      EU              5        35
2      UK              3        3
2      EU              6        9

这似乎不是你想要的。
相反,您应该使用聚合函数sum,并结合自联接来完成以下任务:

select test.key, test.product_code, test.cost, agg.total_cost
from (
  select key, sum(cost) as total_cost
  from test
  group by key
) agg
join test
on agg.key = test.key;
9lowa7mx

9lowa7mx2#

你可以用 BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 不需要自我连接就可以实现。
代码如下:

SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
ct3nt3jp

ct3nt3jp3#

类似的答案(如果我们使用oracle emp表): select deptno, ename, sal, sum(sal) over(partition by deptno) from emp; 输出如下:

deptno  ename   sal sum_window_0
10  MILLER  1300    8750
10  KING    5000    8750
10  CLARK   2450    8750
20  SCOTT   3000    10875
20  FORD    3000    10875
20  ADAMS   1100    10875
20  JONES   2975    10875
20  SMITH   800     10875
30  BLAKE   2850    9400
30  MARTIN  1250    9400
30  ALLEN   1600    9400
30  WARD    1250    9400
30  TURNER  1500    9400
30  JAMES   950     9400
uyto3xhc

uyto3xhc4#

上面的table看起来像

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

用户想要一个总成本如下的tabel

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

因此,我们使用以下查询

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

到现在为止,一直都还不错。我还要一个专栏,统计每个国家的发生率

key    product_code    cost     total_costs     occurences
1      UK              20       35              2
1      US              10       35              1
1      EU              5        35              2
2      UK              3        9               2
2      EU              6        9               2

因此,我使用了以下查询

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;

可悲的是,这不起作用。我犯了个神秘的错误。为了排除查询中的错误,我想问一下我是否做错了什么。谢谢

h9vpoimq

h9vpoimq5#

这个问题给了我完美的结果 select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;

5t7ly7z5

5t7ly7z56#

与@vb\uanswer类似,使用 BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 声明。
因此,hiveql查询是:

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

相关问题