I have the following transaction table:
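I want to compute the total quantity purchased per:
product
category (i.e. the total quantity of all products in the same category)
department (i.e. the total quantity of all products in the same department)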
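In addition, each of the totals above should be computed:
per shopper
per family (the sum of the total quantities of all shoppers in the same family).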
The output table should look like this:
For families, the total is computed once and then "copied" to every shopper in the same family.
To compute the totals per product / category / department across the table, I am using GROUPING SETS, as suggested in my previous question. That gives me total_quantity_individual correctly.
For total_quantity_family, I would use OVER(PARTITION BY), as suggested here for a much simpler table.
However, I don't know how to combine the two. There is not much information about combining OVER(PARTITION BY) with GROUPING SETS.
My query looks like this:
SELECT
    family_id,
    shopper_id,
    CASE
        WHEN GROUPING__ID = 6 THEN 'department'
        WHEN GROUPING__ID = 5 THEN 'category'
        WHEN GROUPING__ID = 3 THEN 'product'
    END AS total_level_type,
    CASE
        WHEN GROUPING__ID = 6 THEN department
        WHEN GROUPING__ID = 5 THEN category
        WHEN GROUPING__ID = 3 THEN product
    END AS id,
    SUM(quantity) AS total_quantity_shopper
    -- sum(sum(quantity)) OVER (PARTITION BY family_id, product) AS total_quantity_family
FROM
    transaction
GROUP BY
    family_id,
    shopper_id,
    product,
    category,
    department
GROUPING SETS (
    (family_id, shopper_id, product),
    (family_id, shopper_id, category),
    (family_id, shopper_id, department)
)
ORDER BY
    total_level_type;
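If OVER(PARTITION BY) is not suitable for my case, my other options might be:
Group transaction by family_id, run the grouping sets on the result, and then join that back to transaction.
Maybe a trick using explode() and LATERAL VIEW?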
For maintainability reasons, I would really prefer not to split the query into separate shopper and family versions.
Note: in case it helps, I am using Spark SQL with a HiveContext.
Thanks for your help!
EDIT: this seems to work:
...
SUM(quantity) AS total_quantity_shopper,
CASE
WHEN GROUPING__ID = 6 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, department)
WHEN GROUPING__ID = 5 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, category)
WHEN GROUPING__ID = 3 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, product)
END AS total_quantity_family
...
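As a minimal sketch of why the nested form in the edit can work: a window function is evaluated after GROUP BY, so sum(sum(quantity)) OVER (...) adds up the already-aggregated per-shopper totals inside each partition. The example below is an illustration under stated assumptions, not the original setup: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL (SQLite 3.25 or later is needed for window functions, and it has no GROUPING SETS, so only the product level is shown), the table is renamed to transactions because TRANSACTION is a keyword in SQLite, and the sample rows are made up.

```python
import sqlite3

# Hypothetical sample rows: (family_id, shopper_id, product, quantity).
rows = [
    (1, 10, "apple", 2),
    (1, 10, "apple", 1),
    (1, 11, "apple", 4),
    (1, 11, "pear", 5),
    (2, 20, "apple", 7),
]

conn = sqlite3.connect(":memory:")
# "transactions" is an assumed name; the question's table is called transaction.
conn.execute(
    "CREATE TABLE transactions ("
    "family_id INTEGER, shopper_id INTEGER, product TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

query = """
SELECT
    family_id,
    shopper_id,
    product,
    -- Inner aggregate: one total per (family, shopper, product) group.
    SUM(quantity) AS total_quantity_shopper,
    -- Window over the grouped rows: sums the per-shopper totals of a family.
    SUM(SUM(quantity)) OVER (PARTITION BY family_id, product)
        AS total_quantity_family
FROM transactions
GROUP BY family_id, shopper_id, product
ORDER BY family_id, shopper_id, product
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In this sketch both shoppers of family 1 get the same family total for a product they both bought, which is the "copy to every shopper" behaviour described in the question. With GROUPING SETS, the columns left out of a grouping set are NULL in its rows, which is presumably why the per-level CASE in the edit picks the right partition for each level.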
1 Answer
Use multiple sum() over() expressions with different partition by clauses: