I want to use Spark SQL window functions to do some aggregation and windowing.
Suppose I use the sample table from this post: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I want to run a query that gives me the top 2 products by revenue for each category, and also the product count for each category.
After I run this query:
```
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
```
my table looks like this:

| product | category | revenue | count |
|---------|----------|---------|-------|
| pro2    | tablet   | 6500    | 1     |
| mini    | tablet   | 5500    | 2     |
instead of:

| product | category | revenue | count |
|---------|----------|---------|-------|
| pro2    | tablet   | 6500    | 5     |
| mini    | tablet   | 5500    | 5     |

which is what I actually expected.
How should I write the query so I get the correct per-category count (without using another separate GROUP BY statement)?
2 Answers
qybjjes11#
In Spark, a window clause that contains an ORDER BY defaults to the frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so your count only covers the rows up to and including the current one. For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the count() window clause. Try with:
```
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
```
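As an aside (not part of the original answer, just a common simplification): since this count does not depend on row order, you can also drop the ORDER BY from the count window entirely; with no ORDER BY, the default frame is the whole partition:

```
count(*) OVER (PARTITION BY category) as count
```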
ktca8awb2#
Change

count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count

to

count(*) OVER (PARTITION BY category ORDER BY category DESC) as count

and you will get the expected result. This works because every row in a partition shares the same category value, so under the default RANGE frame all rows are peers and the count spans the entire partition. Try the code below.
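The answer's full query was not included; here is a minimal sketch of what it describes, assuming the same productRevenue table and aliases as in the question:

```
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    -- ordering by the partition key makes all rows peers, so count covers the whole partition
    count(*) OVER (PARTITION BY category ORDER BY category DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
```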