Continuous rolling temporal window query in Hive 3

Posted by 7cjasjjr on 2022-12-09 in HDFS


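I'm trying to figure out how to query a table of dates and values (Hive 3) using a moving time-window aggregate. In the example below, I want to sum every two-day window where possible (so each non-terminal date is used twice).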
Sample data

| first_date | value |
| --- | --- |
| 2020-01-01 | 3 |
| 2020-01-02 | 4 |
| 2020-01-03 | 5 |
| 2020-01-04 | 6 |

Desired output (each two-day window summed)

| first_date | total |
| --- | --- |
| 2020-01-01 | 7 |
| 2020-01-02 | 9 |
| 2020-01-03 | 11 |
| 2020-01-04 | 6 |

I tried something like this, but with no luck:

select
  first_date,
  sum(value) over(
    partition by first_date
    range between first_date and first_date + interval '1' day
) as total

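Apparently I'm not allowed to use the partition column (the date) in the range clause, which is a bit inconvenient. I could duplicate the date column to get around this restriction, but there is probably a better way. What else can I try to make this work?
(Also, in practice any given date may have many samples, so counting adjacent rows is not reliable.)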

hdfs

Source: https://stackoverflow.com/questions/72310325/continuous-rolling-temporal-window-query-in-hive3

1 answer


o8x7eapl1#

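The range syntax is really odd. You can't reference a column inside the range clause. Instead, you name the column in ORDER BY and then define a range that is applied to that column.
Since you have dates, convert the date to a Unix timestamp and use 86,400 (the number of seconds in a day) as the 1-day range. It's ugly, but I don't think there is another option. (Running this with the CTE is very slow, at least in our environment.)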

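with cte as (
  -- Sample data: each date is converted to epoch seconds so RANGE can take a numeric offset.
  -- The pattern must be 'yyyy-MM-dd'; lowercase 'mm' means minutes, not months.
  -- UNION ALL keeps duplicate rows, which matters when a date has several samples.
  select unix_timestamp(cast('2020-01-01' as date), 'yyyy-MM-dd') as first_date, 3 as value
  union all select unix_timestamp(cast('2020-01-02' as date), 'yyyy-MM-dd'), 4
  union all select unix_timestamp(cast('2020-01-03' as date), 'yyyy-MM-dd'), 5
  union all select unix_timestamp(cast('2020-01-04' as date), 'yyyy-MM-dd'), 6
)
select
  from_unixtime(first_date),
  sum(value) over (
    -- partition by from_unixtime(first_date)
    order by first_date
    -- window covers the current row's timestamp through 86,400 seconds (1 day) later
    range between current row and 86400 following
  ) as total
from
  cte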
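
As a possible alternative to epoch seconds (not part of the original answer), the same RANGE idea can be sketched with a whole-day index, so the offset reads as `1 following`. The table `samples(first_date DATE, value INT)` and the names `daily`, `numbered`, `day_total` and `day_no` are hypothetical, introduced only for illustration. The sketch first collapses each date to one row, since a date may have many samples, and then windows over the day number produced by `datediff`.

```sql
-- Hypothetical table: samples(first_date DATE, value INT); several rows may share a date.
with daily as (
  -- One row per date, so repeated samples on a date are summed first.
  select first_date, sum(value) as day_total
  from samples
  group by first_date
),
numbered as (
  -- Whole days since 1970-01-01: an integer that RANGE can offset by 1.
  select first_date, day_total, datediff(first_date, '1970-01-01') as day_no
  from daily
)
select
  first_date,
  sum(day_total) over (
    order by day_no
    range between current row and 1 following  -- this day plus the next calendar day
  ) as total
from numbered
order by first_date;
```

Because RANGE works on the value of `day_no` rather than on row positions, a missing date simply contributes nothing to its neighbor's window. Whether this runs faster than the timestamp version would need to be measured in your own environment.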

Posted 2022-12-09
