基于周范围的窗口上的列值总和(impala)

6za6bjd0  于 2021-06-26  发布在  Impala
关注(0)|答案(1)|浏览(452)

给出如下表格:

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

我正在寻找一种有效的方法,将当前行(当前行包含在sum中)之后6天内发生的连接数相加,并按客户端的\u id进行分区,这将导致:

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

问题:
我不想添加所有丢失的日期,然后执行一个滑动窗口来计算下面的7行,因为我的表已经非常大了。
我用 Impala 和 range between interval '7' days following and current row 不支持。
编辑:我正在寻找一个通用的答案,考虑到这一事实,我将需要改变窗口大小为更大的数字(例如30天以上)

zzwlnbp8

zzwlnbp81#

这回答了问题的原始版本。
Impala 并不完全支持 range between . 不幸的是,这并没有留下很多选择。一是使用 lag() 有很多明确的逻辑:

select t.*,
       ( (case when lag(date, 6) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 6) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 5) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 5) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 4) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 4) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 3) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 3) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 2) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 2) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date, 1) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections, 1) over (partition by client_id order by date)
               else 0
          end) +
         connections
        ) as connections_within_6_days         
from t;

不幸的是,这并不能很好地概括。如果你想要一个大范围的天数,你可能想问另一个问题。

相关问题