sql—使用中断配置单元捕获连续年份范围

thigvfpy  于 2021-06-24  发布在  Hive
关注(0)|答案(2)|浏览(356)

我正在尝试在配置单元中编写一个查询,以返回具有年份范围的数据,如果这些数据是连续的年份以及间隔年份,如果这些年份之间存在间隔。
我正想弄清楚,但似乎找不到实现结果的逻辑。Hive逻辑是如何工作的。请帮忙。
输入

group_no            year 
1111                2003
1111                2004
1111                2005
1111                2008
1111                2010
1111                2011
1111                2012
2222                2015
3333                2014
3333                2015
3333                2017
3333                2019
4444                2010
4444                2012

输出:

group_no year
1111    [2003-2005,2008,2010-2012]
2222    [2015]
3333    [2014-2015,2017,2019]
4444    [2010,2012]
k4ymrczo

k4ymrczo1#

新范围开始于 (year - prev_year) > 1 or (prev_year is NULL) ,您可以将当前年度作为新范围的第一年。将第一年分配给所有行,然后计算每组的最后一年 (group_no, first_year) .

with my_data as(
    select stack(14,
    1111, 2003,
    1111, 2004,
    1111, 2005,
    1111, 2008,
    1111, 2010,
    1111, 2011,
    1111, 2012,
    2222, 2015,
    3333, 2014,
    3333, 2015,
    3333, 2017,
    3333, 2019,
    4444, 2010,
    4444, 2012  
    ) as (group_no, year)
    )   

select group_no, array_sort(collect_list(case when first_year=last_year then first_year else concat(first_year,'-',last_year) end)) as year
from
(--calculate last_year
select s.group_no, s.first_year, max(year) last_year      
from
(
select group_no, year, 
       --New range starts when (year - prev_year) > 1 or (prev_year is NULL)
       --Calculate first_year for every row
       max(case when (year - prev_year) = 1 then NULL else year end) over(partition by group_no order by year rows between unbounded preceding and current row ) first_year
  from
(
select d.*,
       lag(year) over(partition by group_no order by year) prev_year
  from my_data d
)s  
)s
group by s.group_no, s.first_year
)s
group by group_no 
order by group_no

结果:

group_no  year
1111  ["2003-2005","2008","2010-2012"]
2222  ["2015"]
3333  ["2014-2015","2017","2019"]
4444  ["2010","2012"]
w8f9ii69

w8f9ii692#

这是一个间隙和孤岛问题,您需要将具有相同属性的行组合在一起 group_no 谁的 year s是连续的。
下面是一种使用窗口函数的方法:其思想是使用 row_number() 以及 year 建立团队。然后可以为每组相邻记录聚合一次,最后按 group_no .

select 
    group_no, 
    collect_list(
        case when min_year <> max_year 
            then concat(min_year, '-', max_year)
            else min_year
        end
    ) year
from (
    select group_no, min(year) min_year, max(year) max_year
    from (
        select  t.*, row_number() over(partition by group_no order by year) rn
        from mytable t
    ) t
    group by group_no, year - rn
) t
group by group_no

我不确定Hive是否支持 order bycollect_list() 作为一个聚合函数-当作为窗口函数使用时,它看起来是这样的,所以这可能更好:

select distinct 
    group_no, 
    collect_list(
        case when min_year <> max_year 
            then concat(min_year, '-', max_year)
            else min_year
        end
    ) over(
        partition by group_no 
        order by min_year
        rows between unbounded preceding and unbounded following
    ) year
from (
    select group_no, min(year) min_year, max(year) max_year
    from (
        select  t.*, row_number() over(partition by group_no order by year) rn
        from mytable t
    ) t
    group by group_no, year - rn
) t

相关问题