Merging sorted data with SQL

tjvv9vkg  posted on 2021-08-09 in Java

I have data like this:

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2013|
|  1|   b|      2014|    2015|
|  1|   a|      2016|    2017|
|  1|   a|      2018|    2019|
+---+----+----------+--------+

I want to merge consecutive periods that have the same hash value. So I expect to get data like this:

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2013|
|  1|   b|      2014|    2015|
|  1|   a|      2016|    2019|
+---+----+----------+--------+

(the last two rows are merged into a single period)
I tried a query like this:

%sql
select distinct 
 id, 
 hash,  
 min(start_date)  over(partition by hash) as start_date,  
 max(end_date) over(partition by hash) as  end_date 
from (
 select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
  union 
 select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
  union 
 select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
  union 
 select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
) t

The result is

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2019|
|  1|   b|      2014|    2015|
+---+----+----------+--------+

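This is wrong, because partitioning by hash alone puts every 'a' row into one group, while 2012-2013 and 2016-2019 should stay separate.
How can I get the correct result with Spark SQL?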

izj3ouym1#

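This is a gaps-and-islands problem. The simplest approach is the difference of row numbers. It works as long as there are no gaps between the periods: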

select id, hash, min(start_date) as start_date, max(end_date) as max_end_date
from (select t.*,
             row_number() over (partition by id, hash order by start_date) as seqnum_h,
             row_number() over (partition by id order by start_date) as seqnum
      from t
     ) t
group by id, hash, (seqnum - seqnum_h)
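
To see why the difference of row numbers identifies each island, here is a minimal sketch that runs the same query against the question's sample rows. It assumes SQLite (3.25 or later, for window functions) through Python's sqlite3 module as a stand-in for Spark SQL; the variable names `numbered`, `diffs`, and `islands` are illustrative only.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; the window functions used
# here (row_number) behave the same way in both.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id int, hash text, start_date int, end_date int)")
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [(1, "a", 2012, 2013), (1, "b", 2014, 2015),
     (1, "a", 2016, 2017), (1, "a", 2018, 2019)],
)

# The inner query from the answer: one row number per (id, hash),
# one row number per id.
numbered = """
select t.*,
       row_number() over (partition by id, hash order by start_date) as seqnum_h,
       row_number() over (partition by id order by start_date) as seqnum
from t
"""

# Step 1: consecutive rows with the same hash share the same difference.
diffs = conn.execute(
    "select hash, start_date, seqnum, seqnum_h, seqnum - seqnum_h as diff "
    f"from ({numbered}) order by start_date"
).fetchall()

# Step 2: grouping on that difference collapses each island into one row.
islands = conn.execute(
    "select id, hash, min(start_date), max(end_date) "
    f"from ({numbered}) group by id, hash, seqnum - seqnum_h "
    "order by min(start_date)"
).fetchall()

for row in diffs:
    print(row)
for row in islands:
    print(row)
```

On the sample rows the difference is 0 for the first `a` row and 1 for the two later ones, so they land in separate groups. Because row numbers ignore the actual dates, two `a` rows separated only by a gap in time (with no other hash in between) would still be merged, which is the caveat stated above.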

h79rfbju2#

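This is a gaps-and-islands problem. Here is an approach that uses lag() and a window sum to define the groups. The advantage of this approach is that it allows periods to occur concurrently on different time series for different ids.
Consider: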

select id, hash, min(start_date) start_date, max(end_date) end_date
from (
    select
        t.*,
        sum(case when start_date = lag_end_date + 1 then 0 else 1 end)
            over(partition by id, hash order by end_date) grp
    from (
        select 
            t.*, 
            lag(end_date) over(partition by id, hash order by end_date) lag_end_date
        from mytable t
    ) t
) t
group by id, hash, grp
order by id, min(start_date)
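
The intermediate columns can be inspected with a small sketch like the one below, which runs the query on the question's sample rows. It assumes SQLite (3.25 or later) through Python's sqlite3 module as a stand-in for Spark SQL; the names `grouped`, `steps`, and `merged` are illustrative only.

```python
import sqlite3

# Assumption: SQLite stands in for Spark SQL; lag() and a windowed sum()
# are available in both.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (id int, hash text, start_date int, end_date int)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?)",
    [(1, "a", 2012, 2013), (1, "b", 2014, 2015),
     (1, "a", 2016, 2017), (1, "a", 2018, 2019)],
)

# Previous end_date per (id, hash), then a running count of "breaks":
# a row starts a new group unless it begins right after the previous one.
grouped = """
select t.*,
       sum(case when start_date = lag_end_date + 1 then 0 else 1 end)
           over (partition by id, hash order by end_date) as grp
from (
    select t.*,
           lag(end_date) over (partition by id, hash order by end_date)
               as lag_end_date
    from mytable t
) t
"""

# Intermediate view: which group each row was assigned to.
steps = conn.execute(
    f"select hash, start_date, lag_end_date, grp from ({grouped}) "
    "order by start_date"
).fetchall()

# Final aggregation: one row per (id, hash, grp).
merged = conn.execute(
    f"select id, hash, min(start_date), max(end_date) from ({grouped}) "
    "group by id, hash, grp order by id, min(start_date)"
).fetchall()

for row in steps:
    print(row)
for row in merged:
    print(row)
```

The first row of each partition has no previous end_date, so the comparison is not true and the row opens a new group. Note that the condition `start_date = lag_end_date + 1` only merges periods that abut exactly, as the integer years in the sample data do.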
