SQL: counting lost values per batch

Posted by gmxoilav on 2021-06-24 in Hive

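I have a table test with the columns Batch and ID. I want to count how many IDs are missing from each batch compared with the earliest batch. For example, the query below compares batch 2 with batch 1 and counts the IDs from batch 1 that no longer appear in batch 2.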

SELECT COUNT(T1.ID) AS LOST_CNT FROM
(SELECT * FROM TEST WHERE BATCH=1)T1
LEFT JOIN (SELECT * FROM TEST WHERE BATCH=2)T2
ON T1.ID=T2.ID WHERE T2.ID IS NULL

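I want to get lost_cnt for every batch, since the number of batches will keep growing over time. Something like the following does not return what I want (I understand why; I am only including it here as a failed attempt).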

SELECT A.BATCH,
COUNT(DISTINCT CASE WHEN A.ID IS NULL THEN M.ID ELSE NULL END) AS lost_cnt
FROM
 (SELECT DISTINCT ID FROM TEST WHERE BATCH=(SELECT MIN(BATCH) FROM TEST)) M 
LEFT JOIN TEST A ON M.ID=A.ID 
GROUP BY 1;

Is there a way to get what I want?

rm5edbpk  1#

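You can use the LAG analytic function to find the previous batch, and NOT EXISTS to check whether each ID was present in that previous batch, as follows: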

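SELECT T.BATCH, T.ID
  FROM ( SELECT T.BATCH, T.ID, P.PREV_BATCH
           FROM YOUR_TABLE T
           JOIN ( -- LAG over the distinct batch values, so PREV_BATCH is the
                  -- preceding batch rather than the preceding row
                  SELECT B.BATCH,
                         LAG(B.BATCH) OVER (ORDER BY B.BATCH) AS PREV_BATCH
                    FROM (SELECT DISTINCT BATCH FROM YOUR_TABLE) B ) P
             ON P.BATCH = T.BATCH ) T
 -- keep rows whose ID does not appear in the previous batch
 WHERE NOT EXISTS (
    SELECT 1
      FROM YOUR_TABLE TT
     WHERE TT.BATCH = T.PREV_BATCH
       AND TT.ID = T.ID)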
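
To try the LAG plus NOT EXISTS idea on a small data set, here is a minimal sketch. It makes several assumptions: Python's sqlite3 module stands in for Hive (window functions need SQLite 3.25 or later), the rows are illustrative, and the helper name lost_vs_previous is hypothetical. It is also a variation rather than the query above: it counts, for each batch, the IDs of the previous batch that are no longer present.

```python
import sqlite3

# Illustrative (batch, id) rows; an assumption, not data supplied with this answer.
ROWS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]

QUERY = """
WITH batches AS (
    -- LAG over the DISTINCT batch values, so prev_batch is the preceding batch
    SELECT batch, LAG(batch) OVER (ORDER BY batch) AS prev_batch
    FROM (SELECT DISTINCT batch FROM your_table) d
)
SELECT b.batch, COUNT(*) AS lost_vs_prev
FROM batches b
JOIN your_table p ON p.batch = b.prev_batch      -- rows of the previous batch
WHERE NOT EXISTS (                               -- ...whose ID is gone now
    SELECT 1 FROM your_table c
    WHERE c.batch = b.batch AND c.id = p.id)
GROUP BY b.batch
ORDER BY b.batch
"""


def lost_vs_previous(rows):
    """Return (batch, number of IDs lost relative to the previous batch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (batch INTEGER, id INTEGER)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(lost_vs_previous(ROWS))
```

With these rows the sketch prints [(2, 1), (3, 1)]: ID 1 is lost going into batch 2 and ID 2 going into batch 3. The first batch, and any batch that lost nothing, produces no row because of the inner join and the NOT EXISTS filter.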

1bqhqjot  2#

In Hive, I would use window functions for this:

with first_batch as (
      -- rows of the earliest batch, each carrying the size of that batch
      select t.*, count(*) over () as num_in_first_batch
      from (select t.*,
                   min(batch) over () as min_batch
            from t
           ) t
      where batch = min_batch
     )
select b.batch,
       count(t.id) as num_from_first_batch,
       max(fb.num_in_first_batch) - count(t.id) as num_missing_from_first_batch
from first_batch fb cross join
     (select distinct batch from t) b left join
     t
     on t.batch = b.batch and t.id = fb.id
group by b.batch;
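
The window-function approach can be checked on a few rows with a minimal sketch like the one below. The assumptions: Python's sqlite3 module stands in for Hive (SQLite 3.25 or later for window functions), each (batch, id) pair occurs once, and the helper name missing_vs_first and the rows are illustrative. The cross join against the distinct batch list is a design choice of this sketch, so that a batch containing none of the first batch's IDs still gets a row.

```python
import sqlite3

QUERY = """
WITH first_batch AS (
    -- IDs of the earliest batch, each row carrying that batch's size
    SELECT id, COUNT(*) OVER () AS num_in_first_batch
    FROM (SELECT id, batch, MIN(batch) OVER () AS min_batch FROM t) s
    WHERE batch = min_batch
)
SELECT b.batch,
       MAX(fb.num_in_first_batch) - COUNT(t.id) AS num_missing
FROM first_batch fb
CROSS JOIN (SELECT DISTINCT batch FROM t) b        -- every (first-batch ID, batch) pair
LEFT JOIN t ON t.batch = b.batch AND t.id = fb.id  -- is that ID still present?
GROUP BY b.batch
ORDER BY b.batch
"""


def missing_vs_first(rows):
    """Return (batch, number of first-batch IDs missing from that batch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (batch INTEGER, id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Illustrative rows; batch 4 shares no ID with batch 1.
rows = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), (4, 9)]
print(missing_vs_first(rows))
```

For these rows the sketch prints [(1, 0), (2, 1), (3, 2), (4, 3)]; batch 4 is reported as missing all three first-batch IDs instead of being dropped.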

13z8s7eq  3#

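It is not entirely clear what you want to achieve, but I guess you want to find how many IDs have been lost compared with the first batch. You can simply filter the table down to the IDs that occur in the first batch, count the distinct IDs in each batch, and subtract that count from the first batch's count.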

with t as (
    select *
    from test
    where id in (
        select id
        from test
        where batch = (select min(batch) from test)
    )
)
select
    batch,
    (select count(distinct id)
     from t
     where batch = (select min(batch) from test)
    ) - count(distinct id) as missing
from t
group by batch
order by batch;

Sample data:

batch   id
1       1
1       2
1       3
2       2
2       3
2       4
3       3
3       4

Result:

batch   missing
1       0
2       1
3       2
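
To reproduce this result locally, a minimal sketch follows. It assumes Python's sqlite3 module as a stand-in for Hive and uses a hypothetical helper name, missing_by_filtering; the query is the same as above and the rows are the sample data.

```python
import sqlite3

QUERY = """
WITH t AS (
    -- keep only rows whose ID occurs in the earliest batch
    SELECT * FROM test
    WHERE id IN (SELECT id FROM test
                 WHERE batch = (SELECT MIN(batch) FROM test))
)
SELECT batch,
       (SELECT COUNT(DISTINCT id) FROM t
        WHERE batch = (SELECT MIN(batch) FROM test)) - COUNT(DISTINCT id) AS missing
FROM t
GROUP BY batch
ORDER BY batch
"""


def missing_by_filtering(rows):
    """Return (batch, first-batch IDs missing from that batch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (batch INTEGER, id INTEGER)")
    conn.executemany("INSERT INTO test VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


SAMPLE = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]
print(missing_by_filtering(SAMPLE))
```

This prints [(1, 0), (2, 1), (3, 2)], matching the table above. One thing to keep in mind: because the filter runs before the grouping, a batch that contains none of the first batch's IDs has no rows left and does not appear in the output at all.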
