在sql(配置单元)中为每个具有条件的id随机选择行

w41d8nur  于 2021-06-25  发布在  Hive
关注(0)|答案(1)|浏览(387)

下面是一个表,其中有85m行,具有~7m个不同的产品标识。日期范围为2019年1月1日至2019年10月31日(共10个月)。

product_id    status     date
60151028       LESS    2019-04-04
86069378       MORE    2019-08-12
99660145       LESS    2019-01-06
16775944       LESS    2019-05-12
52801157       LESS    2019-09-06
4493185        LESS    2019-09-13
39917883       LESS    2019-07-03
87716367       MORE    2019-05-24
90785856       MORE    2019-05-15
15412972       SAME    2019-07-11
12236638       SAME    2019-03-23
.
.
.
.
60151028       LESS    2019-10-12

我想为我的分析生成一个随机样本,并为每个id和状态选择4个随机行——例如,我想:
id为60151028且状态为“更少”的4行
4行,id为60151028,状态为“更多”
id为60151028的4行status=“same”和similar与所有产品id匹配——这意味着有12行具有相同的产品id。
数据应在这10个月内的所有日期中随机选取。以下是结果视图:

product_id  status      date
60151028       LESS    2019-04-04
60151028       LESS    2019-07-18
60151028       LESS    2019-09-23
60151028       LESS    2019-01-25
60151028       SAME    2019-05-14
60151028       SAME    2019-10-11
60151028       SAME    2019-03-31
60151028       SAME    2019-08-30
60151028       MORE    2019-04-27
60151028       MORE    2019-09-19
60151028       MORE    2019-10-29
60151028       MORE    2019-06-04
.
.
.
39917883       SAME    2019-08-03
39917883       SAME    2019-02-26
39917883       SAME    2019-10-07
.
.
.

我一直在思考,直到下面的查询(Hive),但我无法想到如何随机选择行的状态=“少”或“多”,等任何帮助将不胜感激!!
另外,我知道有一个rand()函数是hivesql,但我不知道如何在这里使用它。

select max(case when status = 'LESS' then 1 else 0 end) OVER (PARTITION by product_id) as flg
,detail1.*
from
(
select row_number() OVER
(partition by product_id order by date) as rn
from db.table 
) detail1

UNION ALL

select max(case when status = 'SAME' then 1 else 0 end) OVER (PARTITION by product_id) as flg
,detail2.*
from
(
select row_number() OVER
(partition by product_id order by date) as rn
from db.table  
) detail2

UNION ALL

select max(case when status = 'MORE' then 1 else 0 end) OVER (PARTITION by product_id) as flg
,detail3.*
from
(
select row_number() OVER
(partition by product_id order by date) as rn
from db.table  
) detail3
balp4ylt

balp4ylt1#

使用 partition by :

select t.*
from (select t.*
             row_number() OVER (partition by product_id, status order by rand()) as seqnum
      from db.table  
     ) t
where seqnum <= 4;

编辑:
要获取所有状态至少有4行的示例:

select t.*
from (select t.*
             row_number() OVER (partition by product_id, status order by rand()) as seqnum,
             max(ps_cnt) over (partition by product_id) as max_ps_cnt
      from (select t.*, count(*) over (partition by product_id, status) as ps_cnt
            from db.table t
           ) t
     ) t
where seqnum <= 4 and max_ps_cnt >= 4;

相关问题