hive 获取每组的完整行集

idfiyjo8  于 2023-08-04  发布在  Hive
关注(0)|答案(2)|浏览(130)

编辑将扩大此范围以包括R标记

你好
我有一个按id和月份排序的表,与下面的表类似,除了它有数百万个id's和数百个var列,month列可以是2023年的任何一个月。我需要填写每个ID缺失的月份,并添加一列,指示该ID在特定月份是否处于活动状态。如果id在表中没有特定月份的记录,则假定它们处于非活动状态。所需表中的总行数将是12* 个不同的id。我还为示例输入添加了所需的输出。对于R解决方案,我更喜欢tidyverse解决方案,但data.table选项也可以工作,因为它会更快,我只是可能不擅长修改它。任何帮助都是感激的。
以下是R的dputs

input =structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L), month = c(202301L, 202304L, 202305L, 
202301L, 202302L, 202303L, 202304L, 202305L, 202306L, 202307L, 
202308L, 202309L, 202310L, 202311L, 202312L, 202307L, 202308L
), var1 = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L), var2 = 1:17), class = "data.frame", row.names = c(NA, 
-17L))

output = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), month = c(202301L, 202302L, 
202303L, 202304L, 202305L, 202306L, 202307L, 202308L, 202309L, 
202310L, 202311L, 202312L, 202301L, 202302L, 202303L, 202304L, 
202305L, 202306L, 202307L, 202308L, 202309L, 202310L, 202311L, 
202312L, 202301L, 202302L, 202303L, 202304L, 202305L, 202306L, 
202307L, 202308L, 202309L, 202310L, 202311L, 202312L), var1 = c(1L, 
NA, NA, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA, NA, NA, 17L, 
18L, NA, NA, NA, NA), var2 = c(1L, NA, NA, 2L, 3L, NA, NA, NA, 
NA, NA, NA, NA, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
15L, NA, NA, NA, NA, NA, NA, 16L, 17L, NA, NA, NA, NA), active = c(1L, 
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 
0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -36L))

字符串
样品输入:

+----+--------+------+------+
| id | month  | var1 | var2 |
+----+--------+------+------+
|  1 | 202301 |    1 |    1 |
|  1 | 202304 |    2 |    2 |
|  1 | 202305 |    3 |    3 |
|  2 | 202301 |    5 |    4 |
|  2 | 202302 |    6 |    5 |
|  2 | 202303 |    7 |    6 |
|  2 | 202304 |    8 |    7 |
|  2 | 202305 |    9 |    8 |
|  2 | 202306 |   10 |    9 |
|  2 | 202307 |   11 |   10 |
|  2 | 202308 |   12 |   11 |
|  2 | 202309 |   13 |   12 |
|  2 | 202310 |   14 |   13 |
|  2 | 202311 |   15 |   14 |
|  2 | 202312 |   16 |   15 |
|  3 | 202307 |   17 |   16 |
|  3 | 202308 |   18 |   17 |
+----+--------+------+------+


预期输出:

+----+--------+------+------+--------+
| id | month  | var1 | var2 | active |
+----+--------+------+------+--------+
|  1 | 202301 |    1 |    1 |      1 |
|  1 | 202302 |      |      |      0 |
|  1 | 202303 |      |      |      0 |
|  1 | 202304 |    2 |    2 |      1 |
|  1 | 202305 |    3 |    3 |      1 |
|  1 | 202306 |      |      |      0 |
|  1 | 202307 |      |      |      0 |
|  1 | 202308 |      |      |      0 |
|  1 | 202309 |      |      |      0 |
|  1 | 202310 |      |      |      0 |
|  1 | 202311 |      |      |      0 |
|  1 | 202312 |      |      |      0 |
|  2 | 202301 |    5 |    4 |      1 |
|  2 | 202302 |    6 |    5 |      1 |
|  2 | 202303 |    7 |    6 |      1 |
|  2 | 202304 |    8 |    7 |      1 |
|  2 | 202305 |    9 |    8 |      1 |
|  2 | 202306 |   10 |    9 |      1 |
|  2 | 202307 |   11 |   10 |      1 |
|  2 | 202308 |   12 |   11 |      1 |
|  2 | 202309 |   13 |   12 |      1 |
|  2 | 202310 |   14 |   13 |      1 |
|  2 | 202311 |   15 |   14 |      1 |
|  2 | 202312 |   16 |   15 |      1 |
|  3 | 202301 |      |      |      0 |
|  3 | 202302 |      |      |      0 |
|  3 | 202303 |      |      |      0 |
|  3 | 202304 |      |      |      0 |
|  3 | 202305 |      |      |      0 |
|  3 | 202306 |      |      |      0 |
|  3 | 202307 |   17 |   16 |      1 |
|  3 | 202308 |   18 |   17 |      1 |
|  3 | 202309 |      |      |      0 |
|  3 | 202310 |      |      |      0 |
|  3 | 202311 |      |      |      0 |
|  3 | 202312 |      |      |      0 |
+----+--------+------+------+--------+

zz2j4svz

zz2j4svz1#

with ids as (
    select distinct id from T
), months as (
    -- assuming there's at least one active id every month
    select distinct month from T
    -- where month between '202301' and '202312'
)
select i.id, m.month, t.var1, t.var2,
    case when t.id is null then 1 else 0 end as active
from ids i cross join months m left outer join T t
    on t.id = i.id and t.month = m.month;

字符串
老实说,我真的不知道什么特别是关于Hive,但这是一般的想法,以“致密化”您的结果。缺失行的输出将为空。

rdlzhqv9

rdlzhqv92#

若要将缺少的月份添加到表中,可以将INSERT语句与生成缺少的月份的SELECT语句沿着使用。下面是一个将缺少的月份插入DOCTOR_CHECKUP表的示例:

INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE)
SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH
FROM DUAL
CONNECT BY LEVEL <= 24
MINUS
SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY')
FROM DOCTOR_CHECKUP;

字符串
此查询将缺失的月份插入DOCTOR_CHECKUP表的CHECKED_DATE列。您可能需要调整列名和日期格式以匹配表结构。
可以使用子查询从同一表或不同表的另一列中选择数据,并将其包含在INSERT语句中。下面是一个例子:

INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE, ANOTHER_COLUMN)
SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH, ANOTHER_COLUMN
FROM ANOTHER_TABLE
CONNECT BY LEVEL <= 24
MINUS
SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY'), ANOTHER_COLUMN
FROM DOCTOR_CHECKUP;


在此示例中,INSERT语句中包含ANOTHER_TABLE表中ANOTHER_COLUMN列的数据。可以将ANOTHER_TABLE替换为包含要包括的列的表的名称,将ANOTHER_COLUMN替换为要包括的列的名称。

相关问题