基于配置单元表中的其他表值更新列值

yrdbyhpb  于 2021-06-24  发布在  Hive
关注(0)|答案(2)|浏览(274)

我在 hive 里有两张像下面这样的table stg . 此表主要是快照表,每天都会被覆盖
此表数据将插入到 history 纽约每天的餐桌 partition Day 1 stg

+-----+------------+------------+
| pk  | from_d     | to_d       |
+-----+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
``` `history` 此表按列进行分区 `load_date` ```
+-----+------------+------------+------------+
| pk  | from_d     | to_d       |load_date   |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+

问题陈述:
1) 如果我收到任何 PK 已经存在于 history 然后我需要更新表 to_d 历史上那个pk的专栏。
2) 那个 to_d 列的值应为 from_d - 1 day 价值观 STG table。
3) 还需要考虑 PK 如果在另一天再次出现,那么更新应该只发生在历史上最新的记录上,而不是相同pk的所有记录。
请检查pk 111 在下面的数据示例中。
Day 2 stg+-----+------------+------------+ | pk | from_d | to_d | +-----+------------+------------+ | 111 | 2019-02-02 | 2019-02-02 | +-----+------------+------------+ | 333 | 2019-02-02 | 2019-02-02 | +-----+------------+------------+ | 444 | 2019-02-02 | 2019-02-02 | +-----+------------+------------+history 如下所示更新表格

+-----+------------+------------+------------+
| pk  | from_d     | to_d       | load_date  |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-02-02 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+

为了实现上述目标,我首先使用

insert overwrite table histroy partition(load_date) 
    select pk, from_d, 
    case when pk = '111' then '2019-02-01' when pk = '333' then '2019-02-01' else to_d end as to_d, 
    load_date 
from history;

完成后,我将第2天stg表插入历史表
Day 3 stg+-----+------------+------------+ | pk | from_d | to_d | +-----+------------+------------+ | 111 | 2019-03-03 | 2019-03-03 | +-----+------------+------------+ | 222 | 2019-03-03 | 2019-03-03 | +-----+------------+------------+ | 555 | 2019-03-03 | 2019-03-03 | +-----+------------+------------+history 日期如下

+-----+------------+------------+------------+
| pk  | from_d     | to_d       | load_date  |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-03-02 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-03-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 111 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 222 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 555 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+

为了实现上述目标,我已经做到了

insert overwrite table histroy partition(load_date) 
    select pk, from_d, 
    case when pk = '111' then '2019-03-02' else to_d end as to_d, 
    load_date 
from history 
where load_date = '2019-02-02';

insert overwrite table history partition(load_date) 
    select pk, from_d, 
    case when pk = '222' then '2019-03-02' else to_d end as to_d, 
    load_date 
from history 
where load_date = '2019-01-01';

然后插入 STG 表数据
我正在实现我想要的,但这是一个乏味的过程,必须有更好的方法比这一个。
注意:我不想使用 Update 关于这个问题的陈述。插入覆盖是我要找的

oyjwcjzk

oyjwcjzk1#

您可以分两个步骤来完成这项工作,这在性能方面可能会更好。
1.创建临时表 load_date 为每个 pk . 每次作业/脚本运行时都可以覆盖此表。 

create table if not exists rank_load_date_pk as 
select pk,from_d,to_d,row_number() over(partition by pk order by load_date desc) as rnum
from history
;

2.接下来需要处理的场景有3种。 pk 两者都存在 stg 以及 history . 在这种情况下,应选择最新的前一行,并对其进行适当的计算 to_d 剩下的 pk 来自 history . 在这种情况下,为每个行选择所有非最新的行 pk 中的所有行 stg   sql语句

insert overwrite table history partition(load_date) 
--common pk's with the latest load_date rows
select r.pk,r.from_d,coalesce(date_sub(s.to_d,1),r.to_d) as to_d,coalesce(s.to_d,r.load_date) as load_date
from rank_load_date_pk r
left join stg s on s.pk = r.pk 
where r.rnum = 1
union all
--remaining rows
select pk,from_d,to_d,load_date
from rank_load_date_pk 
where rnum > 1
union all
--stg all rows
select pk,from_d,to_d,to_d as load_date
from stg 
;
cidc1ykv

cidc1ykv2#

你可以像下面这样做
首先创建一个表并为每一行分配行号 PK 就像下面一样

create table stg_row_num as select *, 
               row_number() over ( partition by pk order by load_date desc) as row_num from stg;

上面的查询应该为您提供如下表

+---+----------+----------+----------+--------+
| pk|    from_d|      to_d| load_date| row_num|
+---+----------+----------+----------+--------+
|111|2019-03-03|2019-03-03|2019-03-03|       1|
|111|2019-02-02|2019-02-02|2019-02-02|       2|
|111|2019-01-01|2019-01-01|2019-01-01|       3|
|222|2019-03-03|2019-03-03|2019-03-03|       1|
|222|2019-01-01|2019-01-01|2019-01-01|       2|
|333|2019-02-02|2019-02-02|2019-02-02|       1|
|333|2019-01-01|2019-01-01|2019-01-01|       2|
|444|2019-02-02|2019-02-02|2019-02-02|       1|
|555|2019-03-03|2019-03-03|2019-03-03|       1|
+---+----------+----------+----------+--------+

一旦你有了上表,然后使用 LAG 功能如下

select pk, from_d,           
            case when row_num = 1 then to_d else date_sub(lag(to_d) over (), 1) end as to_d, 
            row_num from table;

这会给你想要的结果

+---+----------+----------+-------------------+
| pk|    from_d|      to_d|row_number_window_0|
+---+----------+----------+-------------------+
|111|2019-03-03|2019-03-03|                  1|
|111|2019-02-02|2019-03-02|                  2|
|111|2019-01-01|2019-02-01|                  3|
|222|2019-03-03|2019-03-03|                  1|
|222|2019-01-01|2019-03-02|                  2|
|333|2019-02-02|2019-02-02|                  1|
|333|2019-01-01|2019-02-01|                  2|
|444|2019-02-02|2019-02-02|                  1|
|555|2019-03-03|2019-03-03|                  1|
+---+----------+----------+-------------------+

希望这有帮助

相关问题