Merge update records in a final table

Posted by dm7nw8vv on 2021-06-24 in Hive


I have a user table in Hive of the following form:

User: 
Id    String,
Name  String,
Col1  String,
UpdateTimestamp Timestamp

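The data to insert into this table comes from files in the following format:
I/U, timestamp when the record was written to the file, Id, Name, Col1, UpdateTimestamp
E.g., for an insert of the user with id 1: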

I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456

And for an update of col1 for the same user with id 1:

U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457

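Columns that have not been updated come through as null.
Simple inserts are easy in Hive: use LOAD DATA INPATH into a staging table, then ignore the first two fields of the staging table.
But how do I handle the update statements, so that my final row in Hive looks like the following?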

1,Bob,updatedstuff,123457

I was thinking of inserting all rows into a staging table and then performing some kind of merge query. Any ideas?

sql Hive

Source: https://stackoverflow.com/questions/57591030/merge-update-records-in-a-final-table

2 Answers


biswetbf1#

You can use last_value() with the option to ignore nulls:

-- timestamp is a reserved word in Hive, so the column name is quoted with backticks
select h.id,
       coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.`timestamp`)) as name,
       coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.`timestamp`)) as col1,
       update_timestamp
from history h;

You can use row_number() and a subquery if you want only the most recent record.
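A minimal sketch of that row_number() variant, assuming the same `history` table and `timestamp` ordering column as the query above (neither is defined in the question, so treat both names as placeholders). The inner query carries the latest non-null value forward for each id, and the outer query keeps only the newest row per id.

```sql
-- Sketch only: `history` and its `timestamp` column are taken from the
-- query above and are assumptions about the staging layout.
select id, name, col1, update_timestamp
from (
  select h.id,
         -- latest non-null name/col1 seen so far for this id
         last_value(h.name, true) over (partition by h.id order by h.`timestamp`
           rows between unbounded preceding and current row) as name,
         last_value(h.col1, true) over (partition by h.id order by h.`timestamp`
           rows between unbounded preceding and current row) as col1,
         h.update_timestamp,
         -- rn = 1 marks the most recent row for each id
         row_number() over (partition by h.id order by h.`timestamp` desc) as rn
  from history h
) latest
where rn = 1;
```

For the two sample lines in the question, this should leave a single row for id 1 with the name from the insert and the col1 from the update.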


wwwo4jvm2#

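Typically with a merge statement, your "file" would still be unique on id, and the merge statement would determine whether it needs to insert a new record or update the values of an existing one.
However, if the file format is non-negotiable and will always be in the I/U format, you could break the process into two steps, as you suggested: the inserts, then the updates.
To perform updates in Hive, the users table needs to be stored as ORC and ACID must be enabled on your cluster. For my example, I would create the users table with a cluster key and the transactional table property: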

create table test.orc_acid_example_users
(
  id int
  ,name string
  ,col1 string
  ,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
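The staging table and the insert step are not shown in this answer. A rough sketch of what they might look like follows, assuming a comma-delimited text staging table named as in the merge statement further down. The column `record_ts` and the file path are hypothetical, and the sketch assumes real files carry parseable timestamp values in the last field.

```sql
-- Sketch only: record_ts and the HDFS path are hypothetical names.
create table test.orc_acid_example_staging
(
  type string               -- 'I' or 'U'
  ,record_ts timestamp      -- time the record was written to the file
  ,id int
  ,name string
  ,col1 string
  ,updatetimestamp timestamp
)
row format delimited fields terminated by ','
stored as textfile
-- read empty fields as NULL rather than as empty strings
tblproperties('serialization.null.format'='');

load data inpath '/path/to/user_changes.csv'
into table test.orc_acid_example_staging;

-- step 1: copy the insert rows, dropping the first two staging fields
insert into table test.orc_acid_example_users
select id, name, col1, updatetimestamp
from test.orc_acid_example_staging
where type = 'I';
```

The `serialization.null.format` property matters here because an unchanged column arrives as an empty field, which a string column would otherwise store as an empty string instead of NULL.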

After your insert statement, the Bob record would show "stuff" in col1.

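As for the updates, you could handle them with an UPDATE or a MERGE statement. I think the key here is the null values: if the staging table loaded from the file has a null value, the original name, col1, or other column value should be kept. Here is a merge example that coalesces the staging table's fields. Basically, if there is a value in the staging table, take it; otherwise fall back to the original value.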

-- apply only the update rows; keep the existing value where the staging field is null
merge into test.orc_acid_example_users as t
  using test.orc_acid_example_staging as s
on t.id = s.id
  and s.type = 'U'
when matched
  then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1);

Now Bob will show "updatedstuff".

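Quick disclaimer: if you have more than one update for Bob in the staging table, things will get messy. You would need a preprocessing step that collects the latest non-null values across all the updates before doing the update/merge. Hive isn't really a complete transactional database; it would be preferable for the source to send full user records whenever there is an update, rather than only the changed fields.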
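One way that preprocessing step could be folded into the merge itself is to use a subquery as the merge source, so each id contributes exactly one row. This is only a sketch: `record_ts` is a hypothetical staging column holding the time the record was written to the file, and copying `updatetimestamp` from the newest update row is an assumption based on the final row the question asks for.

```sql
-- Sketch only: record_ts is a hypothetical staging column.
merge into test.orc_acid_example_users as t
using (
  select id, name, col1, updatetimestamp
  from (
    select u.id,
           -- latest non-null value across all updates for this id
           last_value(u.name, true) over (partition by u.id order by u.record_ts
             rows between unbounded preceding and unbounded following) as name,
           last_value(u.col1, true) over (partition by u.id order by u.record_ts
             rows between unbounded preceding and unbounded following) as col1,
           u.updatetimestamp,
           -- rn = 1 marks the newest update row for each id
           row_number() over (partition by u.id order by u.record_ts desc) as rn
    from test.orc_acid_example_staging u
    where u.type = 'U'
  ) ranked
  where rn = 1
) as s
on t.id = s.id
when matched then update set
  name = coalesce(s.name, t.name),
  col1 = coalesce(s.col1, t.col1),
  updatetimestamp = s.updatetimestamp;
```

Because the source now has at most one row per id, the merge no longer has to deal with several staging rows matching the same target row.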

