Q: How to unnest a bag from a complex data structure in Pig

Posted by 8ljdwjyq on 2021-05-29 in Hadoop

Initially I have a structure like this:

+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  |  1 | 110 |
| 10:35 | send  | c  | d  |  1 | 120 |
| 10:31 | reply | e  | f  |  3 | 221 |
| 10:33 | reply | a  | c  |  1 | 210 |
| 10:34 | send  | a  | a  |  3 | 113 |
| 10:32 | reply | c  | d  |  3 | 157 |
+-------+-------+----+----+----+-----+

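I want to normalize the table:
group the entries by id,
in each group, find the oldest send-type entry,
replace s1 and s2 of the other entries with the values from that oldest send-type entry.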

+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  |  1 | 110 |
| 10:35 | send  | a  | b  |  1 | 120 |
| 10:33 | reply | a  | b  |  1 | 210 |
| 10:31 | reply | a  | a  |  3 | 221 |
| 10:34 | send  | a  | a  |  3 | 113 |
| 10:32 | reply | a  | a  |  3 | 157 |
+-------+-------+----+----+----+-----+

This is how I tried to solve it:

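events_groupby_id = GROUP events BY id;
events_normalized = FOREACH events_groupby_id {
    -- keep only the send-type entries of this group
    f_reqs = FILTER events BY type matches 'send';
    -- sort the filtered bag (not the full events bag) so that req is the oldest send entry
    o_reqs = ORDER f_reqs BY time ASC;
    req = LIMIT o_reqs 1;
    GENERATE req, events;
};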

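This is where I am stuck. events_normalized has become a complex structure with nested bags, and I don't know how to flatten it correctly.
events_normalized | req:bag{:tuple()} | events:bag{:tuple()}
What should I do from here to get the data structure I want? I would really appreciate any help. Thank you.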

enxuqcxy1#

You can unnest the bags in events_normalized using FLATTEN:

events_flattened = FOREACH events_normalized GENERATE 
    FLATTEN(req), 
    FLATTEN(events);

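This produces the cross product of req and events, but since req contains only one tuple, you get exactly one record per original entry. The schema of events_flattened is: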

req::time | req::type | req::s1 | req::s2 | req::id | req::p1 | events::time | events::type | events::s1 | events::s2 | events::id | events::p1
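
To make the row count concrete, here is a minimal Python sketch of the cross-product behaviour described above. It is only a model of the semantics, not Pig code: the helper name `flatten_two_bags` is hypothetical, and the tuples are the id = 1 rows from the question.

```python
from itertools import product

def flatten_two_bags(req, events):
    """Model of GENERATE FLATTEN(req), FLATTEN(events) for one group:
    every tuple in req is paired with every tuple in events."""
    return [r + e for r, e in product(req, events)]

# One group (id = 1): req holds a single tuple, events holds three.
req = [("10:30", "send", "a", "b", 1, 110)]
events = [
    ("10:30", "send", "a", "b", 1, 110),
    ("10:35", "send", "c", "d", 1, 120),
    ("10:33", "reply", "a", "c", 1, 210),
]

rows = flatten_two_bags(req, events)
print(len(rows))     # 3 (1 x 3)
print(len(rows[0]))  # 12 fields: six from req, six from events
```

In this model an empty req yields no rows at all. Pig documents the same behaviour for flattening an empty bag, so a group without any send entry would likely drop out of the result.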

So now you can take the events fields for the original entries and the req fields for the replacement values from the oldest send-type entry:

final = FOREACH events_flattened GENERATE 
    events::time AS time, 
    events::type AS type, 
    req::s1 AS s1, 
    req::s2 AS s2, 
    events::id AS id, 
    events::p1 AS p1;
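
As a sanity check of the whole pipeline, the following Python sketch replays the same steps (group by id, keep the send entries, take the earliest, project) on the sample rows from the question. It is a plain-Python model under the assumption that time is a zero-padded HH:MM string, so that string comparison orders it correctly; the names `EVENTS` and `normalize` are made up for illustration and are not part of Pig.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (time, type, s1, s2, id, p1)
EVENTS = [
    ("10:30", "send", "a", "b", 1, 110),
    ("10:35", "send", "c", "d", 1, 120),
    ("10:31", "reply", "e", "f", 3, 221),
    ("10:33", "reply", "a", "c", 1, 210),
    ("10:34", "send", "a", "a", 3, 113),
    ("10:32", "reply", "c", "d", 3, 157),
]

def normalize(events):
    """Replace s1/s2 of every row with those of the oldest 'send' row of its id."""
    result = []
    by_id = sorted(events, key=itemgetter(4))         # GROUP events BY id
    for _, group in groupby(by_id, key=itemgetter(4)):
        group = list(group)
        sends = [e for e in group if e[1] == "send"]  # FILTER ... 'send'
        if not sends:
            continue  # empty req bag: flattening it would drop the group
        req = min(sends, key=itemgetter(0))           # ORDER BY time ASC, LIMIT 1
        for time, typ, _s1, _s2, id_, p1 in group:    # FLATTEN + final projection
            result.append((time, typ, req[2], req[3], id_, p1))
    return result

for row in normalize(EVENTS):
    print(row)
```

The printed rows match the target table in the question: every id = 1 row carries s1 = a, s2 = b, and every id = 3 row carries s1 = a, s2 = a. Note that the model picks the earliest entry among the send rows only. If the earliest entry of any type were used instead, the id = 3 group would pick up e and f from the 10:31 reply.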
