How to join two datasets with the same schema

yiytaume  posted on 2021-06-02  in Hadoop

Hi, I am fairly new to Pig programming and have run into a problem I am struggling to solve:
I have two datasets
A: (accountid:chararray, title:chararray, genre:chararray)

("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")

B: (accountid:chararray, title:chararray, genre:chararray)

("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")

The result I want should be
(accountid:chararray, {(),(),...})

(A123, {("A123", "Harry Potter", "Action/Adventure"),
        ("A123", "Sherlock Holmes", "Mystery"),
        ("A123", "Divergent", "Action"),
        ("A123", "Downton Abbey", "Drama")
        })

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama"),
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Currently I am doing:
ans = JOIN A BY accountid, B BY accountid;
but the result looks like
Schema: (accountid:chararray, {(accountid:chararray, title:chararray, genre:chararray), ...})

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama")}
       "B456", {
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Do you know what I am doing wrong?

kb5ga3dv


Try this:

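-- IMPORTANT: register datafu.jar (adjust the path/name to your DataFu jar)
register datafu.jar;
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
-- cogroup yields one row per id: (group, {rows from A}, {rows from B})
C = cogroup A by id, B by id;
-- keep the key and merge the two bags into a single bag
D = foreach C generate group as id, BagConcat(A, B);
dump D;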

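join simply pairs up the rows of the two relations as they are. You want to accomplish two things:
1. group all the rows in each relation that belong to the same account
2. join the two "grouped" relations (keeping only the ids that exist in both relations)
Both steps are performed by cogroup. Note that by default cogroup also keeps ids found in only one relation, with an empty bag for the other side; add the inner keyword to each input if you only want ids present in both. The best explanation I have read is: http://joshualande.com/cogroup-in-pig/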
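As a rough illustration of what cogroup produces for the two relations in the question, here is a minimal sketch. The file names 'A' and 'B' follow the snippet above, the relation name C_inner is made up for illustration, and the row shown in the comment is only indicative (the order of tuples inside a bag is not guaranteed).

```pig
-- Sketch only: load the two relations with the same schema.
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);

-- One output row per id, with the key in `group` and one bag per input (named A and B).
C = cogroup A by id, B by id;
describe C;
-- A row of C has the shape (values abbreviated):
-- (A123, {(A123,Harry Potter,...),(A123,Sherlock Holmes,...)}, {(A123,Divergent,...),(A123,Downton Abbey,...)})

-- Hypothetical variant: keep only ids that occur in both A and B.
C_inner = cogroup A by id inner, B by id inner;
dump C_inner;
```

This also shows why the output still has two separate bags after cogroup: the rows are grouped per input relation, not yet merged.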
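Your relation will now contain the group key (id) and two bags (one from A, one from B), each holding the rows from the original relation; the way to "merge" them into one bag is to use the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs with a lot of useful things in it. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html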
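If you would rather not depend on DataFu, one possible alternative (not part of the answer above, just a sketch using built-in operators) is to stack the two relations with union and then group once. This assumes both inputs really share the same schema, as they do in the question; the relation names U and G are made up for illustration.

```pig
-- Sketch only: same inputs as in the snippet above.
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);

-- Stack the rows of both relations into one relation with the shared schema.
U = union A, B;

-- Group once by id: each row is (group, {all rows with that id, from A and B}).
G = group U by id;
dump G;
```

Unlike the inner form of cogroup, this keeps ids that occur in only one of the two relations, so pick whichever behaviour matches what you need.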
