Uniqueness from a join in Pig

Posted by 9jyewag0 on 2021-06-03 in Hadoop

I have two tables: Name_SSN and Phone_Address.

Name_SSN contains:
joe xx x
jim xx x
bob xx x

Phone_Address contains:
joe 999-999-9990 sunset florida
joe 999-999-9991 sunset florida
joe 999-999-9992 sunset florida
jim 999-999-9994 sunny ca
jim 999-999-9994 sunny ca
bob 999-999-9999 raleigh va

I want to join them and get:
joe xx x sunset florida
jim xx x sunny ca
bob xx x raleigh va

I'm new to Pig and pretty clueless...
Thanks for your help,
Chris

hrysbysz #1

It sounds like you want to do an inner join in Pig. The following script should help:
name_address.pig

```
-- Load the two data files
namessn = LOAD 'Name_SSN.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
phoneaddr = LOAD 'Phone_Address.csv' USING PigStorage(',') AS (name:chararray, phone:chararray, address:chararray);

-- Perform an inner join of the two datasets on the "name" field
data_join = JOIN namessn BY name, phoneaddr BY name;

-- The join combines all fields from both datasets.
-- We only want a few fields, so generate them specifically.
data = FOREACH data_join GENERATE namessn::name AS name, namessn::ssn AS ssn, phoneaddr::address AS address;

-- You didn't say whether you wanted the data distinct or not.
-- DISTINCT removes duplicate (name, ssn, address) rows, which gives one row
-- per user as long as each user has a single address. If so, use this alias.
data_distinct = DISTINCT data;

-- Dump all of the aliases so you can see what's in them.
dump namessn;
dump phoneaddr;

dump data;
dump data_distinct;
```

Output from `dump namessn`:
```
(Joe,xxx-xx-xxx1)
(Jim,xxx-xx-xxx2)
(Bob,xxx-xx-xxx3)
```

Output from `dump phoneaddr`:
```
(Joe,999-999-9990,Sunset Florida)
(Joe,999-999-9991,Sunset Florida)
(Joe,999-999-9992,Sunset Florida)
(Jim,999-999-9994,Sunny CA)
(Jim,999-999-9994,Sunny CA)
(Bob,999-999-9999,Raleigh VA)
```

Output from `dump data`:
```
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)
(Joe,xxx-xx-xxx1,Sunset Florida)
```

Output from `dump data_distinct`:
```
(Bob,xxx-xx-xxx3,Raleigh VA)
(Jim,xxx-xx-xxx2,Sunny CA)
(Joe,xxx-xx-xxx1,Sunset Florida)
```
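
As a possible variation, the duplicates could be removed before the join rather than after it, so that the join handles fewer rows. This is only a minimal sketch: it assumes the `namessn` and `phoneaddr` aliases are loaded exactly as in the script above, and the aliases `addr_only`, `addr_distinct`, `joined`, and `result` are introduced here for illustration and are not part of the original answer.

```pig
-- Assumes namessn and phoneaddr are already loaded as in the script above.
-- Keep only the join key and the address, dropping the phone number.
addr_only = FOREACH phoneaddr GENERATE name, address;

-- Collapse repeated (name, address) pairs before joining.
addr_distinct = DISTINCT addr_only;

-- Inner join on name; each name now matches once per distinct address.
joined = JOIN namessn BY name, addr_distinct BY name;

-- Project the same three fields as the original script.
result = FOREACH joined GENERATE namessn::name AS name, namessn::ssn AS ssn, addr_distinct::address AS address;

dump result;
```

With the sample data this should give one row per name, like `data_distinct`, though the row order is not guaranteed. A name that appears with more than one distinct address would still produce more than one row under either approach.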
