使用pig脚本查找类似的条目

gmxoilav 于 2021-06-24 发布在 Pig

关注(0)|答案(2)|浏览(399)

我有如下数据

1,ref1,1200,USD,CR  
2,ref1,1200,USD,DR  
3,ref2,2100,USD,DR  
4,ref2,800,USD,CR  
5,ref2,700,USD,CR  
6,ref2,600,USD,CR

我想将这些记录分组，其中field2匹配，sum（field3）匹配，field5相反（意味着如果lhs是“cr”，那么rhs应该是“dr”，反之亦然）
如何使用pig脚本实现这一点？

apache-pig

来源：https://stackoverflow.com/questions/17420484/find-similar-entries-using-pig-script

2条答案

按热度按时间

a8jjtwal1#

您也可以这样做：

data = LOAD 'myData' USING PigStorage(',') AS
       (field1: int, field2: chararray,
        field3: int, field4: chararray,
        field5: chararray) ;

B = FOREACH (GROUP data BY (field2, field5)) GENERATE group.field2, data ;

-- Since in B there will always be two sets of field2 (one for CR and one for DR)
-- grouping by field2 again will get all pairs of CR and DR
-- (where the sums are equal of course)
C = GROUP B BY (field2, SUM(field3)) ;

最后一步的架构和输出：

C: {group: (field2: chararray,long),B: {(field2: chararray,data: {(field1: int,field2: chararray,field3: int,field4: chararray,field5: chararray)})}}
((ref1,1200),{(ref1,{(1,ref1,1200,USD,CR)}),(ref1,{(2,ref1,1200,USD,DR)})})
((ref2,2100),{(ref2,{(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR)}),(ref2,{(3,ref2,2100,USD,DR)})})

输出输出现在有点笨拙，但这会使它变得清晰：

-- Make sure to look at the schema for C above
 D = FOREACH C { 
        -- B is a bag containing tuples in the form: B: {(field2, data)}  
        -- What we want is to just extract out the data field and nothing else
        -- so we can go over each tuple in the bag and pull out
        -- the second element (the data we want).                     
        justbag = FOREACH B GENERATE FLATTEN($1) ;  
        -- Without FLATTEN the schema for justbag would be:
        -- justbag: {(data: (field1, ...))}  
        -- FLATTEN makes it easier to access the fields by removing data:
        -- justbag: {(field1, ...)}
        GENERATE justbag ;                                                  
 }

对此：

D: {justbag: {(data::field1: int,data::field2: chararray,data::field3: int,data::field4: chararray,data::field5: chararray)}}
({(1,ref1,1200,USD,CR),(2,ref1,1200,USD,DR)})
({(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR),(3,ref2,2100,USD,DR)})

赞(0）回复(0）举报 2021-06-24

pxyaymoc2#

我不确定我是否理解您的要求，但您可以加载数据，分成两组（筛选/拆分）和cogroup，例如：

data = load ... as (field1: int, field2: chararray, field3: int, field4: chararray, field5: chararray);

crs= filter data by field5='CR';
crs_grp = group crs by field1;
crs_agg = foreach crs_grp generate group.field1 as field1, sum(crs.field3);
drs = filter data by field5='DR';
drs_grp = group drs by field1;
drs_agg = foreach drs_grp generate group.field1 as field1, sum(drs.field3);
g = COGROUP crs_agg BY (field1, field3), drs_agg BY (field1, field3);

赞(0）回复(0）举报 2021-06-24

我来回答

使用pig脚本查找类似的条目

2条答案

相关问题

热门标签

最新问答