Pig query: selecting an entire tuple from a bag

Posted by siv3szwd on 2021-06-25 in Pig

Suppose I have a bag like this:

({(11983070,39010451,1139539437),(11983070,53425518,11000)})

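I want to select the tuple in the bag whose last value ($2) is the MAX, but so far I can only get the maximum value of each bag on its own, not the whole tuple.
I would like the output to be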

{(11983070,39010451,1139539437)}

But I can't get it to work. Any ideas?


xdyibdwo1#

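The idea is to find the max first, then add the max value as an extra column, and finally filter out every row that does not satisfy $2 == max value.
Rough code follows, adapted from this solution: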

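records = LOAD 'input.txt' AS (first:int, second:int, third:int);
-- Put all rows into a single group so MAX sees every record.
records_group = GROUP records ALL;
-- Emit each row again, with the overall maximum as a fourth column.
with_max = FOREACH records_group
       GENERATE
           FLATTEN(records.(first, second, third)), MAX(records.third) AS max_third;
-- After FLATTEN the third field is at position $2; keep rows that match the maximum.
max_row = FILTER with_max BY $2 == max_third;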
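
To make the three steps easier to follow, here is a minimal plain-Python sketch of the same logic (find the max, attach it as an extra column, filter on equality). It is only an illustration, not Pig code; the variable names mirror the Pig aliases above and the data is the sample bag from the question.

```python
# Plain-Python illustration of the max / annotate / filter steps (not Pig code).
records = [
    (11983070, 39010451, 1139539437),
    (11983070, 53425518, 11000),
]

# Step 1: find the maximum of the third field across all rows.
max_third = max(row[2] for row in records)

# Step 2: attach that maximum to every row as an extra column.
with_max = [row + (max_third,) for row in records]

# Step 3: keep only rows whose third field equals the maximum,
# then drop the helper column again.
max_rows = [row[:3] for row in with_max if row[2] == row[3]]

print(max_rows)  # [(11983070, 39010451, 1139539437)]
```

Note that an equality filter keeps every row that ties for the maximum, so this approach can return more than one row.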

qcbq4gxm2#

Although you can do this in pure Pig, using a UDF should be more efficient. It is also simple:
myudfs.py


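#!/usr/bin/python

# outputSchema is made available by Pig when the script is registered with Jython.
@outputSchema('Values:{(first:int, second:int, third:int)}')
def get_max(bag):
    # Pick the tuple whose third field (index 2) is the largest.
    v = max(bag, key=lambda x: x[2])

    # Since you want it to return in a bag, v needs to be in a list
    return [v]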

Pig script

REGISTER 'myudfs.py' USING jython AS myudfs ;

-- A is your input
B = FOREACH A GENERATE myudfs.get_max(my_input_bag) ;
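
The selection logic of the UDF can be checked locally before registering it with Pig. The sketch below assumes a plain Python interpreter, where Pig's decorator is not present, so it defines a hypothetical no-op stand-in named `outputSchema`; the sample bag is the one from the question.

```python
def outputSchema(schema):
    """Hypothetical no-op stand-in for the decorator Pig supplies under Jython."""
    def wrap(func):
        return func
    return wrap


@outputSchema('Values:{(first:int, second:int, third:int)}')
def get_max(bag):
    # Pick the tuple whose third field (index 2) is the largest.
    v = max(bag, key=lambda x: x[2])
    # Wrap the tuple in a list so that it is returned as a bag.
    return [v]


# A bag is modelled here as a list of tuples.
sample_bag = [
    (11983070, 39010451, 1139539437),
    (11983070, 53425518, 11000),
]

result = get_max(sample_bag)
print(result)  # [(11983070, 39010451, 1139539437)]
```

Unlike an equality filter, `max` returns a single tuple even when several tie for the maximum, and it raises `ValueError` on an empty bag, so empty bags may need to be filtered out beforehand.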
