Group-wise min/max and filter without a join in Pig

ssgvzors  posted on 2021-06-02  in Hadoop

I am trying to compute (max + min) / 2 for each group. Here is my schema:

UrlXpathsCount: {url: chararray,leafpathstr: chararray,urlpath_count: long}

I group it by the url field:

byUrl = GROUP UrlXpathsCount by url;

I tried to find the max and min needed for (max + min) / 2 as follows:

midRangeByUrl = FOREACH byUrl{
    urls_desc = order UrlXpathsCount by urlpath_count desc;
    urls_max = limit urls_desc 1;
    urls_asc = order UrlXpathsCount by urlpath_count asc;
    urls_min = limit urls_asc 1;

    GENERATE FLATTEN(urls_max),FLATTEN(urls_min);
};

Here is the schema of midRangeByUrl:

midRangeByUrl: {urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long}

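The problem I am facing now is that adding FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me a lot of combinations that I don't want.
What I want is (max + min) / 2 for each group.
To get it, I project the urlpath_count of the max and min rows as follows: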

computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;

Then I join the two relations:

/* Join computeMidRange  and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
    UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;

and then apply a filter:

templates = FILTER midRangeOut by urlpath_count > midRange;

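I would like to avoid the midRangeJoin step. Is there a way to compute midRangeByUrl and project the fields url, urlpath_count, leafpathstr, and (min + max) / 2 without using a JOIN?
Please help me figure this out. Thanks.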

628mspwn1#

You can use the built-in MAX and MIN UDFs:

UrlXpathsCount = load 'your_data' using PigStorage(',') as (url: chararray,leafpathstr: chararray,urlpath_count: long);
B = GROUP UrlXpathsCount by url;
C = foreach B generate group as url, MAX(UrlXpathsCount.urlpath_count) as max_count, 
                                     MIN(UrlXpathsCount.urlpath_count) as min_count;
D = foreach C generate url, ((double)max_count + (double)min_count)/2 as val;

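This does exactly what you need, with no nested FOREACH or JOIN. I split the computation into C and D to avoid a very long line, but you could also write it as a single statement. Remember to cast the values to double: your urlpath_count is a long, so without the cast the division is integer division and you lose the fractional part (for example, (5 + 2) / 2 would give 3 rather than 3.5).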
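If the per-row fields (leafpathstr, urlpath_count) are also needed next to the midrange, as the question asks, one possible sketch is to flatten the grouped bag in the same FOREACH that computes MAX and MIN. This assumes the UrlXpathsCount relation and schema from the question; the relation name withMid is made up for illustration and is not part of the original answer.

```pig
-- Assumes UrlXpathsCount: {url: chararray, leafpathstr: chararray, urlpath_count: long}
byUrl = GROUP UrlXpathsCount BY url;

-- FLATTEN emits one output row per tuple in the group's bag;
-- the scalar midRange computed over that same bag is repeated on each of those rows.
-- withMid is a hypothetical relation name.
withMid = FOREACH byUrl GENERATE
    FLATTEN(UrlXpathsCount) AS (url: chararray, leafpathstr: chararray, urlpath_count: long),
    ((double)MAX(UrlXpathsCount.urlpath_count)
        + (double)MIN(UrlXpathsCount.urlpath_count)) / 2 AS midRange;

-- Keep only the rows whose count is above their group's midrange.
templates = FILTER withMid BY urlpath_count > midRange;
```

This avoids the JOIN, but flattening the bag alongside the aggregates means each group's bag has to be materialized, so Pig is unlikely to use the combiner for MAX and MIN here. For groups with very many rows, the original join-based version may still be worth comparing.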
