I am trying to find (max + min) / 2 for each group. Here is my schema:
UrlXpathsCount: {url: chararray,leafpathstr: chararray,urlpath_count: long}
I am grouping it by the url field:
byUrl = GROUP UrlXpathsCount by url;
I tried to find (max + min) / 2 as follows:
midRangeByUrl = FOREACH byUrl{
urls_desc = order UrlXpathsCount by urlpath_count desc;
urls_max = limit urls_desc 1;
urls_asc = order UrlXpathsCount by urlpath_count asc;
urls_min = limit urls_asc 1;
GENERATE FLATTEN(urls_max),FLATTEN(urls_min);
};
Here is the schema of midRangeByUrl:
midRangeByUrl: {urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long}
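The problem I am facing now is that adding FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me a lot of combinations that I do not want.
I want to get (max + min) / 2 for each group.
To do that, I projected the urlpath_count of the max and min rows as follows: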
computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;
Then I joined the two relations:
/* Join computeMidRange and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;
and then applied a filter:
templates = FILTER midRangeOut by urlpath_count > midRange;
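I want to avoid midRangeJoin. Is there some way to compute midRangeByUrl and project the fields url, urlpath_count, leafpathstr, (min+max)/2 without using a join?
Please help me figure this out. Thanks.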
1 Answer
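You can use the built-in MAX and MIN UDFs. This does exactly what you need, with no nested FOREACH or JOIN. I split the computation into C and D to avoid a very long line, but you could also do it in a single statement. Remember to cast the values to double: your urlpath_count is a long, so without the cast the division by 2 is integer division and you will not get a fractional result.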
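The answer's script itself is not present in this copy of the thread. The following is only a sketch of what such a script might look like, assuming the UrlXpathsCount schema from the question. The relation name B and the aliases max_count and min_count are hypothetical; C and D are used the way the answer describes them.

```pig
-- Sketch under assumptions: B, max_count and min_count are illustrative names.
-- Group the rows by url, as in the question.
B = GROUP UrlXpathsCount BY url;

-- Flatten the bag back into rows and attach the per-group max and min.
-- The casts to double keep the later division from truncating.
C = FOREACH B GENERATE
        FLATTEN(UrlXpathsCount),
        (double) MAX(UrlXpathsCount.urlpath_count) AS max_count,
        (double) MIN(UrlXpathsCount.urlpath_count) AS min_count;

-- Project the original fields plus the mid-range for the row's group.
D = FOREACH C GENERATE
        UrlXpathsCount::url           AS url,
        UrlXpathsCount::leafpathstr   AS leafpathstr,
        UrlXpathsCount::urlpath_count AS urlpath_count,
        (max_count + min_count) / 2   AS midRange;

-- Same filter as in the question, now without the replicated join.
templates = FILTER D BY urlpath_count > midRange;
```

Because MAX and MIN each return a single value per group, flattening the bag next to them repeats those two values on every row of the group rather than producing extra combinations.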