Counting distinct elements in a bag

v09wglhw, posted 2021-06-21 in Pig

Suppose I have an alias transactions with this data:

person  store  spent
A       S      3.3
A       S      4.7
B       S      1.2
B       T      3.4

I want to know how many distinct people visited each store, and how much they spent there:

store   visitors  revenue
S       2         9.2
T       1         3.4

I was hoping I could do it in one step:

stores = foreach (group transactions by store) generate
  group as store, SUM(transactions.spent) as revenue, 
  COUNT(UNIQUE(transactions.person)) as visitors;

But there does not seem to be a UNIQUE function.
Am I stuck with a two-step process?

tr1 = foreach (group transactions by (store,person)) generate
  group.store as store, SUM(transactions.spent) as revenue;
stores = foreach (group tr1 by store) generate
  group as store, COUNT(tr1) as visitors, SUM(tr1.revenue) as revenue;

nafvub8i #1

DISTINCT should do what you want.
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#distinct

gijlo24d #2

Using the Distinct builtin UDF, you only need to replace UNIQUE with org.apache.pig.builtin.Distinct:

stores = foreach (group transactions by store) generate
    group as store, SUM(transactions.spent) as revenue, 
    COUNT(org.apache.pig.builtin.Distinct(transactions.person)) as visitors;

rkue9o1l #3

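There are two ways to do this:
1) Use the Distinct builtin UDF (not the DISTINCT Pig operator). Sorry, I don't have a code example, and I don't know how it would perform.
2) Use a nested FOREACH with the DISTINCT operator, like this: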

stores = FOREACH (GROUP transactions BY store) {
    people = transactions.person;
    uniqueVisitors = DISTINCT people;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}

One advantage of the second approach is that it should not disable the combiner; see the section on when the combiner is used in http://pig.apache.org/docs/r0.11.1/perf.html
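
For anyone who wants to try the nested FOREACH approach end to end, here is a minimal sketch. It assumes the sample rows from the question are stored in a tab-separated file; the file name transactions.tsv and the column types are my assumptions, not something given in the question.

```pig
-- Assumption: 'transactions.tsv' is a hypothetical tab-separated file
-- holding the person, store and spent columns from the question.
transactions = LOAD 'transactions.tsv' USING PigStorage('\t')
    AS (person:chararray, store:chararray, spent:double);

stores = FOREACH (GROUP transactions BY store) {
    -- Project the person column out of this store's bag,
    -- then remove duplicates so each visitor is counted once.
    people = transactions.person;
    uniqueVisitors = DISTINCT people;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}

-- Print one (store, visitors, revenue) tuple per store.
DUMP stores;
```

Projecting the column into its own alias before applying DISTINCT follows the pattern used in the nested-block examples in the Pig documentation, so it should work even on older releases.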
