Suppose my table looks like this:
cust_id, domain, year, mon, day
1, google.au, 2018, 10, 1
2, virgin.com.au, 2018, 10, 1
3, hotmail.au, 2018, 10, 1
4, yahoo.au, 2018, 10, 1
1, foobar.au, 2018, 10, 1
3, foobar.com.au, 2018, 10, 1
15, haha.com, 2018, 10, 1
11, hehe.net, 2018, 10, 1
I need to group by year/mon/day and aggregate columns based on different conditions:
1) count of distinct domains ending with .au but not .com.au
2) count of distinct domains ending with .com.au
3) count of distinct hostnames where cust_id in a specific list, let's assume (1, 2, 3, 4)
4) count of all distinct hostnames
So my output would look like:
2018, 10, 1, 4, 2, 6, 8
My inclination is to use a subquery for each condition and then join them:
select condition_1.year, condition_1.mon, condition_1.day, condition_1.c1, condition_2.c2, condition_3.c3, condition_4.c4
from
(select year, mon, day, count(distinct domain) c1 from mytable where year = 2018 and mon = 10 and day = 1
and domain rlike '[.]au' and domain not rlike '[.]com[.]au'
group by year, mon, day) condition_1
full outer join
(select count(distinct domain) c2 from mytable where year = 2018 and mon = 10 and day = 1
and domain rlike '[.]com[.]au') condition_2
full outer join
(select count(distinct domain) c3 from mytable where year = 2018 and mon = 10 and day = 1
and cust_id in (1, 2, 3, 4)) condition_3
full outer join
(select count(distinct domain) c4 from mytable where year = 2018 and mon = 10 and day = 1) condition_4
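I can't think of a better approach, but this seems extremely inefficient. A CASE statement doesn't seem to work here because I need distinct counts. How can I achieve this more efficiently?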
2 Answers
#1
Use collect_set(): it collects the set of distinct values, ignoring NULLs. Then apply the size function to get the number of elements, which are already distinct.
#2
This can be done with a regular expression and conditional aggregation.
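As a rough sketch of what such a query might look like (not necessarily the exact query the answer originally posted), each count can be an aggregate over a CASE expression that yields the domain when the condition holds and NULL otherwise. The sketch assumes Hive, the table and column names from the question, and aliases c1 to c4 chosen only for illustration. It borrows size(collect_set(...)) from the first answer, and writes the literal dot as [.], as the question does, which is equivalent to \. but sidesteps backslash escaping inside the string literal.

```sql
-- Sketch only: mytable, its columns, and the date filter come from the question;
-- the aliases c1..c4 are illustrative.
select year, mon, day,
       -- ends with .au but is not preceded by .com
       size(collect_set(case when domain rlike '(?<![.]com)[.]au$' then domain end)) as c1,
       -- ends with .com.au
       size(collect_set(case when domain rlike '[.]com[.]au$' then domain end)) as c2,
       -- any domain belonging to the listed customers
       size(collect_set(case when cust_id in (1, 2, 3, 4) then domain end)) as c3,
       -- all distinct domains
       size(collect_set(domain)) as c4
from mytable
where year = 2018 and mon = 10 and day = 1
group by year, mon, day;
```

Because collect_set() skips NULLs, rows that fail a condition do not contribute to that count, so the table is scanned once instead of once per subquery. On the sample rows in the question this should give 2018, 10, 1, 4, 2, 6, 8. Where the engine accepts several distinct aggregates in one query, count(distinct case when ... then domain end) expresses the same idea.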
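The regex (?<!\.com)\.au$ uses a negative lookbehind assertion to check that the characters preceding .au are not .com. The $ metacharacter anchors the match, so .au must be the last 3 characters of the string. A literal . must be escaped with \.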
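To see the lookbehind in isolation, a quick check against the question's table might look like the following. It is only a sketch: the column aliases au_only and com_au are made up for illustration, and [.] is used in place of \. so that no backslash escaping is needed in the string literal.

```sql
-- Sketch only: shows which pattern each distinct domain in mytable matches.
select distinct domain,
       domain rlike '(?<![.]com)[.]au$' as au_only,  -- .au at the end, not preceded by .com
       domain rlike '[.]com[.]au$'      as com_au    -- .com.au at the end
from mytable;
```

For example, google.au should match only the first pattern, virgin.com.au only the second, and haha.com neither.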