我有一个由email、day(timestamp)、id和多个其他列组成的表。
对于每个电子邮件条目,我想计算与该电子邮件相关联的唯一ID在前3天的数量。
+-------------------+-------------------------+------------+----+
| email | day | other cols | id |
+-------------------+-------------------------+------------+----+
| user1@gmail.com | 2020-06-21 16:31:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-22 14:54:00 UTC | ... | 1 |
| user1@gmail.com | 2020-06-23 08:23:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-24 13:51:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-25 09:54:00 UTC | ... | 2 |
| user1@gmail.com | 2020-06-25 12:25:00 UTC | ... | 0 |
| user1@gmail.com | 2020-06-26 15:21:00 UTC | ... | 2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC | ... | 0 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC | ... | 1 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC | ... | 2 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC | ... | 1 |
+-------------------+-------------------------+------------+----+
附加列应如下所示:
+-------------------+-------------------------+------------+----+-----------------------------+
| email | day | other cols | id | distinct ids in last 3 days |
+-------------------+-------------------------+------------+----+-----------------------------+
| user1@gmail.com | 2020-06-21 16:31:00 UTC | ... | 0 | 1 |
| user1@gmail.com | 2020-06-22 14:54:00 UTC | ... | 1 | 2 |
| user1@gmail.com | 2020-06-23 08:23:00 UTC | ... | 0 | 2 |
| user1@gmail.com | 2020-06-24 13:51:00 UTC | ... | 0 | 2 |
| user1@gmail.com | 2020-06-25 09:54:00 UTC | ... | 2 | 3 |<- 3, because ids 0, 1 and 2 have been seen in previous 3 days
| user1@gmail.com | 2020-06-25 12:25:00 UTC | ... | 0 | 3 |
| user1@gmail.com | 2020-06-26 15:21:00 UTC | ... | 2 | 2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC | ... | 0 | 1 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC | ... | 1 | 2 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC | ... | 1 | 1 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC | ... | 1 | 1 |
+-------------------+-------------------------+------------+----+-----------------------------+
我已经尝试使用一个窗口函数来分区的电子邮件和计数不同的ID在过去3天。
COUNT(DISTINCT id) OVER (PARTITION BY email ORDER BY UNIX_DATE(PARSE_DATE('%Y-%m-%d', day))*24*3600 RANGE BETWEEN 3*24*3600 PRECEDING AND CURRENT ROW)
但是,这是不允许的:
Window ORDER BY is not allowed if DISTINCT is specified
有一些解决堆栈溢出的方法,比如这样。但是,我不确定它是否能解释在计算唯一ID之前通过电子邮件进行分区的需要。
如果你能给我指点,我会很感激的。如果更简单的话,我也愿意接受使用日期而不是时间戳的解决方案。
2条答案
按热度按时间7gyucuyw1#
下面是bigquery标准sql
您可以使用下面的示例中的问题中的示例数据来测试、播放上述内容
5uzkadbs2#
大多数(如果不是全部)数据库不支持
distinct
在窗口函数中。在bigquery中,通常使用窗口字符串或数组聚合来解决此问题:子查询聚合所有
id
在字符串中,使用string_agg()
作为窗口函数;然后,外部查询对字符串进行拆分和取消嵌套,并对不同的字符串进行计数id
s。