滚动日期范围中的bigquery非重复计数，列上有分区

pxiryf3j 于 2021-07-27 发布在 Java

关注(0)|答案(2)|浏览(499)

我有一个由email、day（timestamp）、id和多个其他列组成的表。
对于每个电子邮件条目，我想计算与该电子邮件相关联的唯一ID在前3天的数量。

+-------------------+-------------------------+------------+----+
|       email       |           day           | other cols | id |
+-------------------+-------------------------+------------+----+
| user1@gmail.com   | 2020-06-21 16:31:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-22 14:54:00 UTC |        ... |  1 |
| user1@gmail.com   | 2020-06-23 08:23:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-24 13:51:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-25 09:54:00 UTC |        ... |  2 |
| user1@gmail.com   | 2020-06-25 12:25:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-26 15:21:00 UTC |        ... |  2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC |        ... |  2 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC |        ... |  1 |
+-------------------+-------------------------+------------+----+

附加列应如下所示：

+-------------------+-------------------------+------------+----+-----------------------------+
|       email       |           day           | other cols | id | distinct ids in last 3 days |
+-------------------+-------------------------+------------+----+-----------------------------+
| user1@gmail.com   | 2020-06-21 16:31:00 UTC |        ... |  0 |                           1 |
| user1@gmail.com   | 2020-06-22 14:54:00 UTC |        ... |  1 |                           2 |
| user1@gmail.com   | 2020-06-23 08:23:00 UTC |        ... |  0 |                           2 |
| user1@gmail.com   | 2020-06-24 13:51:00 UTC |        ... |  0 |                           2 |
| user1@gmail.com   | 2020-06-25 09:54:00 UTC |        ... |  2 |                           3 |<- 3, because ids 0, 1 and 2 have been seen in previous 3 days
| user1@gmail.com   | 2020-06-25 12:25:00 UTC |        ... |  0 |                           3 |
| user1@gmail.com   | 2020-06-26 15:21:00 UTC |        ... |  2 |                           2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC |        ... |  1 |                           1 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC |        ... |  1 |                           1 |
+-------------------+-------------------------+------------+----+-----------------------------+

我已经尝试使用一个窗口函数来分区的电子邮件和计数不同的ID在过去3天。

COUNT(DISTINCT id) OVER (PARTITION BY email ORDER BY UNIX_DATE(PARSE_DATE('%Y-%m-%d', day))*24*3600 RANGE BETWEEN 3*24*3600 PRECEDING AND CURRENT ROW)

但是，这是不允许的：

Window ORDER BY is not allowed if DISTINCT is specified

有一些解决堆栈溢出的方法，比如这样。但是，我不确定它是否能解释在计算唯一ID之前通过电子邮件进行分区的需要。
如果你能给我指点，我会很感激的。如果更简单的话，我也愿意接受使用日期而不是时间戳的解决方案。

sql window-functions google-bigquery count Database

来源：https://stackoverflow.com/questions/63158386/bigquery-distinct-count-in-rolling-date-range-with-partition-on-column

2条答案

按热度按时间

7gyucuyw1#

下面是bigquery标准sql


# standardSQL

SELECT * EXCEPT(ids),
  (SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
  SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
  FROM `project.dataset.table`
  WINDOW preceding_days AS (
    PARTITION BY email 
    ORDER BY UNIX_DATE(DATE(day)) 
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  )
) t

您可以使用下面的示例中的问题中的示例数据来测试、播放上述内容


# standardSQL

WITH `project.dataset.table` AS (
  SELECT 'user1@gmail.com' email, TIMESTAMP'2020-06-21 16:31:00 UTC' day, '...' other_cols, 0 id UNION ALL
  SELECT 'user1@gmail.com', '2020-06-22 14:54:00 UTC', '...', 1 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-23 08:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-24 13:51:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-25 09:54:00 UTC', '...', 2 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-25 12:25:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-26 15:21:00 UTC', '...', 2 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-21 12:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-21 16:54:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-22 08:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-22 12:13:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-24 09:32:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-25 05:45:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-26 12:32:00 UTC', '...', 2 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-27 19:53:00 UTC', '...', 1 
)
SELECT * EXCEPT(ids),
  (SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
  SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
  FROM `project.dataset.table`
  WINDOW preceding_days AS (
    PARTITION BY email 
    ORDER BY UNIX_DATE(DATE(day)) 
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  )
) t

赞(0）回复(0）举报 2021-07-27

5uzkadbs2#

大多数（如果不是全部）数据库不支持 distinct 在窗口函数中。在bigquery中，通常使用窗口字符串或数组聚合来解决此问题：

select 
    t.* except(ids),
    (select count(distinct id) from unnest(split(ids)) as id) cnt_distinct_id
from (
    select 
        t.*,
        string_agg(id) over(
            partition by email 
            order by unix_date(parse_date('%y-%m-%d', day))*24*3600 
            range between 3 * 24 * 3600 preceding and current row 
        ) ids
    from mytable t
) t

子查询聚合所有 id 在字符串中，使用 string_agg() 作为窗口函数；然后，外部查询对字符串进行拆分和取消嵌套，并对不同的字符串进行计数 id s。

赞(0）回复(0）举报 2021-07-27

我来回答

滚动日期范围中的bigquery非重复计数，列上有分区

2条答案

相关问题

热门标签

最新问答