How to group by time bucket in ClickHouse and fill missing data with nulls/0s

hmae6n7t posted on 2021-07-15 in ClickHouse

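Suppose I have a given time range. To keep the explanation simple, let's take the whole of 2018. I want to query data from ClickHouse aggregated as a sum per quarter, so the result should be 4 rows.
The problem is that I only have data for two quarters, so `GROUP BY quarter` returns only two rows.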

SELECT
     toStartOfQuarter(created_at) AS time,
     sum(metric) metric
 FROM mytable
 WHERE
     created_at >= toDate(1514761200) AND created_at >= toDateTime(1514761200)
    AND
     created_at <= toDate(1546210800) AND created_at <= toDateTime(1546210800)
 GROUP BY time
 ORDER BY time

`1514761200` is `2018-01-01` and `1546210800` is `2018-12-31`. This returns:

time metric
2018-01-01 345
2018-04-01 123

I need:

time metric
2018-01-01 345
2018-04-01 123
2018-07-01 0
2018-10-01 0

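This is a simplified example. In the real use case the aggregation would be per 5 minutes rather than per quarter, and the GROUP BY would have at least one more attribute, as in `GROUP BY attribute1, time`, so the desired result is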

time metric attribute1
2018-01-01 345 1
2018-01-01 345 2
2018-04-01 123 1
2018-04-01 123 2
2018-07-01 0 1
2018-07-01 0 2
2018-10-01 0 1
2018-10-01 0 2

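Is there a way to fill the whole given interval? Something like InfluxDB's `fill` argument for GROUP BY, or TimescaleDB's `time_bucket()` function combined with `generate_series()`. I searched the ClickHouse documentation and GitHub issues, but this does not seem to be implemented yet, so the question is perhaps whether there is any workaround.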

gzszwxb4 #1

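You can generate zero-valued rows with the `numbers` table function, combine them with your query using UNION ALL, and then run the GROUP BY over the combined result.
So your query will look like this: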

SELECT SUM(metric),
       time
  FROM (
        -- zero rows: one per 30-day step; 12 steps span 2018,
        -- so every quarter of the year appears at least once
        SELECT toStartOfQuarter(toDate(1514761200 + number * 30 * 24 * 3600)) AS time,
               toUInt16(0) AS metric
          FROM numbers(12)

     UNION ALL

        -- real data, truncated to the start of its quarter
        SELECT toStartOfQuarter(created_at) AS time,
               metric
          FROM mytable
         WHERE created_at >= toDate(1514761200)
           AND created_at >= toDateTime(1514761200)
           AND created_at <= toDate(1546210800)
           AND created_at <= toDateTime(1546210800)
       )
 GROUP BY time
 ORDER BY time

Note: `toUInt16(0)` is used because the zero values must be of the same type as `metric`.
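
The same UNION ALL idea can be sketched for the extra grouping attribute mentioned in the question. This is only an illustration under assumptions: the aliases `buckets`, `attrs` and `metric_sum` are made-up names, `attribute1` is assumed to be a column of `mytable`, `created_at` is assumed to be a DateTime, and `toUInt64(0)` assumes `metric` is an unsigned integer column. The zero rows are built as a cross product of all quarter starts and all distinct attribute values, so every (attribute1, time) pair appears at least once.

```sql
SELECT
    attribute1,
    time,
    sum(metric) AS metric_sum
FROM
(
    -- zero-valued grid: each quarter start of 2018 paired with each attribute value
    SELECT
        attrs.attribute1 AS attribute1,
        addQuarters(toDate('2018-01-01'), buckets.number) AS time,
        toUInt64(0) AS metric
    FROM (SELECT number FROM numbers(4)) AS buckets
    CROSS JOIN (SELECT DISTINCT attribute1 FROM mytable) AS attrs

    UNION ALL

    -- real data, truncated to the start of its quarter
    SELECT
        attribute1,
        toStartOfQuarter(created_at) AS time,
        metric
    FROM mytable
    WHERE created_at >= toDateTime('2018-01-01 00:00:00')
      AND created_at <  toDateTime('2019-01-01 00:00:00')
)
GROUP BY attribute1, time
ORDER BY time, attribute1
```

For 5-minute buckets the same shape should carry over: a DateTime start, a larger `numbers(N)`, and an offset in seconds instead of `addQuarters`.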

py49o6xq #2

Starting with ClickHouse 19.14 you can use the `WITH FILL` clause. It can fill in the missing quarters like this:

WITH
    (
        SELECT toRelativeQuarterNum(toDate('1970-01-01'))
    ) AS init
SELECT
    -- build the date from the relative quarter number
    toDate('1970-01-01') + toIntervalQuarter(q - init) AS time,
    metric
FROM
(
    SELECT
        toRelativeQuarterNum(created_at) AS q,
        sum(rand()) AS metric
    FROM
    (
        -- generate some dates and metrics values with gaps
        SELECT toDate(arrayJoin(range(1514761200, 1546210800, ((60 * 60) * 24) * 180))) AS created_at
    )
    GROUP BY q
    ORDER BY q ASC WITH FILL FROM toRelativeQuarterNum(toDate(1514761200)) TO toRelativeQuarterNum(toDate(1546210800)) STEP 1
)

┌───────time─┬─────metric─┐
│ 2018-01-01 │ 2950782089 │
│ 2018-04-01 │ 2972073797 │
│ 2018-07-01 │          0 │
│ 2018-10-01 │  179581958 │
└────────────┴────────────┘
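
For the 5-minute buckets mentioned in the question, `WITH FILL` can also be applied directly to a DateTime key, where a numeric `STEP` is interpreted as seconds. The following is a minimal sketch, not taken from the answer above: the inline subquery stands in for `mytable` with three made-up rows, and `metric_sum` is a hypothetical alias.

```sql
SELECT
    toStartOfFiveMinute(created_at) AS time,
    sum(metric) AS metric_sum
FROM
(
    -- stand-in for mytable: rows at 00:00:00, 00:01:00 and 00:20:00
    SELECT
        toDateTime('2018-01-01 00:00:00') + offset AS created_at,
        1 AS metric
    FROM (SELECT arrayJoin([0, 60, 1200]) AS offset)
)
GROUP BY time
ORDER BY time ASC
    WITH FILL
        FROM toDateTime('2018-01-01 00:00:00')
        TO   toDateTime('2018-01-01 00:30:00')
        STEP 300  -- 300 seconds = 5 minutes
```

Rows added by the fill carry the default value of each non-key column, which is 0 for a numeric sum. The range and step here are illustrative; in a real query they would match the bounds used in the `WHERE` clause.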

czfnxgou #3

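As an alternative to `numbers()`, the `range` array function can also be useful in some cases.
Example: for each pair (id1, id2), the dates of the previous 7 days should be generated.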

SELECT
  id1,
  id2,
  arrayJoin(
    arrayMap( x -> today() - 7 + x, range(7) )
  ) as date2
FROM table
WHERE date >= today() - 7
GROUP BY id1, id2

The result of this select can then be used in a UNION ALL to fill the "holes" in the data.

SELECT id1, id2, date, sum(column1)
FROM (
  SELECT
    id1,
    id2,
    date,
    column1 
  FROM table
  WHERE date >= today() - 7

  UNION ALL 

  SELECT
    id1,
    id2,
    arrayJoin(
      arrayMap( x -> today() - 7 + x, range(7) )
    ) as date2,
    0 as column1
  FROM table
  WHERE date >= today() - 7
  GROUP BY id1, id2
)
GROUP BY id1, id2, date
ORDER BY date, id1, id2
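
To see what the generator expression produces on its own, it can be run without any table. A small sketch (the alias `d` is arbitrary): it should return seven rows, one per day from `today() - 7` up to `today() - 1`.

```sql
-- range(7) yields [0, 1, ..., 6]; each offset is added to the date seven days ago
SELECT arrayJoin(arrayMap(x -> today() - 7 + x, range(7))) AS d
ORDER BY d
```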

2nbm6dog #4

Here is how I did it for hourly buckets (I needed to visualize this in Grafana), thanks to @filimonov and @mikhail:

SELECT t, SUM(metric) as metric FROM (
    SELECT 
        arrayJoin(
          arrayMap( x -> toStartOfHour(addHours(toDateTime($from),x)),
              range(toUInt64(
                  dateDiff('hour', 
                      toDateTime($from), 
                      toDateTime($to)) + 1)))
        ) as t,
        0 as metric

    UNION ALL

    SELECT
        toStartOfHour(my_date) as t,
        COUNT(metric)
        FROM my_table
        WHERE t BETWEEN toDateTime($from) AND toDateTime($to)
        GROUP BY t
)
GROUP BY t ORDER BY t

For example, for the range 2019-01-01 to 2019-01-02, it gives you:

SELECT t, SUM(metric) as metric FROM (
    SELECT 
        arrayJoin(
          arrayMap( x -> toStartOfHour(addHours(toDateTime('2019-01-01 00:00:00'),x)),
              range(toUInt64(
                  dateDiff('hour', 
                      toDateTime('2019-01-01 00:00:00'), 
                      toDateTime('2019-01-02 00:00:00')) + 1)))
        ) as t,
        0 as metric

    UNION ALL

    SELECT
        toStartOfHour(my_date) as t,
        COUNT(1) as metric
        FROM my_table
        WHERE t BETWEEN toDateTime('2019-01-01 00:00:00') AND toDateTime('2019-01-02 00:00:00')
        GROUP BY t
)
GROUP BY t ORDER BY t;
t                  |metric|
-------------------|------|
2019-01-01 00:00:00|     0|
2019-01-01 01:00:00|     0|
2019-01-01 02:00:00|     0|
2019-01-01 03:00:00|     0|
2019-01-01 04:00:00|     0|
2019-01-01 05:00:00|     0|
2019-01-01 06:00:00|     0|
2019-01-01 07:00:00|105702|
2019-01-01 08:00:00|113315|
2019-01-01 09:00:00|149837|
2019-01-01 10:00:00|185314|
2019-01-01 11:00:00|246106|
2019-01-01 12:00:00|323036|
2019-01-01 13:00:00|     0|
2019-01-01 14:00:00|409160|
2019-01-01 15:00:00|379113|
2019-01-01 16:00:00|256634|
2019-01-01 17:00:00|286601|
2019-01-01 18:00:00|280039|
2019-01-01 19:00:00|248504|
2019-01-01 20:00:00|218642|
2019-01-01 21:00:00|186152|
2019-01-01 22:00:00|148478|
2019-01-01 23:00:00|109721|
2019-01-02 00:00:00|     0|
