postgresql—按行在sql中的第一个值对一系列行进行分组

z6psavjg  于 2021-07-24  发布在  Java
关注(0)|答案(2)|浏览(342)

如何根据sql中每个序列的第一个值对序列数据集进行分组?
例如,我有以下数据集

id  name  key  metric
1   alice a    0   <- key = 'a', start of a sequence
2   alice b    1
3   alice b    1
-----------------
4   alice a    1   <- key = 'a', start of a sequence
5   alice b    0
6   alice b    0
7   alice b    0
-----------------
8   bob   a    1   <- key = 'a', start of a sequence
9   bob   b    1
-----------------
10  bob   a    0   <- key = 'a', start of a sequence

带的行 key = 'a' 创建新组。例如,我想对所有后续行的度量求和,直到到达另一行 key = 'a' 或者别的 name .
数据集按 id .
最终结果应该是:

id  name   metric
1   alice  2
4   alice  1
8   bob    2
10  bob    0

下面是javascript中的等效操作,但我希望能够通过sql查询得到相同的结果。

data.reduce((acc, a) => {
    if(a.key === 'a'){
      // key = 'a' starts a new group
      return [{id: a.id, name: a.name, metric: a.metric}].concat(acc)
    } else {
      // because the data is sorted, 
      // all the subsequent rows with key = 'b' belong to the latest group
      const [head, ...tail] = acc
      const head_updated = {...head, metric: head.metric + a.metric}
      return [head_updated, ...tail]
    }
  }, [])
  .reverse()

示例sql数据集:

with dataset as (
  select 
    1       as id
  , 'alice' as name
  , 'a'     as key
  , 0       as metric
  union select
    2       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select
    3       as id
  , 'alice' as name
  , 'b'     as key
  , 1       as metric
  union select 
    4       as id
  , 'alice' as name
  , 'a'     as key
  , 1       as metric
  union select
    5       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    6       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    7       as id
  , 'alice' as name
  , 'b'     as key
  , 0       as metric
  union select
    8       as id
  , 'bob'   as name
  , 'a'     as key
  , 1       as metric
  union select
    9       as id
  , 'bob'   as name
  , 'b'     as key
  , 1       as metric
  union select
    10      as id
  , 'bob'   as name
  , 'a'     as key
  , 0       as metric
)

select * from dataset
order by name, id
rkkpypqq

rkkpypqq1#

你可以使用窗口功能 sum() 要创建组并进行聚合,请执行以下操作:

select min(id) id, name, sum(metric) metric
from (
  select *, sum((key = 'a')::int) over (partition by name order by id) grp 
  from dataset
) t
group by name, grp
order by id

请看演示。
结果:

> id | name  | metric
> -: | :---- | -----:
>  1 | alice |      2
>  4 | alice |      1
>  8 | bob   |      2
> 10 | bob   |      0
inkz8wg9

inkz8wg92#

根据op在评论中写的内容,查询必须是这样的:

SELECT MAX(t.head_id) AS id,
       t.head_name AS name,
       SUM(t.metric) AS metric
FROM (
    SELECT SUM(CASE WHEN key = 'a' THEN 1 END) OVER (PARTITION BY name ORDER BY id) AS group_id,
           CASE WHEN key = 'a' THEN id END AS head_id,
           name AS head_name,
           metric
    FROM dataset
) t
GROUP BY t.head_name, t.group_id

但是,如果您可以按名称和id添加索引,则可以真正提高查询的性能。这是因为它在聚合之前不需要排序操作。
使用一个有一百万行的表进行测试,这是explain analyze的输出,不带索引:

HashAggregate  (cost=177154.34..177158.34 rows=400 width=25) (actual time=3374.878..3489.755 rows=400000 loops=1)
  Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=132154.34..157154.34 rows=1000000 width=25) (actual time=1920.338..3000.218 rows=1000000 loops=1)
        ->  Sort  (cost=132154.34..134654.34 rows=1000000 width=15) (actual time=1920.323..2232.936 rows=1000000 loops=1)
              Sort Key: dataset.name, dataset.id
              Sort Method: external merge  Disk: 28192kB
              ->  Seq Scan on dataset  (cost=0.00..15406.00 rows=1000000 width=15) (actual time=0.020..172.746 rows=1000000 loops=1)

Planning Time: 0.870 ms
Execution Time: 3516.726 ms

通过创建索引,查询计划将更改为以下内容:
索引:

CREATE INDEX dataset__name_id__idx ON dataset(name, id);

查询计划:

HashAggregate  (cost=90169.90..90173.90 rows=400 width=25) (actual time=1464.759..1567.778 rows=400000 loops=1)
  Group Key: dataset.name, sum(CASE WHEN (dataset.key = 'a'::text) THEN 1 ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=0.42..70169.90 rows=1000000 width=25) (actual time=0.033..1077.362 rows=1000000 loops=1)
        ->  Index Scan using dataset__name_id__idx on dataset  (cost=0.42..47669.90 rows=1000000 width=15) (actual time=0.022..225.445 rows=1000000 loops=1)

Planning Time: 0.131 ms
Execution Time: 1590.040 ms

古老的答案

基于javascript代码,您不希望按 name ,或分组依据 name 在外部查询中。否则,您实际上会以一个更好的查询结束,该查询只允许您使用主索引,假设 id 列已编入索引。

SELECT t.head_id AS id,
       MAX(t.head_name) AS name,
       SUM(t.metric) AS metric
FROM (
        SELECT MAX(CASE WHEN key = 'a' THEN id END) OVER (ORDER BY id) AS head_id,
               CASE WHEN key = 'a' THEN name END AS head_name,
               metric
        FROM dataset
    ) t
GROUP BY t.head_id

下面是一个 dataset 有100万行:

HashAggregate  (cost=68889.43..68891.43 rows=200 width=44) (actual time=1277.469..1393.709 rows=400000 loops=1)
  Group Key: max(CASE WHEN (dataset.key = 'a'::text) THEN dataset.id ELSE NULL::integer END) OVER (?)
  ->  WindowAgg  (cost=0.42..51389.43 rows=1000000 width=44) (actual time=0.025..927.595 rows=1000000 loops=1)
        ->  Index Scan using dataset_pkey on dataset  (cost=0.42..31389.42 rows=1000000 width=15) (actual time=0.017..209.657 rows=1000000 loops=1)

Planning Time: 0.127 ms
Execution Time: 1411.975 ms

相关问题