postgresql 使用Order By计算分区中的行数

s3fp2yjn  于 2023-03-01  发布在  PostgreSQL
关注(0)|答案(3)|浏览(147)

我试图通过编写一些示例查询来理解postgres中的PARTITION BY。我有一个测试表,我在上面运行了我的查询。

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

当我运行下面的查询时,我得到了预期的输出。

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

但是,当我向分区添加ORDER BY时,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

我的理解是,COUNT是对属于一个分区的所有行进行计算的。在这里,我按num对行进行了分区。分区中的行数是相同的,不管有没有ORDER BY子句。为什么输出会有差异?

xoefb8l8

xoefb8l81#

当您将order by添加到用作窗口函数的聚合时,该聚合将变成“运行计数”(或您使用的任何聚合)。
count(*)将根据指定的顺序返回“当前行”之前的行数。
下面的查询显示了使用order by的聚合的不同结果。使用sum()而不是count(),更容易看到(在我看来)。

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, -- NB over () is needed
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;

将导致:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

Postgres manual中有更多示例

fd3cxomn

fd3cxomn2#

你的两个表达式是:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

为什么你会期望它们返回相同的值呢?语法不同是有原因的。
第一个函数返回每个num的总计数--本质上是将聚合值连接回来。
第二个函数进行累积计数,它对id的每一行以及id之前的所有值进行COUNT()计数。
请注意,此类累计计数通常使用RANK()(或相关函数)实现。累计计数与RANK()略有不同。累计计数实现:

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

RANK()稍有不同,只有当ORDER BY键有关系时,这种差异才有意义。

os8fio9y

os8fio9y3#

“为什么”已经被其他人解释过了。有时你有一个有序的窗口,尽管有一个ORDER BY,你还是必须对整个分区进行计数。
为此,请使用RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING的无界范围

create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null,
);

SELECT query,
       created as seen_on,
       first_value(created) OVER query_window as last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurence
FROM search_log l
     WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC 
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

相关问题