postgresql 具有GROUP BY和MAX的子查询将不使用索引

wj8zmpe1  于 2023-08-04  发布在  PostgreSQL
关注(0)|答案(1)|浏览(228)

我在PostgreSQL 15.3中有这个表(对应于Django模型):

Table "public.myapp1_task"
         Column          |           Type           | Collation | Nullable |                     Default
-------------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                      | bigint                   |           | not null | nextval('myapp1_task_id_seq'::regclass)
 created_at              | timestamp with time zone |           | not null |
 updated_at              | timestamp with time zone |           | not null |
 kind                    | character varying(12)    |           | not null |
 status                  | character varying(12)    |           | not null |
 environment             | character varying(7)     |           | not null |
 data                    | jsonb                    |           | not null |
 result                  | jsonb                    |           | not null |
 sent_at                 | timestamp with time zone |           |          |
 response_at             | timestamp with time zone |           |          |
 priority                | smallint                 |           | not null |
 sequence                | integer                  |           |          |
 result_attachment       | character varying(100)   |           | not null |
 taxes                   | jsonb                    |           | not null |
 myapp2_item_id          | bigint                   |           |          |
 source                  | character varying(8)     |           | not null |
 user_id                 | bigint                   |           |          |
 custom_actions          | jsonb                    |           | not null |
 
Indexes:
    "myapp1_task_pkey" PRIMARY KEY, btree (id)
    "myapp1_task_user_id_76a104e9" btree (user_id)
    "myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
    "sequence_idx" btree (sequence DESC NULLS LAST)
    "sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)

字符串

Goals:对于每个myapp2_item_id,查找序列最高的行。

我添加了与sequence列相关的最后两个索引。
使用Django ORM,我试图过滤一个查询集,代码如下:

queryset = Task.objects.all()
sequences = queryset.filter(item=OuterRef("item")).exclude(sequence__isnull=True).order_by("-sequence").distinct().values("sequence")
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)


它会将其转换成这个SQL语句。请注意具有group bymax聚合的子查询:

SELECT "myapp1_task"."id"
FROM "myapp1_task"
         LEFT OUTER JOIN "myapp2_item"
                         ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
         LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
         LEFT OUTER JOIN "myapp2_category"
                         ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
         LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
                                          FROM (
                                          SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
                                                FROM "myapp1_task" U0
                                                WHERE (U0."myapp2_item_id" =
                                                       ("myapp1_task"."myapp2_item_id"))
                                                GROUP BY U0."sequence"
                                                ORDER BY U0."sequence" DESC
                                                LIMIT 1) subquery)


可悲的是,它在相当大的表(>1M行)上非常慢。检查explain结果,我在子查询上得到了这个-> seq scan,所以没有使用任何新索引:

Seq Scan on myapp1_task  (cost=0.00..5525.25 rows=3 width=8)
  Filter: (sequence = (SubPlan 1))
  SubPlan 1
    ->  Subquery Scan on subquery  (cost=8.30..8.33 rows=1 width=4)
          ->  Limit  (cost=8.30..8.32 rows=1 width=8)
                ->  GroupAggregate  (cost=8.30..8.32 rows=1 width=8)
                      Group Key: u0.sequence
                      ->  Sort  (cost=8.30..8.31 rows=1 width=4)
                            Sort Key: u0.sequence DESC
                            ->  Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0  (cost=0.28..8.29 rows=1 width=4)
                                  Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)


不知道我做错了什么。如何改善这种情况?

0yg35tkg

0yg35tkg1#

您或您的ORM(或两者)已经扭曲和混淆了SQL语句,以至于任何RDBMS都很难从中提取有效的查询计划。在删除了大量的cruft之后,(等效的)语句如下:

SELECT t.id
FROM   myapp1_task t
LEFT   JOIN myapp2_item i     ON t.myapp2_item_id = i.id
LEFT   JOIN myapp2_user iu    ON i.user_id = iu.id
LEFT   JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT   JOIN myapp2_user tu    ON t.user_id = tu.id
WHERE  t.sequence = (
   SELECT t1.sequence
   FROM   myapp1_task t1
   WHERE  t1.myapp2_item_id = t.myapp2_item_id
   ORDER  BY t1.sequence DESC
   LIMIT  1
   );

字符串
GROUP BYMAX在原始中是无用的噪声。
WHERE子句中的相关子查询从myapp1_task中过滤行,其中sequence在具有相同myapp2_item_id的行中以降序排序第一-以非常昂贵的方式。由于您特殊的查询和表定义,任何myapp2_item_idsequencenull的行,或者任何其他具有相同myapp2_item_idsequence IS NULL的行都将被删除。
所有LEFT JOIN行都只是噪声,因为SELECT列表无论如何只返回myapp1_task.id。这些连接的唯一可能的效果是,如果左侧有重复项,则会增加行,这似乎是不太可能的奋进。

溶液

你后来补充说:

**目标:**对于每个myapp2_item_id,找到具有最高序列的行。

仍然没有阐明如何处理null值。也不知道如何处理重复。也不知道该返回什么。这些都很重要

假设:

  • 组合(myapp2_item_id, sequence)实际上是UNIQUE
  • 你想要最高的非空序列。
  • 您希望返回整行(SELECT *)-这通常是浪费的废话。

然后查询归结为(!):

SELECT DISTINCT ON (myapp2_item_id) *
FROM   myapp1_task
ORDER  BY myapp2_item_id, sequence DESC NULLS LAST;


请参阅:

  • 是否选择每个GROUP BY组中的第一行?

关于DESC NULLS LAST

  • 按列ASC排序,但首先是NULL值?

索引

此查询的最佳索引是具有匹配排序顺序leadingmyapp2_item_id的多列索引:(myapp2_item_id, sequence DESC NULLS LAST) .
对于myapp2_item_id中每个值只有几行的情况,索引不会有太大帮助,尤其是对于SELECT *。顺序扫描可以一样快或更快。随着每组行数的增加,索引变得更加有用。对于较大的数字,特殊的查询技术是上级的。请参阅:

  • SELECT DISTINCT比PostgreSQL中我的表的预期速度慢
  • 优化GROUP BY查询以检索每个用户的最新行
  • 是否选择每个GROUP BY组中的第一行?

Postgres 16将于2023年底在该领域进行一些性能优化。

相关问题