我在PostgreSQL 15.3中有这个表(对应于Django模型):
Table "public.myapp1_task"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+-------------------------------------------------
id | bigint | | not null | nextval('myapp1_task_id_seq'::regclass)
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
kind | character varying(12) | | not null |
status | character varying(12) | | not null |
environment | character varying(7) | | not null |
data | jsonb | | not null |
result | jsonb | | not null |
sent_at | timestamp with time zone | | |
response_at | timestamp with time zone | | |
priority | smallint | | not null |
sequence | integer | | |
result_attachment | character varying(100) | | not null |
taxes | jsonb | | not null |
myapp2_item_id | bigint | | |
source | character varying(8) | | not null |
user_id | bigint | | |
custom_actions | jsonb | | not null |
Indexes:
"myapp1_task_pkey" PRIMARY KEY, btree (id)
"myapp1_task_user_id_76a104e9" btree (user_id)
"myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
"sequence_idx" btree (sequence DESC NULLS LAST)
"sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)
字符串
Goals:对于每个myapp2_item_id
,查找序列最高的行。
我添加了与sequence
列相关的最后两个索引。
使用Django ORM,我试图过滤一个查询集,代码如下:
queryset = Task.objects.all()
sequences = queryset.filter(item=OuterRef("item")).exclude(sequence__isnull=True).order_by("-sequence").distinct().values("sequence")
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)
型
它会将其转换成这个SQL语句。请注意具有group by
和max
聚合的子查询:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" =
("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
型
可悲的是,它在相当大的表(>1M行)上非常慢。检查explain
结果,我在子查询上得到了这个-> seq scan
,所以没有使用任何新索引:
Seq Scan on myapp1_task (cost=0.00..5525.25 rows=3 width=8)
Filter: (sequence = (SubPlan 1))
SubPlan 1
-> Subquery Scan on subquery (cost=8.30..8.33 rows=1 width=4)
-> Limit (cost=8.30..8.32 rows=1 width=8)
-> GroupAggregate (cost=8.30..8.32 rows=1 width=8)
Group Key: u0.sequence
-> Sort (cost=8.30..8.31 rows=1 width=4)
Sort Key: u0.sequence DESC
-> Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0 (cost=0.28..8.29 rows=1 width=4)
Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)
型
不知道我做错了什么。如何改善这种情况?
1条答案
按热度按时间0yg35tkg1#
您或您的ORM(或两者)已经扭曲和混淆了SQL语句,以至于任何RDBMS都很难从中提取有效的查询计划。在删除了大量的cruft之后,(等效的)语句如下:
字符串
(
GROUP BY
和MAX
在原始中是无用的噪声。WHERE
子句中的相关子查询从myapp1_task
中过滤行,其中sequence
在具有相同myapp2_item_id
的行中以降序排序第一-以非常昂贵的方式。由于您特殊的查询和表定义,任何myapp2_item_id
或sequence
是null
的行,或者任何其他具有相同myapp2_item_id
和sequence IS NULL
的行都将被删除。所有
LEFT JOIN
行都只是噪声,因为SELECT
列表无论如何只返回myapp1_task.id
。这些连接的唯一可能的效果是,如果左侧有重复项,则会增加行,这似乎是不太可能的奋进。溶液
你后来补充说:
**目标:**对于每个
myapp2_item_id
,找到具有最高序列的行。仍然没有阐明如何处理
null
值。也不知道如何处理重复。也不知道该返回什么。这些都很重要假设:
(myapp2_item_id, sequence)
实际上是UNIQUE
。SELECT *
)-这通常是浪费的废话。然后查询归结为(!):
型
请参阅:
关于
DESC NULLS LAST
:索引
此查询的最佳索引是具有匹配排序顺序leading
myapp2_item_id
的多列索引:(myapp2_item_id, sequence DESC NULLS LAST)
.对于
myapp2_item_id
中每个值只有几行的情况,索引不会有太大帮助,尤其是对于SELECT *
。顺序扫描可以一样快或更快。随着每组行数的增加,索引变得更加有用。对于较大的数字,特殊的查询技术是上级的。请参阅:Postgres 16将于2023年底在该领域进行一些性能优化。