为什么PostgreSQL在这种情况下不使用索引?

bgibtngc  于 2023-05-22  发布在  PostgreSQL
关注(0)|答案(2)|浏览(161)

我在PostgreSQL 14.6中有几个表,索引如下所示:

# \d uprn
                        Table "public.uprn"
  Column   |         Type         | Collation | Nullable | Default
-----------+----------------------+-----------+----------+---------
 uprn      | text                 |           |          |
 postcode  | text                 |           |          |
 location  | geometry(Point,4326) |           |          |
Indexes:
    "uprn_location" gist (location)
    "uprn_postcode" btree (postcode)
    "uprn_uprn" btree (uprn)

# \d geometry_subdivided
      Table "public.geometry_subdivided"
 Column  |   Type   | Collation | Nullable | Default
---------+----------+-----------+----------+---------
 area_id | integer  |           |          |
 geom    | geometry |           |          |
Indexes:
    "subdivided_area_id" btree (area_id)
    "subdivided_geom_id" gist (geom)

这几行(VACUUM ANALYZE已经在导入后完成):

# select count(*) from uprn;
  count
----------
 32872945
(1 row)

# select count(*) from geometry_subdivided;
 count
--------
 938500
(1 row)

uprn行中的23行具有“M19 1TF”的postcode。如果我在该邮政编码上运行一个连接这两个表的查询,寻找覆盖这些点的区域,使用序列扫描需要几秒钟:

# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6287745.55..6292743.36 rows=16428 width=12) (actual time=6273.505..6290.003 rows=16 loops=1)
   Group Key: g.area_id
   ->  Gather Merge  (cost=6287745.55..6292414.80 rows=32856 width=12) (actual time=6273.497..6289.990 rows=48 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=6286745.52..6287622.38 rows=16428 width=12) (actual time=6254.442..6254.452 rows=16 loops=3)
               Group Key: g.area_id
               ->  Sort  (cost=6286745.52..6286983.05 rows=95010 width=4) (actual time=6254.431..6254.434 rows=69 loops=3)
                     Sort Key: g.area_id
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 33kB
                     Worker 1:  Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=25.29..6278890.20 rows=95010 width=4) (actual time=4756.836..6254.376 rows=69 loops=3)
                           ->  Parallel Seq Scan on uprn u  (cost=0.00..1264850.55 rows=101 width=32) (actual time=4725.962..6221.730 rows=8 loops=3)
                                 Filter: (postcode = 'M19 1TF'::text)
                                 Rows Removed by Filter: 10957641
                           ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49643.02 rows=94 width=2040) (actual time=0.102..0.253 rows=9 loops=23)
                                 Index Cond: (geom ~ st_transform(u.location, 27700))
                                 Filter: st_covers(geom, st_transform(u.location, 27700))
                                 Rows Removed by Filter: 7
 Planning Time: 0.359 ms
 Execution Time: 6290.100 ms
(22 rows)

但是如果我不鼓励PostgreSQL使用序列扫描,同样的查询需要毫秒:

# set enable_seqscan to off;
SET
# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12076657.05..12078536.73 rows=17175 width=12) (actual time=7.710..7.766 rows=16 loops=1)
   Group Key: g.area_id
   ->  Sort  (cost=12076657.05..12077226.36 rows=227724 width=4) (actual time=7.681..7.714 rows=207 loops=1)
         Sort Key: g.area_id
         Sort Method: quicksort  Memory: 34kB
         ->  Nested Loop  (cost=31.60..12053278.12 rows=227724 width=4) (actual time=0.203..7.603 rows=207 loops=1)
               ->  Bitmap Heap Scan on uprn u  (cost=6.31..966.53 rows=242 width=32) (actual time=0.035..0.073 rows=23 loops=1)
                     Recheck Cond: (postcode = 'M19 1TF'::text)
                     Heap Blocks: exact=10
                     ->  Bitmap Index Scan on uprn_postcode  (cost=0.00..6.25 rows=242 width=0) (actual time=0.024..0.025 rows=23 loops=1)
                           Index Cond: (postcode = 'M19 1TF'::text)
               ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49802.00 rows=94 width=2038) (actual time=0.116..0.322 rows=9 loops=23)
                     Index Cond: (geom ~ st_transform(u.location, 27700))
                     Filter: st_covers(geom, st_transform(u.location, 27700))
                     Rows Removed by Filter: 7
 Planning Time: 0.259 ms
 Execution Time: 7.851 ms
(17 rows)

为什么PostgreSQL在普通查询中不使用索引,有没有一种方法可以改写查询来做到这一点?我试过很多种组合,但到目前为止都没有成功

h79rfbju

h79rfbju1#

我尝试过的一种可能性是将uprn表查找移动到表定义中(因为select * from uprn where postcode='M19 1TF'本身使用索引),但这没有什么区别:

# explain analyze select area_id,count(*) from (select * from uprn where postcode='M19 1TF') u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6303711.22..6308902.95 rows=17175 width=12) (actual time=7348.913..7353.635 rows=16 loops=1)
   Group Key: g.area_id
   ->  Gather Merge  (cost=6303711.22..6308559.45 rows=34350 width=12) (actual time=7348.908..7353.626 rows=34 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=6302711.20..6303594.59 rows=17175 width=12) (actual time=7329.626..7329.635 rows=11 loops=3)
               Group Key: g.area_id
               ->  Sort  (cost=6302711.20..6302948.41 rows=94885 width=4) (actual time=7329.615..7329.619 rows=69 loops=3)
                     Sort Key: g.area_id
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 25kB
                     Worker 1:  Sort Method: quicksort  Memory: 33kB
                     ->  Nested Loop  (cost=25.29..6294867.11 rows=94885 width=4) (actual time=5698.604..7329.564 rows=69 loops=3)
                           ->  Parallel Seq Scan on uprn  (cost=0.00..1264770.12 rows=101 width=32) (actual time=5669.212..7298.298 rows=8 loops=3)
                                 Filter: (postcode = 'M19 1TF'::text)
                                 Rows Removed by Filter: 10957641
                           ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49802.00 rows=94 width=2038) (actual time=0.109..0.270 rows=9 loops=23)
                                 Index Cond: (geom ~ st_transform(uprn.location, 27700))
                                 Filter: st_covers(geom, st_transform(uprn.location, 27700))
                                 Rows Removed by Filter: 7
 Planning Time: 0.460 ms
 Execution Time: 7353.724 ms

但是,如果我向内部查询添加ORDER BY,它就会使用索引!

# explain analyze select area_id,count(*) from (select * from uprn where postcode='M19 1TF' order by postcode) u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12052047.47..12053927.15 rows=17175 width=12) (actual time=7.878..7.921 rows=16 loops=1)
   Group Key: g.area_id
   ->  Sort  (cost=12052047.47..12052616.78 rows=227724 width=4) (actual time=7.861..7.881 rows=207 loops=1)
         Sort Key: g.area_id
         Sort Method: quicksort  Memory: 34kB
         ->  Nested Loop  (cost=31.60..12028668.54 rows=227724 width=4) (actual time=0.328..7.778 rows=207 loops=1)
               ->  Bitmap Heap Scan on uprn  (cost=6.31..966.53 rows=242 width=104) (actual time=0.064..0.102 rows=23 loops=1)
                     Recheck Cond: (postcode = 'M19 1TF'::text)
                     Heap Blocks: exact=10
                     ->  Bitmap Index Scan on uprn_postcode  (cost=0.00..6.25 rows=242 width=0) (actual time=0.045..0.046 rows=23 loops=1)
                           Index Cond: (postcode = 'M19 1TF'::text)
               ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49700.30 rows=94 width=2038) (actual time=0.118..0.327 rows=9 loops=23)
                     Index Cond: (geom ~ st_transform(uprn.location, 27700))
                     Filter: st_covers(geom, st_transform(uprn.location, 27700))
                     Rows Removed by Filter: 7
 Planning Time: 0.390 ms
 Execution Time: 8.027 ms
(17 rows)

这回答了我的问题,使它的性能,这是主要的关注,虽然我不明白为什么PostgreSQL不使用索引摆在首位。

uqcuzwp8

uqcuzwp82#

在一个较浅的层次上,很容易看到正在发生的事情。它认为几何索引扫描是计划中开销最大的部分,并且它认为使用seq扫描是并行执行索引扫描的关键。因此,seq扫描的较高成本被认为是值得的,以获得并行化。所以,如果你通常没有从并行化中获得太多好处,这个问题的一个简单的解决方案是设置max_parallel_workers_per_gather = 0。
在更深的层次上,更难看到。为什么它认为序列扫描是平行计划的关键?“并行位图堆扫描”在这个版本中确实存在,那么为什么不使用它呢?我认为这是因为该表中的预期行数足够低(242),位图不能轻易地进行并行化划分,而seq扫描更容易划分。(我没有查看源代码来验证这个理论)。如果预期的行数更高,它将使用“并行位图堆扫描”,如果它低得多,它不会认为首先将其并行化是重要的。
因此,另一种可能的解决方案是修复估计问题,即它认为邮政编码将找到242行,但只找到23行。您应该能够通过增加default_statistics_target,或仅增加该列的统计参数,然后重新执行ANALYZE来实现这一点。但我不确定仅仅把它减少到23行就足以放弃平行计划。
行估计问题的另一半是几何索引的每次迭代都认为它会找到94行,但实际上只找到9行。但是,您对此无能为力,因为这种类型的几何索引扫描不会查看数据分布,它总是假设它将返回表的1/10000。
至于你对子选择的尝试,计划者“看穿”了这个公式,所以提出了同样的计划。通常,这种透视会导致更好的计划,但在这里,它会抑制你强制实施更好计划的尝试。为了避免这种情况,您可以使用实体化的CTE来禁止这种计划器透视。

with u as materialized (select * from uprn where postcode='M19 1TF')
select area_id,count(*) from u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;

无论如何,看起来postgis的下一个版本3.4.0将改变gist索引扫描成本的估计方式,在这种情况下,可能会解决你的问题(提交31 bcb 7 d414 c73 df 8dbc 2975 c6 dd 4a 269 b190 c874)

相关问题