添加索引会降低我在PostgreSQL中的查询速度,但在SQLite中情况正好相反

f0brbegy  于 2022-11-15  发布在  SQLite
关注(0)|答案(1)|浏览(160)

我的PostgreSQL数据库(v13.8,在Debian 11(Bullseye)上运行)中有以下表,其中包含大约15,000,000行:

=> \d ncbitaxon
               Table "public.ncbitaxon"
   Column   |  Type   | Collation | Nullable | Default 
------------+---------+-----------+----------+---------
 assertion  | integer |           |          | 
 retraction | integer |           |          | 0
 graph      | text    |           |          | 
 subject    | text    |           |          | 
 predicate  | text    |           |          | 
 object     | text    |           |          | 
 datatype   | text    |           |          | 
 annotation | text    |           |          |

该表没有主键,因此每个subject字段可以与多个objectpredicate字段相关联。我想检索与predicate‘rdf:type’和object‘owl:Class’关联的所有主题,但这与predicate‘rdf:subClassOf’没有关联。
该表没有索引,当我运行以下查询时,我得到的结果一致不超过2秒(我认为这是可以接受的):

select n1.subject
  from ncbitaxon n1
 where n1.predicate = 'rdf:type'
   and n1.object = 'owl:Class'
   and not exists (
     select 1
       from ncbitaxon n2
      where n2.subject = n1.subject
        and n2.predicate = 'rdfs:subClassOf'
   )

但是,当我在subjectobjectpredicatedatatype列的每个列上添加(非唯一的)btree索引时,性能会显著降低,因此我的结果一致地返回约9秒(太慢了)。
我意识到索引不是灵丹妙药,有时甚至可以(显然)减慢SELECT查询的速度。
但令我困惑的是,当我在同一个表上运行相同的查询时,但这一次是在SQLite(v3.34.1,在同一台笔记本电脑上运行)中,我产生了相反的效果。也就是说,有了这些索引,我在大约5秒内就能得到结果,而没有这些索引,我最终不得不按Ctrl-C,因为我厌倦了等待结果返回。
我想知道这是否由于某种缓存效应,所以我尝试通过运行以下命令来清除PostgreSQL的缓存:

echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches
systemctl restart postgresql

而且确实有一些缓存效应,因为我第一次在PostgreSQL中运行查询时,大约需要3秒(而不是2秒)。但这仍然比在SQLite(~5s)或PostgreSQL(~9s)中运行查询更快,如果我有这些索引的话。
我很困惑。有人知道这是怎么回事吗?
CREATE INDEX语句为:

create index idx_77907_idx_ncbitaxon_predicate on ncbitaxon (predicate);
create index idx_77907_idx_ncbitaxon_subject on ncbitaxon (subject);
create index idx_77907_idx_ncbitaxon_object on ncbitaxon (object);
create index idx_77907_idx_ncbitaxon_datatype on ncbitaxon (datatype);

下面是使用索引的情况下explain (analyze, buffers, format text)的输出:

QUERY PLAN                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.56..549629.43 rows=1 width=17) (actual time=5354.149..7223.752 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=12556526 read=226206
   ->  Nested Loop Anti Join  (cost=0.56..548629.33 rows=1 width=17) (actual time=5942.986..7184.393 rows=1 loops=3)
         Buffers: shared hit=12556526 read=226206
         ->  Parallel Seq Scan on ncbitaxon n1  (cost=0.00..295443.22 rows=168032 width=17) (actual time=137.371..630.607 rows=812952 loops=3)
               Filter: ((predicate = 'rdf:type'::text) AND (object = 'owl:Class'::text))
               Rows Removed by Filter: 4250687
               Buffers: shared hit=6214 read=194286
         ->  Index Scan using idx_77907_idx_ncbitaxon_subject on ncbitaxon n2  (cost=0.56..3.46 rows=5 width=17) (actual time=0.008..0.008 rows=1 loops=2438855)
               Index Cond: (subject = n1.subject)
               Filter: (predicate = 'rdfs:subClassOf'::text)
               Rows Removed by Filter: 4
               Buffers: shared hit=12550312 read=31920
 Planning:
   Buffers: shared hit=14 read=6
 Planning Time: 1.054 ms
 JIT:
   Functions: 27
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 6.551 ms, Inlining 111.762 ms, Optimization 182.297 ms, Emission 117.506 ms, Total 418.115 ms
 Execution Time: 7228.950 ms
(23 rows)

以下是没有索引的情况:

------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=299030.49..603580.05 rows=1 width=17) (actual time=1461.791..1500.212 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=25757 read=375369, temp read=22444 written=22996
   ->  Parallel Hash Anti Join  (cost=298030.49..602579.95 rows=1 width=17) (actual time=1425.751..1432.604 rows=1 loops=3)
         Hash Cond: (n1.subject = n2.subject)
         Buffers: shared hit=25757 read=375369, temp read=22444 written=22996
         ->  Parallel Seq Scan on ncbitaxon n1  (cost=0.00..295443.22 rows=168032 width=17) (actual time=0.021..447.123 rows=812952 loops=3)
               Filter: ((predicate = 'rdf:type'::text) AND (object = 'owl:Class'::text))
               Rows Removed by Filter: 4250687
               Buffers: shared hit=12868 read=187632
         ->  Parallel Hash  (cost=279619.35..279619.35 rows=1002811 width=17) (actual time=694.169..694.170 rows=812951 loops=3)
               Buckets: 65536  Batches: 64  Memory Usage: 2624kB
               Buffers: shared hit=12772 read=187728, temp written=11456
               ->  Parallel Seq Scan on ncbitaxon n2  (cost=0.00..279619.35 rows=1002811 width=17) (actual time=139.958..552.807 rows=812951 loops=3)
                     Filter: (predicate = 'rdfs:subClassOf'::text)
                     Rows Removed by Filter: 4250687
                     Buffers: shared hit=12772 read=187728
 Planning:
   Buffers: shared hit=5 dirtied=1
 Planning Time: 0.582 ms
 JIT:
   Functions: 39
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 6.349 ms, Inlining 93.709 ms, Optimization 198.947 ms, Emission 126.610 ms, Total 425.615 ms
 Execution Time: 1504.890 ms
(26 rows)

SQLite中explain的输出(带索引)为:

addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     27    0                    0   Start at 27  
1     OpenRead       0     34019  0     6              0   root=34019 iDb=0; ncbitaxon
2     OpenRead       2     519613  0     k(2,,)         2   root=519613 iDb=0; idx_ncbitaxon_object
3     String8        0     1     0     owl:Class      0   r[1]='owl:Class'
4     SeekGE         2     26    1     1              0   key=r[1]     
5       IdxGT          2     26    1     1              0   key=r[1]     
6       DeferredSeek   2     0     0                    0   Move 0 to 2.rowid if needed
7       Column         0     4     2                    0   r[2]=ncbitaxon.predicate
8       Ne             3     25    2     BINARY-8       82  if r[2]!=r[3] goto 25
9       Integer        0     4     0                    0   r[4]=0; Init EXISTS result
10      Integer        1     5     0                    0   r[5]=1; LIMIT counter
11      OpenRead       1     34019  0     5              0   root=34019 iDb=0; ncbitaxon
12      OpenRead       3     332263  0     k(2,,)         2   root=332263 iDb=0; idx_ncbitaxon_subject
13      Column         0     3     6                    0   r[6]=ncbitaxon.subject
14      SeekGE         3     22    6     1              0   key=r[6]     
15        IdxGT          3     22    6     1              0   key=r[6]     
16        DeferredSeek   3     0     1                    0   Move 1 to 3.rowid if needed
17        Column         1     4     7                    0   r[7]=ncbitaxon.predicate
18        Ne             8     21    7     BINARY-8       82  if r[7]!=r[8] goto 21
19        Integer        1     4     0                    0   r[4]=1       
20        DecrJumpZero   5     22    0                    0   if (--r[5])==0 goto 22
21      Next           3     15    1                    0                
22      If             4     25    1                    0                
23      Column         0     3     10                   0   r[10]=ncbitaxon.subject
24      ResultRow      10    1     0                    0   output=r[10] 
25    Next           2     5     1                    0                
26    Halt           0     0     0                    0                
27    Transaction    0     0     77    0              1   usesStmtJournal=0
28    String8        0     3     0     rdf:type       0   r[3]='rdf:type'
29    String8        0     8     0     rdfs:subClassOf  0   r[8]='rdfs:subClassOf'
30    Goto           0     1     0                    0

最后,下面是SQLite中explain query plan的输出:

QUERY PLAN
|--SEARCH TABLE ncbitaxon AS n1 USING INDEX idx_ncbitaxon_object (object=?)
`--CORRELATED SCALAR SUBQUERY 1
   `--SEARCH TABLE ncbitaxon AS n2 USING INDEX idx_ncbitaxon_subject (subject=?)
g0czyy6m

g0czyy6m1#

我不能给出一个完整的答案,但这里有一些想法:

  • 在慢速计划中,PostgreSQL将在ncbitaxon n1中找到的行数低估了近5倍。尝试通过运行ANALYZE来收集新的统计信息来改进该估计值,或者,如果这还不够充分,则通过为这些列的相关性创建扩展的统计信息来提高估计值。

如果这一估计更准确,则将以更高的成本估计缓慢的计划,将不会选择该计划。

  • 慢计划中,在ncbitaxon n2上重复指数扫描的成本远高于PostgreSQL预估。这在一定程度上是由于上面的错误估计,但可能random_page_cost设置得比您的硬件合适,因此PostgreSQL低估了索引扫描的成本。

相关问题