PostgreSQL LATERAL JOIN not using trigram index

Posted by 3qpi33ja 12 months ago in PostgreSQL

I want to do some basic geocoding of addresses using Postgres. I have an address table with around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain postal codes, so my first attempt was an ILIKE match inside a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

That gives the expected result, but it is slow. Here is the query plan:

QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a trigram index to the address column, as described at https://stackoverflow.com/a/13452528/36191, but the plan for the query above does not use it and remains unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

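To get the index used, I have to remove the ORDER BY and LIMIT from the lateral join query, which does not give me the results I want. Here is the query plan for the query without ORDER BY or LIMIT: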

QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there a way to get the query to use the index, or is there a better way to rewrite this query?

postgresql

Source: https://stackoverflow.com/questions/37267109/lateral-join-not-using-trigram-index

3 Answers

njthzxwz1#

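Why?
The query cannot use the index on principle. You would need an index on the table locations, but the one you have is on the table addresses.
You can verify my claim by setting: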

SET enable_seqscan = off;

It is not that the index would be more expensive than a sequential scan; Postgres simply has no way to use it for your query *at all*.
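
A minimal sketch of that check, assuming the tables and the idx_address index from the question. Wrapping it in a transaction with SET LOCAL is an assumption of this sketch, chosen so the setting reverts on ROLLBACK and does not leak into the rest of the session.

```sql
BEGIN;

-- Discourage sequential scans for this transaction only (debugging aid).
SET LOCAL enable_seqscan = off;

-- Same query as in the question; only the plan is requested, nothing runs.
EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   WHERE  a.address ILIKE '%' || postalcode || '%'
   ORDER  BY length(postalcode) DESC
   LIMIT  1
   ) l ON true;

ROLLBACK;
```

If the claim holds, the plan should still show a Seq Scan on locations: enable_seqscan = off only penalizes sequential scans, it cannot remove them when the planner has no other access path.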
Aside: [INNER] JOIN ... ON true is just an awkward way of writing CROSS JOIN ...

Why is the index used after removing `ORDER` and `LIMIT`?

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

You will see the exact same query plan (at least I did in my tests on Postgres 9.5).
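
To compare the two forms yourself, something like the following could be run against the tables from the question. It is only a sketch: the first statement is the question's lateral query with ORDER BY and LIMIT stripped, the second is the plain join shown above.

```sql
-- Form 1: lateral subquery without ORDER BY / LIMIT.
-- Without those clauses the planner is free to flatten the subquery.
EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   WHERE  a.address ILIKE '%' || postalcode || '%'
   ) l ON true;

-- Form 2: the equivalent plain join.
EXPLAIN
SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';
```

Once the subquery is flattened, the planner may pick locations as the outer table and probe idx_address for each postal code, which works because address sits on the left side of ILIKE. A subquery with LIMIT cannot be flattened this way, so addresses stays the outer table and the index on it is of no use.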

Solution

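You need an index on locations.postalcode. With LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the *left* side of the operator. ILIKE is implemented with the operator ~~*, and this operator has no COMMUTATOR (a logical necessity), so it is not possible to flip the operands around. Detailed explanations are in these related answers: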

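Can PostgreSQL index array columns?
Find rows where text array contains value similar to input
Is there a way to usefully index a text column containing regex patterns?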
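
The commutator claim can be inspected in the system catalog. The query below is a sketch that assumes the pg_trgm extension is available, since it provides % and <-> for text; the column alias is illustrative.

```sql
-- pg_trgm supplies the % and <-> operators on text.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- List the commutator registered for each operator on (text, text).
SELECT o.oprname,
       o.oprcom::regoperator AS commutator  -- shows 0 when none is defined
FROM   pg_operator o
WHERE  o.oprleft  = 'text'::regtype
AND    o.oprright = 'text'::regtype
AND    o.oprname IN ('~~*', '%', '<->');
```

The row for ~~* should show no commutator, while % and <-> should each list themselves.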

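One solution is to use the trigram similarity operator % or its inverse, the distance operator <->, in a nearest-neighbor query instead (each is its own commutator, so the operands can switch places freely):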

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

This finds the most similar postalcode for each address, then checks whether that postalcode actually matches in full.

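This way, a longer postalcode is preferred automatically, since it is more similar (smaller distance) than a shorter postalcode that also matches.
Some uncertainty remains. Depending on the possible postal codes, there could be false positives caused by trigrams that match in other parts of the string. There is not enough information in the question to say more.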
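
To get a feel for how the distance behaves, it can help to try it on a few literals first. The address and postal codes below are hypothetical sample values, not data from the question; the sketch assumes pg_trgm is installed.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical sample values, for illustration only.
SELECT pc,
       similarity(pc, addr)        AS sim,
       pc <-> addr                 AS dist,      -- defined as 1 - similarity
       addr ILIKE '%' || pc || '%' AS contained  -- the final exact check
FROM  (VALUES ('10115'), ('1011'), ('0115'), ('99999')) AS p(pc)
CROSS  JOIN (VALUES ('Invalidenstrasse 44, 10115 Berlin')) AS a(addr)
ORDER  BY dist;
```

Comparing the dist and contained columns on real data is a quick way to spot the false positives mentioned above: rows that rank first by distance but fail the ILIKE check.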

*Here*, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

The manual:
"This can be implemented quite efficiently by GiST indexes, but not by GIN indexes."
So:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);
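
A possible follow-up, sketched under the assumption that the index above was created successfully: make sure the extension is present, refresh statistics, and check the plan of the nearest-neighbor query.

```sql
-- gist_trgm_ops comes from pg_trgm; the extension must exist
-- before the CREATE INDEX above can succeed.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Refresh planner statistics for the newly indexed table.
ANALYZE locations;

-- Inspect the plan of the nearest-neighbor query from this answer.
EXPLAIN
SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON a.address ILIKE '%' || l.postalcode || '%';
```

If the index is usable, the inner side of the nested loop should show an Index Scan using locations_postalcode_trgm_gist_idx with an Order By line on postalcode <-> a.address, rather than a Sort over a Seq Scan.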


x8diyxa72#

It is a long shot, but how does the following alternative perform?

SELECT DISTINCT ON ((x.a).address) (x.a).*, l.*
FROM (
  SELECT a, l.id AS lid, LENGTH(l.postalcode) AS pclen
  FROM addresses a
  LEFT JOIN locations l ON (a.address ilike '%' || l.postalcode || '%') -- this should be fast, but produce many rows
  ) x
LEFT JOIN locations l ON (l.id = x.lid)
ORDER BY (x.a).address, pclen DESC -- this is where it will be slow, as it'll have to sort the entire results, to filter them by DISTINCT ON


yxyvkwin3#

If you turn the lateral join inside out, it works, but even then it may still be very slow:

SELECT DISTINCT ON (address) *
FROM (
    SELECT * 
    FROM locations
       ,LATERAL(
           SELECT * FROM addresses
           WHERE address ilike '%' || postalcode || '%'
           OFFSET 0 -- force fencing, might be redundant
        ) a
) q
ORDER BY address, LENGTH(postalcode) DESC

The downside is that you can implement paging only on postal codes, not on addresses.
