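I need to maintain a list of unique web pages in a table `pages`, linked in a many-to-many relationship to the table `page_providers` via the `page_xref_page_provider` table. I am having trouble designing an efficient atomic bulk refresh operation that consists of the following:

Receive a new list of pages from a page provider. Some pages on that list may be the same as pages already recorded in the database (same `Url`), some pages may have been removed from the list, and some may have been added.
There are per-page statistics in the database, so I must not delete an old page (identified by its unique `Url`) if at least one page provider still has that page on its list.
If the current page provider's updated list no longer contains a page it previously contained, and no other page provider's list contains that page, the page should be deleted from the `pages` table.
Pages that are not yet recorded when I receive the list must be added to the `pages` table, with cross-references in `page_xref_page_provider`.

What I have tried: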
-- We use IGNORE to handle duplicate URLs on the list we received from the current page provider
-- pages_temp is a temporary table whose creation I have omitted
INSERT IGNORE INTO pages_temp (Url, Host, Port) VALUES (?, ?, ?);
BEGIN;
-- In the DB client program, we get the last inserted ID from the following query and the number of
-- rows affected, so to get a range of newly inserted IDs
INSERT IGNORE INTO pages (Url, Host, Port) SELECT Url, Host, Port FROM pages_temp;
-- This doesn't work (wrong syntax), could you correct me here?
-- When preparing this statement, we parameterize it with the current PageProviderID, the
-- last inserted ID (which is actually the first ID in the bulk) and the number of rows inserted
-- plus the first ID in the bulk.
INSERT INTO page_xref_page_provider (PageProviderID, PageID) SELECT ?, i BETWEEN ? AND ?;
-- This query is parametrized with the current page provider ID
DELETE pxpp FROM page_xref_page_provider AS pxpp
JOIN pages ON pxpp.PageID = pages.ID AND pxpp.PageProviderID=?
WHERE pages.Url NOT IN (SELECT Url FROM pages_temp);
-- This seems inefficient because the subquery also fetches the relations not affected by the current
-- list of pages / page provider
DELETE FROM pages WHERE pages.ID NOT IN (SELECT DISTINCT PageID FROM page_xref_page_provider);
COMMIT;
1 Answer
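Avoid `NOT IN ( SELECT ... )`. In some situations its performance is terrible. `LEFT JOIN` and `EXISTS` are likely to be faster.

Does the table have an `AUTO_INCREMENT` ID? If so, beware that `IGNORE` "burns" IDs: an `INSERT IGNORE` can allocate auto-increment values for rows that are then skipped as duplicates. That leaves gaps, so a range computed from the first inserted ID and the affected-row count may not match the rows that were actually inserted.

A high-speed ingestion technique is discussed here: http://mysql.rjweb.org/doc.php/staging_table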
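As a rough sketch of how the two insert steps from the question might look with these points applied: the first statement inserts only URLs that are missing from `pages`, and the second links pages to the provider by joining on `Url` instead of relying on an ID range. It assumes `pages.Url` has a UNIQUE index, that `pages_temp` holds the provider's de-duplicated list, and that both `?` placeholders are bound to the current `PageProviderID`. None of those details is confirmed by the question, so treat this as an illustration rather than a drop-in fix.

```sql
-- Assumption: pages.Url is UNIQUE and pages_temp contains each Url once.
-- Insert only the URLs that pages does not have yet, so that no
-- AUTO_INCREMENT values are spent on rows that would have been ignored.
INSERT INTO pages (Url, Host, Port)
SELECT t.Url, t.Host, t.Port
FROM pages_temp AS t
LEFT JOIN pages AS p ON p.Url = t.Url
WHERE p.ID IS NULL;

-- Link every page on the new list to the current provider.
-- Joining on Url finds the right PageID whether the page is new or old;
-- the LEFT JOIN filters out links that already exist.
-- Both ? placeholders are the current PageProviderID.
INSERT INTO page_xref_page_provider (PageProviderID, PageID)
SELECT ?, p.ID
FROM pages_temp AS t
JOIN pages AS p ON p.Url = t.Url
LEFT JOIN page_xref_page_provider AS x
       ON x.PageID = p.ID
      AND x.PageProviderID = ?
WHERE x.PageID IS NULL;
```

Each statement refers to `pages_temp` only once, which matters on MySQL versions that do not allow a TEMPORARY table to appear twice in the same query. If several providers can be refreshed concurrently, the first statement could still hit a duplicate-key error on `Url`, so some retry or locking strategy may be needed.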
Performance tips for many-to-many tables: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
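Coming back to the `NOT IN` advice, the two `DELETE` statements from the question could be rewritten as anti-joins along these lines. This is a sketch under assumptions: it presumes the mapping table is indexed on both `(PageProviderID, PageID)` and `(PageID, PageProviderID)`, that `pages.Url` and `pages_temp.Url` are indexed, and that the `?` placeholder is the current `PageProviderID`.

```sql
-- Remove this provider's links to pages that are no longer on its list.
-- The LEFT JOIN ... IS NULL pattern replaces NOT IN (SELECT Url FROM pages_temp).
-- The ? placeholder is the current PageProviderID.
DELETE pxpp
FROM page_xref_page_provider AS pxpp
JOIN pages AS p ON p.ID = pxpp.PageID
LEFT JOIN pages_temp AS t ON t.Url = p.Url
WHERE pxpp.PageProviderID = ?
  AND t.Url IS NULL;

-- Remove pages that no provider references any more.
-- Replaces NOT IN (SELECT DISTINCT PageID FROM page_xref_page_provider).
DELETE p
FROM pages AS p
LEFT JOIN page_xref_page_provider AS x ON x.PageID = p.ID
WHERE x.PageID IS NULL;
```

The second statement still considers every row in `pages`, but with an index on `page_xref_page_provider` starting with `PageID` it can run as an index-backed anti-join. Run inside the same transaction as the inserts, it only removes pages that are left without any provider once the current list has been applied.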