mysql 如何查询具有已知重复项且具有标记的主表?

mo49yndu  于 2023-10-15  发布在  Mysql
关注(0)|答案(3)|浏览(118)

我有一个名为pics的主表,它有一个ID号和从中提取它的源站点。我还有另一个名为tags的表,其中包含从该站点提取的所有标记的列表。关键是,有已知的复制品。已知重复的方式是无关紧要的。duplicates也存储在自己的表中。第一列“original”是它在“pics”表中的第一个示例。第二列是任何additional重复。该网站根据标签搜索图像,这些标签可以是多个标签。* (我知道在pics表中的源站点可能不是最佳的,也不是正确的数据库设计。)* 这必须在MySQL/phpMyAdmin上工作。**

1.当搜索特定标签时,将列出所有图像ID号的SQL连接查询是什么?例如:

  • 搜索字符串:fire,结果:[null]
  • 搜索字符串:red,结果:#1
  • 搜索字符串:vines,结果:#2
  • 搜索字符串:green,结果:#2
  • 搜索字符串:forest brown,结果:#2
  • 搜索字符串:bird tree,结果:[null]
    2.合并镜像源站点的SQL连接查询是什么?例如:
  • 图像的研究中心:1,结果:colors.com
  • 图像的研究中心:2,结果:colors.comnature.com
  • 图像的研究中心:4,结果:nature.com
  • (获取#5和#6的网站并不重要,因为它们总是引用#2。
    3.或者老实说,不做这些复杂的连接操作,做更简单的连接查询,并有一个自动化的,预定的过程手动合并标签和删除重复的条目会更容易吗?

下面是复制粘贴到空白数据库的SQL:

START TRANSACTION;
CREATE TABLE `duplicates` (`original` int(2) NOT NULL,`additional` int(2) NOT NULL);
INSERT INTO `duplicates` (`original`, `additional`) VALUES (2, 5),(2, 6);
CREATE TABLE `pics` (`id` int(2) NOT NULL,`sourceSite` text NOT NULL);
INSERT INTO `pics` (`id`, `sourceSite`) VALUES (1, 'colors.com'),(2, 'colors.com'),(3, 'colors.com'),(4, 'nature.com'),(5, 'nature.com'),(6, 'nature.com');
CREATE TABLE `tags` (`id` int(2) NOT NULL,`tag` varchar(16) NOT NULL);
INSERT INTO `tags` (`id`, `tag`) VALUES (1, 'red'),(1, 'orange'),(2, 'green'),(2, 'brown'),(3, 'blue'),(3, 'cyan'),(4, 'bird'),(4, 'flight'),(5, 'tree'),(5, 'forest'),(6, 'vines'),(6, 'tree');
ALTER TABLE `duplicates` ADD KEY `original` (`original`), ADD KEY `additional` (`additional`);
ALTER TABLE `pics` ADD PRIMARY KEY (`id`);
ALTER TABLE `tags` ADD KEY `id` (`id`);
ALTER TABLE `duplicates` ADD CONSTRAINT `duplicates_ibfk_1` FOREIGN KEY (`original`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `duplicates_ibfk_2` FOREIGN KEY (`additional`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE `tags` ADD CONSTRAINT `tags_ibfk_1` FOREIGN KEY (`id`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;
zed5wv10

zed5wv101#

用你目前的结构做这件事是非常低效的,但可以做到。
需要注意的是,由于duplicates表是一个连接表,因此应该使用复合索引在两个方向上对它进行索引:

CREATE TABLE `duplicates` (
    `original` INT NOT NULL,
    `additional` INT NOT NULL,
    PRIMARY KEY (`original`, `additional`),
    INDEX (`additional`, `original`),
    FOREIGN KEY (`original`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    FOREIGN KEY (`additional`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
);

根据一组标签搜索pic ID:

SELECT COALESCE(d.original, t.id) AS id
FROM tags t
LEFT JOIN duplicates d ON d.additional = t.id
WHERE t.tag IN ('forest', 'brown', 'vines')
GROUP BY 1
HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find

搜索给定pic ID的源:

SELECT p.sourceSite
FROM pics p
WHERE p.id = 2
UNION
SELECT p.sourceSite
FROM pics p
JOIN duplicates d ON d.additional = p.id
WHERE d.original = 2;

如果你想把上面两个联系在一起,这样你就可以根据一组标签搜索图片,并返回ID,来源和标签的完整列表,你可以做以下事情。

第一步:获取图像的ID和所有需要的标签:

WITH pics_by_tags (pic_id) AS (
    SELECT COALESCE(d.original, t.id)
    FROM tags t
    LEFT JOIN duplicates d ON d.additional = t.id
    WHERE t.tag IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
)
SELECT * FROM pics_by_tags;

输出量:
| pic_id|
| --|
| 2 |

第二步:获取上一步返回的每个图片的源代码:

WITH pics_by_tags (pic_id) AS (
    /* as before */
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
    FROM pics_by_tags pbt
    JOIN pics p
        ON pbt.pic_id = p.id
        OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.ID = d.additional)
    GROUP BY 1
)
SELECT * FROM pics_with_sites;

输出量:
| pic_id|网站|
| --|--|
| 2 |网址:colors.com,nature.com|

第三步:获取上一步结果的完整标签列表:

WITH pics_by_tags (pic_id) AS (
    /* as before */
),
pics_with_sites (pic_id, sites) AS (
    /* as before */
)
SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
FROM pics_with_sites p
JOIN tags t
    ON p.pic_id = t.id
    OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
GROUP BY p.pic_id, p.sites;

输出量:
| pic_id|网站|标签|
| --|--|--|
| 2 |网址:colors.com,nature.com|棕色,森林,绿色,树,藤蔓|
完整的查询是:

WITH pics_by_tags (pic_id) AS (
    SELECT COALESCE(d.original, t.id)
    FROM tags t
    LEFT JOIN duplicates d ON d.additional = t.id
    WHERE t.tag IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
    FROM pics_by_tags pbt
    JOIN pics p
        ON pbt.pic_id = p.ID
        OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.id = d.additional)
    GROUP BY 1
)
SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
FROM pics_with_sites p
JOIN tags t
    ON p.pic_id = t.id
    OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
GROUP BY p.pic_id, p.sites;

显然,一个没有重复的规范化结构将更好地完成这项任务。
例如:

pictures (id, other_picture_meta_data)
sources (id, name)
tags_new (id, name)
pics_sources (pic_id, source_id)
pics_tags (pic_id, tag_id)

使用修改后的结构,查询变为:

WITH pics_by_tags (pic_id) AS (
    SELECT pt.pic_id
    FROM tags_new tn
    JOIN pics_tags pt ON tn.id = pt.tag_id
    WHERE tn.name IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(tn.name) = 3 -- because we have 3 tags to find
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(s.name)
    FROM pics_by_tags pbt
    JOIN pics_sources ps ON pbt.pic_id = ps.pic_id
    JOIN sources s ON ps.source_id = s.id
    GROUP BY 1
)
SELECT p.*, pws.sites, GROUP_CONCAT(tn.name ORDER BY tn.name) AS tags
FROM pics_with_sites pws
JOIN pictures p ON pws.pic_id = p.id
JOIN pics_tags pt ON pws.pic_id = pt.pic_id
JOIN tags_new tn ON pt.tag_id = tn.id
GROUP BY p.id;

这里有一个db<>fiddle玩。

qnyhuwrf

qnyhuwrf2#

下面是您的模式的图形表示。让我们一起努力。

  1. SQL查询列出给定标签下的图像:
SELECT pics.*
FROM tags
    LEFT JOIN duplicates ON tags.id = duplicates.original OR tags.id = duplicates.additional
    LEFT JOIN pics on duplicates.original = pics.id
WHERE tags.tag = "green"
GROUP BY duplicates.original;

也就是说,给定输入标签,例如,green
1.查询tags.tag等于"green"的标签,
1.找出所有与tags.id相关的duplicates
1.通过按duplicates.original分组将这些副本“压缩”为一个,以便只有一个副本“占上风”,然后
1.根据上面的结果得到相应的pics.*
事后看来,您可能需要考虑在表tagsduplicates之间创建外键约束。
1.用于列出图像源站点的SQL查询:

SELECT pics.sourceSite
FROM duplicates
    LEFT JOIN pics ON duplicates.additional = pics.id
WHERE duplicates.original = 2 OR duplicates.additional = 2
GROUP BY pics.sourceSite

也就是说,给定输入图像ID,例如2
1.查找引用图像ID 2duplicates
1.查询pics关联到以上结果中的任何重复项,则
1.分组并选择这些图片的源站点。
1.理想情况下,您希望删除重复项并规范化标记。但在现实中,可能存在不这样做的有效因素。简而言之,“这取决于”。

shyt4zoc

shyt4zoc3#

考虑在包含匹配图像的所有标记的列上使用FULLTEXT索引。
然后这将捕获#5和#6,由于更多的“相关性”,首先订购#5:

MATCH(tags) AGAINST ("forest tree" IN BOOLEAN MODE)

而这只会抓住#5:

MATCH(tags) AGAINST ("+forest +tree" IN BOOLEAN MODE)

相关问题