I want to build a simple search engine with Hadoop.
To that end, I created an inverted index using the Hadoop Streaming API and bash, which produces output files like this:
ab (744 1) 1
abbrevi (122 1) 1
abil (51 1) (77 1) (738 1) 3
abl (99 1) (132 1) (536 1) (581 1) (695 1) (763 1) (908 1) (914 1) (986 1) (1114 2) 10
ablat (82 2) (274 2) (553 7) (587 1) (1065 3) (1096 2) (1097 7) (1098 3) (1099 4) (1100 4) (1101 3) (1226 3) (1241 3) (1279 1) 14
about (27 1) (32 1) (39 1) (46 1) (49 2) (56 1) (57 1) (69 2) (77 2) (81 2) (83 2) (113 1) (134 1) (139 2) (140 1) (155 1) (156 2) (162 1) (163 1) (165 2) (171 1) (174 1) (177 1) (193 5) (205 1) (206 3) (212 1) (216 3) (218 1) (225 2) (249 3) (255 1) (257 1) (262 1) (266 3) (272 6) (273 1) (285 1) (292 2) (313 1) (315 2) (346 2) (368 1) (370 1) (371 1) (372 1) (373 1) (381 2) (391 1) (410 3) (420 1) (452 1) (456 4) (469 1) (479 1) (489 1) (498 3) (511 1) (518 1) (531 1) (536 1) (548 1) (555 1) (556 1) (560 2) (565 1) (567 1) (572 1) (575 1) (577 1) (589 1) (601 1) (603 1) (610 1) (612 1) (614 1) (620 1) (621 4) (625 3) (626 1) (646 1) (649 1) (651 2) (657 2) (662 1) (679 1) (685 2) (686 1) (704 2) (706 2) (709 1) (717 2) (721 1) (740 2) (757 2) (759 1) (774 1) (786 1) (792 2) (793 1) (794 2) (796 2) (801 2) (805 1) (806 1) (807 2) (808 2) (811 1) (815 1) (816 1) (829 2) (844 1) (869 1) (876 1) (912 1) (917 1) (921 1) (927 1) (928 2) (958 1) (976 6) (991 1) (992 2) (993 1) (994 1) (996 1) (999 1) (1000 1) (1002 1) (1004 2) (1006 1) (1040 1) (1092 1) (1095 2) (1104 4) (1105 1) (1115 1) (1143 4) (1156 2) (1162 1) (1164 3) (1165 1) (1166 3) (1169 1) (1191 1) (1194 1) (1202 1) (1209 1) (1212 1) (1218 1) (1223 1) (1224 1) (1229 1) (1230 1) (1231 1) (1239 1) (1241 1) (1244 1) (1246 1) (1248 1) (1255 2) (1262 1) (1275 2) (1282 1) (1303 1) (1304 1) (1307 1) (1310 3) (1316 1) (1335 1) (1341 1) (1344 1) (1345 1) (1353 1) (1354 3) (1355 1) (1363 1) (1377 1) 178
This means that, for example, the word "ab" occurs exactly once in document 744. Now I want to implement AND query searching (meaning a document must contain every word of the query) using the Hadoop Streaming API.
So what exactly should the map and reduce phases be for searching? Also, can you give me some hints on how to implement this with the Streaming API (what should the input be)? I don't know how to proceed.
Thanks
1 Answer
Here is my take on the query-search problem. I'll just give a rough outline of what should be done rather than actual code (my bash skills are a bit rusty anyway).
Job setup
First, tokenize the query and put the token list into a configuration value as a comma-separated list. You could do this on the mapper/reducer side, but I recommend centralizing this part in the job setup.
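As a rough sketch, the setup could look like this (the jar path, HDFS paths, script names, and the QUERY variable name are all my assumptions, not part of the answer; -cmdenv is the streaming option that exports an environment variable to every task):

```shell
# Tokenize the query once, up front (lower-case, comma-separated):
QUERY=$(printf '%s' "deep learning" | tr 'A-Z ' 'a-z,')   # -> deep,learning

# Launch the streaming job; -cmdenv hands the token list to every task.
hadoop jar "$HADOOP_STREAMING_JAR" \
  -cmdenv QUERY="$QUERY" \
  -input  /index/part-* \
  -output /results/query1 \
  -mapper  mapper.sh \
  -reducer reducer.sh \
  -file mapper.sh -file reducer.sh
```

This way the tokenization happens exactly once, and the mapper and reducer only have to parse a comma-separated string from their environment.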
Mapper
Read the query's config value and turn it into a set, or some other structure with fast key lookup.
The mapper should map over every line (one word mapped to n documents) and, if the current word on that line is in your query set, "emit" it. This stage should emit the document id as the key and the word as the value (which creates n output records, where n is the number of documents containing that word).
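The mapper step above could be sketched in bash like this (a sketch under my assumptions: bash 4+ for associative arrays, and a QUERY environment variable holding the comma-separated query tokens, set e.g. via -cmdenv on the streaming command line):

```shell
#!/usr/bin/env bash
# mapper.sh (sketch, bash 4+): reads index lines of the form
#   word (doc1 tf1) (doc2 tf2) ... total
# and emits "doc<TAB>word" for every document containing a query word.
# QUERY is an assumed env var with comma-separated query tokens.
map() {
  declare -A query
  local t line word rest
  IFS=',' read -ra tokens <<< "$QUERY"
  for t in "${tokens[@]}"; do query["$t"]=1; done

  local pair_re='\(([0-9]+) [0-9]+\)'         # matches one "(doc tf)" pair
  while read -r line; do
    word=${line%% *}                          # first field is the word
    [[ -n ${query[$word]:-} ]] || continue    # skip non-query words
    rest=${line#* }
    while [[ $rest =~ $pair_re ]]; do
      printf '%s\t%s\n' "${BASH_REMATCH[1]}" "$word"
      rest=${rest#*"${BASH_REMATCH[0]}"}      # advance past this pair
    done
  done
}

# Demo on two of the index lines above:
QUERY="ab,abil" map <<'EOF'
ab (744 1) 1
abil (51 1) (77 1) (738 1) 3
EOF
```

Because streaming uses the tab character as the key/value separator, the shuffle phase will then group and sort these records by document id for free.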
Reducer
The reducer then receives a document id as the key and the tokens from your query that matched it as values. Now read the config value again and check whether you received all of the query's tokens for that document.
You should emit the document id as the key; in search you would normally output some "match score" as the value. In your case you are only looking for "complete" matches, so this score does not really matter, as it would be a constant.
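In bash, the reducer above might look like the following sketch (again my assumptions: bash 4+, QUERY passed via the environment; streaming delivers the mapper output sorted by key, so all records for one document arrive consecutively):

```shell
#!/usr/bin/env bash
# reducer.sh (sketch, bash 4+): input is "doc<TAB>word" lines sorted by doc.
# A document is emitted only if it matched every token of the query.
reduce() {
  declare -A query seen
  local t doc word cur=""
  IFS=',' read -ra tokens <<< "$QUERY"
  for t in "${tokens[@]}"; do query["$t"]=1; done
  local needed=${#query[@]}

  while read -r doc word; do
    if [[ $doc != "$cur" ]]; then
      # Flush the previous document: a hit only if all tokens were seen.
      [[ -n $cur && ${#seen[@]} -eq $needed ]] && printf '%s\t1\n' "$cur"
      cur=$doc
      seen=()                    # reset the per-document token set
    fi
    seen["$word"]=1
  done
  [[ -n $cur && ${#seen[@]} -eq $needed ]] && printf '%s\t1\n' "$cur"
  return 0
}

# Demo: for the query "ab AND abil", only document 738 contains both.
printf '51\tabil\n77\tabil\n738\tab\n738\tabil\n744\tab\n' \
  | QUERY="ab,abil" reduce
```

The emitted "1" is the constant score mentioned above; a ranked search would compute something more meaningful here.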
Some improvements
Once this works, think about some improvements. In this scheme the mapper emits every matching token; do you really need them as separate records? Maybe you can save some network bandwidth with a combiner?
I leave these as an exercise for the reader ;-)
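One possible direction for that exercise (my sketch, not the answer's code): since each query word appears on exactly one index line, every (document, word) pair is emitted at most once, so the mapper could emit a constant "doc<TAB>1" instead of the word itself, and the reducer only has to check that the per-document sum equals the number of query tokens. A combiner could use the same grouping loop but emit the partial sum "doc<TAB>n" instead of filtering, cutting down what crosses the network.

```shell
#!/usr/bin/env bash
# Sketch of the count-based reducer. $1 is the number of query tokens;
# input is "doc<TAB>count" lines sorted (grouped) by doc.
sum_matches() {
  local needed=$1 cur="" n=0 doc c
  while read -r doc c; do
    if [[ $doc != "$cur" ]]; then
      [[ -n $cur && $n -eq $needed ]] && printf '%s\t1\n' "$cur"
      cur=$doc
      n=0
    fi
    (( n += c ))
  done
  [[ -n $cur && $n -eq $needed ]] && printf '%s\t1\n' "$cur"
  return 0
}

# Demo: a two-token query; only document 738 reaches the full count.
printf '51\t1\n738\t1\n738\t1\n744\t1\n' | sum_matches 2
```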