I am trying to use a spider (web crawler) to find keywords on web pages; the spider stores each keyword that matches, together with the URL of the page it was found on, in a CSV file. The problem is that if a keyword occurs multiple times on the same page, duplicate rows end up in the CSV file. How can I remove the duplicate links for a keyword? Here is the relevant part of my spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item

# NOTE: the class declaration and imports are reconstructed so the snippet
# runs; the original post showed only the attributes and callback below.
class BuzzwordSpider(CrawlSpider):
    name = "buzzwords"
    allowed_domains = ["www.geo.tv"]
    start_urls = ["https://www.geo.tv/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]
    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        crawl_count = self.__class__.crawl_count
        wordlist = ["Imran", "Hello", "Nauman"]
        url = response.url
        contenttype = response.headers.get("content-type", b"").decode("utf-8").lower()
        data = response.body.decode("utf-8")
        for word in wordlist:
            # find_all_substrings is a helper defined elsewhere in the
            # project; it yields every position at which word occurs in data.
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                ok = False
                if not ok:
                    # This runs once per occurrence, so a word that appears
                    # N times on a page produces N identical output lines.
                    self.__class__.words_found += 1
                    print(word + ";" + url + ";")
        return Item()
1 Answer
I'm not quite sure exactly what your problem is, but it sounds like all you need to do is stop iterating over the full iterable returned by find_all_substrings. Just break after the first iteration, since you know every additional iteration will be a duplicate. For example:
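Here is a minimal sketch of that fix: a drop-in replacement for the check_buzzwords callback in the spider above (find_all_substrings is still assumed to be your existing helper).

def check_buzzwords(self, response):
    self.__class__.crawl_count += 1
    wordlist = ["Imran", "Hello", "Nauman"]
    url = response.url
    data = response.body.decode("utf-8")
    for word in wordlist:
        substrings = find_all_substrings(data, word)
        for pos in substrings:
            self.__class__.words_found += 1
            print(word + ";" + url + ";")
            # Stop after the first occurrence of this word on the page;
            # every further position would print a duplicate line.
            break
    return Item()

Since the position pos is never actually used, you could equivalently drop the inner loop and just test whether the word occurs at all (for example, next(iter(find_all_substrings(data, word)), None) is not None). Either way, each word/URL pair is emitted at most once per page.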