I need help with my web crawler. I get an invalid syntax error on this line:

    f.write("{},{},{}\n".format(word, url, count))

Also, when I run scrapy crawl FirstSpider > wordlist.csv, a CSV file is created, but it is either empty or not structured the way I want. I want to crawl about 300 websites and need the data to be as structured as possible. How can I get a CSV file with the URL and, next to it, the count of certain keywords found on that site?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item
import requests

def find_all_substrings(string, sub):
    import re
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class FirstSpider(CrawlSpider):
    name = "FirstSpider"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]
    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        wordlist = [
            "keyword1",
            "keyword2",
            "keyword3"
        ]
        url = response.url
        data = response.body.decode('utf-8')
        count = 0
        for word in wordlist:
            substrings = find_all_substrings(data, word)
            count = 0
            word_counts = {}
            links = []
            "f = open('wordlist.csv', 'w')"
            for pos in substrings:
                ok = False
                if not ok:
                    count += 1
            word_counts[word] = {url: count}
            for link in links:
                page = requests.get(link)
                data = page.text
                for word in wordlist:
                    substrings = find_all_substrings(data, word)
                    count = 0
                    for word in wordlist:
                        substrings = find_all_substrings(data, word)
                        for pos in substrings:
                            ok = False
                            if not ok:
                                "f.write("{},{},{}\n".format(word,url,count))"
                                self.__class__.words_found += 1
                                print(word + ";" + url + ";" + str(count) + ";")
        with open('wordlist.csv', 'w') as f:
            for word, data in word_counts.items():
                for url, count in data.items():
                    f.write("{},{},{}\n".format(word, url, count))
        f.close()
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) != None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []
I want to crawl websites for certain keywords (a word list). My output should be a CSV file with the following information: the URL and the count of each keyword found on that site.
I get an invalid syntax error on the line f.write("{},{},{}\n".format(word, url, count)), and the output CSV file is often empty or does not cover all of the URLs.
1 Answer
The lines f = open('wordlist.csv', 'w') and f.write("{},{},{}\n".format(word, url, count)) are wrapped in unnecessary quotes; in the second one the nested double quotes are what produce the invalid syntax error.
Also, since Scrapy has a built-in mechanism for this, Feed exports, there is usually no need to save the data to a file by hand.
With the FEED_EXPORT_FIELDS setting you can specify which item fields should be exported and in what order.
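For illustration, here is a minimal sketch of that approach, not the asker's exact spider: the keyword list, the field names word/url/count, and the simple text.count() matching are assumptions. The callback yields one item per keyword per page and lets the feed exporter write the CSV, with FEED_EXPORT_FIELDS controlling the column order:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

WORDLIST = ["keyword1", "keyword2", "keyword3"]  # placeholder keywords

class FirstSpider(CrawlSpider):
    name = "FirstSpider"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    custom_settings = {
        # column order of the exported CSV rows
        "FEED_EXPORT_FIELDS": ["word", "url", "count"],
    }

    def check_buzzwords(self, response):
        text = response.text
        for word in WORDLIST:
            # naive, case-sensitive substring count of the keyword in the page
            count = text.count(word)
            if count:
                # a plain dict is enough for the CSV feed exporter to pick up
                yield {"word": word, "url": response.url, "count": count}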
Here is the command to run the spider and save the scraped data to a file:
scrapy crawl FirstSpider -O url.csv
-O (uppercase "O") means "overwrite the file"; -o (lowercase "o") means "append to an existing file".
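If you would rather not pass the output file on the command line every run, recent Scrapy versions (2.1+) also support a FEEDS setting. A sketch of that alternative, assuming the file name wordlist.csv and the same three fields:

# settings.py (Scrapy 2.1+): configure the feed here instead of using -O/-o
FEEDS = {
    "wordlist.csv": {"format": "csv"},
}
FEED_EXPORT_FIELDS = ["word", "url", "count"]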