Web crawler - Scrapy Python

njthzxwz posted on 2022-12-18 in Python

I need help with my web crawler. I am getting an invalid syntax error here:

"f.write("{},{},{}\n".format(word,url,count))"

Also, when I run the command "scrapy crawl FirstSpider > wordlist.csv", a CSV file does appear, but it is either empty or not structured the way I want. I want to crawl 300 websites and need the data to be as structured as possible. How can I get a CSV file where each URL is listed together with the counts of certain keywords next to it?

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item
import requests

def find_all_substrings(string, sub):

    import re
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class FirstSpider(CrawlSpider):
    name = "FirstSpider"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0      
                           

    def check_buzzwords(self, response):

        self.__class__.crawl_count += 1

        wordlist = [
            "keyword1",
            "keyword2",
            "keyword3"
            ]

        url = response.url
        data = response.body.decode('utf-8')
        count = 0
         
        for word in wordlist:
                substrings = find_all_substrings(data, word)
                count = 0
                word_counts = {}
                links = []
                "f = open('wordlist.csv', 'w')" 
                for pos in substrings:
                        ok = False
                        if not ok:
                            count += 1
                            word_counts[word] = {url: count}
                            
        for link in links:
            page = requests.get(link)
            data = page.text

        for word in wordlist:
                substrings = find_all_substrings(data, word)
                count = 0

        for word in wordlist:
                substrings = find_all_substrings(data, word)
                for pos in substrings:
                        ok = False
                        if not ok:
                                "f.write("{},{},{}\n".format(word,url,count))"
                                self.__class__.words_found += 1
                                print(word + ";" + url + ";" + str(count) + ";")
        with open('wordlist.csv', 'w') as f:
         for word, data in word_counts.items():
          for url, count in data.items():
            f.write("{},{},{}\n".format(word, url, count))
        f.close()
        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) != None:
                return CrawlSpider._requests_to_follow(self, response)
        else:
                return []

I want to crawl websites for certain keywords (a wordlist). My output should be a CSV file with the following information: the URL, and the count of the keywords found on that site.
I get invalid syntax for the following: "f.write("{},{},{}\n".format(word,url,count))"

And the output CSV file is often empty, or not all of the URLs get crawled.

zmeyuzjn1#

There are unnecessary quotation marks around your lines 41 and 61:

line 41 ---> "f = open('wordlist.csv', 'w')"
line 61 ---> "f.write("{},{},{}\n".format(word,url,count))"
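
With the stray quotes removed, those two statements become ordinary Python again (assuming you still want to write the file by hand instead of using a feed export):

f = open('wordlist.csv', 'w')
f.write("{},{},{}\n".format(word, url, count))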

Also, Scrapy has a built-in mechanism for this, Feed exports, so there is usually no need to save the data to a file manually.
With the FEED_EXPORT_FIELDS setting you can specify which fields of the item should be exported, and in what order.
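
For example, if each exported item carries the keyword, the URL and the count (the field names word, url and count below are an assumption chosen to match the CSV columns you want), the setting can go into settings.py or be set as a custom_settings class attribute on FirstSpider:

custom_settings = {
    "FEED_EXPORT_FIELDS": ["word", "url", "count"],
}
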
Here is the command to run the spider and save the data to a file: scrapy crawl FirstSpider -O url.csv
-O (capital "O") means "overwrite the file",
-o (lowercase "o") means "append to an existing file".
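
For the exported file to contain anything, the callback also has to yield items instead of writing the CSV itself; the check_buzzwords above only returns an empty Item(), so a feed export would have no fields to write. A minimal, untested sketch of the callback, reusing the wordlist and the find_all_substrings helper from the question and the assumed field names from above:

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        wordlist = ["keyword1", "keyword2", "keyword3"]
        url = response.url
        data = response.text  # decoded page body

        for word in wordlist:
            # number of occurrences of this keyword on the page
            count = len(find_all_substrings(data, word))
            if count:
                self.__class__.words_found += count
                # one row per (word, url) pair; the feed export writes it to the CSV
                yield {"word": word, "url": url, "count": count}

Running scrapy crawl FirstSpider -O url.csv then writes one row per keyword and URL, in the column order given by FEED_EXPORT_FIELDS.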
