KeyError: 'driver' when putting Scrapy and Selenium together

iklwldmw  posted on 2022-11-09 in Other
Follow (0) | Answers (1) | Views (116)

The spider scrapes the first page, but when it moves to the second page it raises KeyError: 'driver'. Is there any solution for this? I want to build a web crawler using scrapy-selenium. This is the page link: https://barreau-montpellier.com/annuaire-professionnel/?cn-s and my code looks like this:

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        yield SeleniumRequest(url='https://barreau-montpellier.com/annuaire-professionnel/?cn-s=', callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        r = Selector(text=driver.page_source)

        details = r.xpath("//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''

            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(".//span[@class='family-name']//text()").get()
            name = n1 + n2

            telephone = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()

            fax = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()

            street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code

            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
            next_page = 'https://barreau-montpellier.com/annuaire-professionnel/pg/' + str(TestSpider.page_number) + '/?cn-s'
            if TestSpider.page_number <= 155:
                TestSpider.page_number += 1
                yield response.follow(next_page, callback=self.parse)

In settings.py, I added the following:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('C:\Program Files (x86)\chromedriver.exe')
SELENIUM_DRIVER_ARGUMENTS=['--headless']  

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
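
An aside on those settings (my note, not part of the original question): shutil.which() is normally given a bare command name to look up on PATH; when the full path to chromedriver.exe is already known, it can be assigned directly, ideally as a raw string so Windows backslashes are never swallowed by escape sequences. A minimal variant, assuming the same install location:

SELENIUM_DRIVER_NAME = 'chrome'
# Assumed install path; the raw string keeps the backslashes intact.
SELENIUM_DRIVER_EXECUTABLE_PATH = r'C:\Program Files (x86)\chromedriver.exe'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']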

Answer 1, by yks3o0rb:

Why are you actually getting KeyError: 'driver'? I tested your code more than once, and I am fairly sure of the cause. Have you tried running it without the pagination part? I got the same KeyError: 'driver', and as soon as I removed the pagination block the error was gone, so the incorrect next-page/pagination is the culprit: response.follow() builds a plain scrapy.Request, which scrapy-selenium's SeleniumMiddleware does not process, so response.meta['driver'] is never set on the follow-up pages. Instead, I did the pagination with a range() in start_requests(self), and it works without any problem; this type of pagination is also about twice as fast as the other kind.
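
For reference, here is a minimal sketch (my illustration, not the answerer's code) of the other way to fix it: keep the in-parse pagination but yield a SeleniumRequest instead of response.follow(), so the middleware keeps injecting the driver:

# Hypothetical alternative for the end of parse(): a SeleniumRequest goes
# through SeleniumMiddleware, so response.meta['driver'] stays available.
next_page = ('https://barreau-montpellier.com/annuaire-professionnel/pg/'
             + str(TestSpider.page_number) + '/?cn-s')
if TestSpider.page_number <= 155:
    TestSpider.page_number += 1
    yield SeleniumRequest(url=next_page, callback=self.parse)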

Full working code:

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    page_number = 1

    def start_requests(self):
        urls = ['https://barreau-montpellier.com/annuaire-professionnel/pg/'+str(x)+'/?cn-s' for x in range(1,156)]
        for url in urls:
            yield SeleniumRequest(
                url= url,
                callback=self.parse,
                wait_time=3)

    def parse(self, response):

        driver = response.meta['driver']
        r = Selector(text=driver.page_source)

        details = r.xpath("//div[@class='cn-entry cn-background-gradient']")
        for detail in details:
            email = detail.xpath(".//span[@class='email cn-email-address']//a//@href").get()
            try:
                email = email.replace("mailto:", "")
            except:
                email = ''

            n1 = detail.xpath(".//span[@class='given-name']//text()").get()
            n2 = detail.xpath(".//span[@class='family-name']//text()").get()
            name = n1 + n2

            telephone = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workphone']//a//text()").get()

            fax = detail.xpath(".//span[@class='tel cn-phone-number cn-phone-number-type-workfax']//a//text()").get()

            street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get()
            locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get()
            code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get()
            address = street + locality + code

            yield {
                'name': name,
                'mail': email,
                'telephone': telephone,
                'Fax': fax,
                'address': address
            }
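
One caveat that applies to both versions (my note, not the answerer's): .get() returns None when an XPath matches nothing, so the bare concatenations n1 + n2 and street + locality + code can raise TypeError on incomplete directory entries, the same failure mode the email field's try/except already guards against. A small defensive sketch:

# Hypothetical guard: default missing fields to '' before concatenating,
# so one incomplete entry cannot crash the whole callback.
n1 = detail.xpath(".//span[@class='given-name']//text()").get() or ''
n2 = detail.xpath(".//span[@class='family-name']//text()").get() or ''
name = n1 + n2

street = detail.xpath(".//span[@class='adr cn-address']//span[@class='street-address notranslate']//text()").get() or ''
locality = detail.xpath(".//span[@class='adr cn-address']//span[@class='locality']//text()").get() or ''
code = detail.xpath(".//span[@class='adr cn-address']//span[@class='postal-code']//text()").get() or ''
address = street + locality + code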

Output:

{'name': 'CharlesZWILLER', 'mail': 'zwiller.avocat@gmail.com', 'telephone': '04 67 60 24 56', 'Fax': '04 67 60 00 58', 'address': '24 Bd du Jeu de PaumeMONTPELLIER34000'}
2022-08-15 11:56:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:51142/session/da80a3907e6e6e78f9356f20bf4103be HTTP/1.1" 200 14
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2022-08-15 11:56:31 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2022-08-15 11:56:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 29687144,
 'downloader/response_count': 155,
 'downloader/response_status_count/200': 155,
 'elapsed_time_seconds': 2230.899805,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 15, 18, 56, 31, 850294),
 'item_scraped_count': 1219,
 'log_count/DEBUG': 3864,
 'log_count/INFO': 37,
 'response_received_count': 155,
 'scheduler/dequeued': 155,
