Scrapy: trying to scrape an email address

Posted by a6b3iqyw on 2022-11-09 in Other

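I am trying to scrape an email address, but my spider returns None. This is the page link: https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry
I opened the Network tab and inspected the HTML, but the email address is not present in the HTML code: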

<div class="contact"><p>Contacter par email : <span id="cloak65106">Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser.</span><script type='text/javascript'>

Code:

import scrapy
from scrapy.http import Request

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry']
    page_number = 1

    def parse(self, response):
        mail=response.xpath("//span//a[starts-with(@href, 'mailto')]/@href").get()
        yield{
            'email':mail
        }
y1aodyip #1

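The page is static except for the email part, which is filled in by JavaScript. That is why you get None. To get the email address, you can use Scrapy with SeleniumRequest: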

import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(url='https://www.avocats-lille.com/fr/annuaire/avocats-du-tableau-au-barreau-de-lille/3?view=entry', callback=self.parse)

    def parse(self, response):
        # scrapy-selenium exposes the Selenium WebDriver in response.meta
        driver = response.meta['driver']
        # Parse the page source as seen by the browser, after JavaScript has run
        r = Selector(text=driver.page_source)
        yield {
            'mail_link': r.xpath('//*[@class="contact"]/following-sibling::div[1]/p/span/a/@href').get(),
            'mail': r.xpath('//*[@class="contact"]/following-sibling::div[1]/p/span/a/text()').get(),
        }

Output:

{'mail_link': 'mailto:fzabdellatif@2MZA-avocats.com', 'mail': 'fzabdellatif@2MZA-avocats.com'}
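
To see why the original spider got None, it can help to compare the HTML that Scrapy downloads with the HTML after the browser has run the page's script. The sketch below uses only the standard library and is not part of the answer's spider. `STATIC_HTML` is abridged from the snippet in the question; `RENDERED_HTML` is an assumed shape for the rendered fragment, built from the output above; `MailtoFinder` and `find_mailto_addresses` are hypothetical helper names.

```python
from html.parser import HTMLParser


class MailtoFinder(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part.
            self.addresses.append(href[len("mailto:"):].split("?", 1)[0])


def find_mailto_addresses(html):
    """Return every mailto address found in the given HTML string."""
    finder = MailtoFinder()
    finder.feed(html)
    finder.close()
    return finder.addresses


# Static HTML as served to Scrapy (abridged from the question).
STATIC_HTML = (
    '<div class="contact"><p>Contacter par email : '
    '<span id="cloak65106">Cette adresse e-mail est protégée contre les '
    'robots spammeurs. Vous devez activer le JavaScript pour la visualiser.'
    '</span></p></div>'
)

# Assumed shape of the same fragment after JavaScript has run.
RENDERED_HTML = (
    '<div><p>Contacter par email : <span id="cloak65106">'
    '<a href="mailto:fzabdellatif@2MZA-avocats.com">'
    'fzabdellatif@2MZA-avocats.com</a></span></p></div>'
)

print(find_mailto_addresses(STATIC_HTML))    # []
print(find_mailto_addresses(RENDERED_HTML))  # ['fzabdellatif@2MZA-avocats.com']
```

The static markup contains only the placeholder text inside the span, so any selector looking for a mailto link has nothing to match until a browser has executed the script.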

You must add the following code to your settings.py file:


# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Selenium

from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')

# '--headless' if using chrome instead of firefox

SELENIUM_DRIVER_ARGUMENTS = ['--headless']
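
One thing to watch in these settings: `shutil.which` returns None when the executable is not on PATH, so a missing chromedriver would leave SELENIUM_DRIVER_EXECUTABLE_PATH set to None without any warning at this point. A minimal sketch of failing early instead, assuming a hypothetical helper named `require_executable` (it is not part of scrapy-selenium):

```python
from shutil import which


def require_executable(name):
    """Return the full path of `name` on PATH, or raise if it is missing."""
    path = which(name)
    if path is None:
        raise RuntimeError(
            f"{name!r} was not found on PATH; install it or add its folder to PATH"
        )
    return path


# In settings.py this could replace the bare which('chromedriver') call:
# SELENIUM_DRIVER_EXECUTABLE_PATH = require_executable('chromedriver')
```

With this in place, a missing driver surfaces as a clear error when the settings module is loaded rather than later inside the middleware.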
