Scraping a website with Scrapy and Selenium together

Asked by pu82cl6c on 2022-11-09 · 1 answer · 170 views

My biggest challenge is scraping multiple pages with Selenium and Scrapy. I have searched many questions on how to scrape multiple pages with Selenium and Scrapy, but I couldn't find any solution; the problem I'm facing is that my spider only scrapes one page.
I scraped multiple pages with Selenium alone and it worked, but Selenium is not fast at scraping many pages, so I want to move to Scrapy, which is much faster by comparison. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx

import scrapy
from scrapy import Request
from selenium import webdriver

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

    def __init__(self):
        # Raw string so the backslashes in the Windows path are not escapes
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        # `response` never changes inside this loop, so the same page-1
        # links are extracted ten times -- only page 1 is ever scraped.
        for k in range(1, 10):
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)

        # The driver never loaded this page, and the clicked result is
        # never handed back to Scrapy, so this click has no effect.
        next_btn = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
        next_btn.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get()

        # .get() can return None, so guard before stripping
        yield {
            "title1": title,
            "title2": d1.strip() if d1 else None,
            "title3": d2.strip() if d2 else None,
            "title4": d3.strip() if d3 else None,
            "title5": d4.strip() if d4 else None,
        }
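
For reference, the usual way to make the loop above actually advance is to drive pagination entirely through the browser and hand each rendered page to Scrapy as an HtmlResponse. A minimal sketch of that pattern (the 9-page bound and all XPaths are carried over from the question; the sleep is a crude stand-in for a proper WebDriverWait):

import time

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class LawyersSpider(scrapy.Spider):
    name = 'lawyers'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def start_requests(self):
        # The pager is an ASP.NET postback, so every listing page must be
        # reached by clicking in the browser, not by requesting a URL.
        self.driver.get('https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx')
        for _ in range(9):
            # Wrap the rendered HTML so ordinary Scrapy XPaths work on it.
            page = HtmlResponse(
                url=self.driver.current_url,
                body=self.driver.page_source,
                encoding='utf-8',
            )
            for href in page.xpath("//div[@class='list-group']//@href").getall():
                url = page.urljoin(href)
                if not url.endswith(('.ro', '.ro/')):
                    yield scrapy.Request(url, callback=self.parse_book)
            # Advance to the next listing page inside the browser.
            self.driver.find_element(
                By.XPATH, "//a[@id='MainContent_PagerTop_NavNext']"
            ).click()
            time.sleep(1)  # crude wait for the postback to finish

    def parse_book(self, response):
        yield {'title': response.xpath(
            "//span[@id='HeadingContent_lblTitle']//text()").get()}
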
mzaanser · answer 1

You'd be better off using, or writing your own, downloader middleware for your Scrapy project (a sketch of a hand-rolled one follows at the end of this answer). You can find everything about Scrapy downloader middlewares in the docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
I suggest using a ready-made library such as scrapy-selenium-middleware:
1. Install the library: pip install scrapy-selenium-middleware
2. Add the following settings to your Scrapy project's settings file:

DOWNLOADER_MIDDLEWARES = {"scrapy_selenium_middleware.SeleniumDownloader":451}
CONCURRENT_REQUESTS = 1 # multiple concurrent browsers are not supported yet
SELENIUM_IS_HEADLESS = False
SELENIUM_PROXY = "http://user:password@my-proxy-server:port" # set to None to not use a proxy
SELENIUM_USER_AGENT = "User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>"           
SELENIUM_REQUEST_RECORD_SCOPE = ["api*"] # a list of regular expressions; incoming requests whose URL matches are recorded
SELENIUM_FIREFOX_PROFILE_SETTINGS = {}
SELENIUM_PAGE_LOAD_TIMEOUT = 120
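
With those settings in place, the middleware fetches each request through a browser, so the spider itself needs no Selenium code at all. A minimal sketch of what the question's spider reduces to, assuming the middleware renders every request transparently (check the project README for its exact behavior):

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']

    def parse(self, response):
        # response.body here is already the browser-rendered page.
        for href in response.xpath("//div[@class='list-group']//@href").getall():
            url = response.urljoin(href)
            if not url.endswith(('.ro', '.ro/')):
                yield scrapy.Request(url, callback=self.parse_book)

    def parse_book(self, response):
        yield {'title': response.xpath(
            "//span[@id='HeadingContent_lblTitle']//text()").get()}

Note that this site's pager is an ASP.NET postback rather than a plain link, so moving past page 1 still requires browser interaction; the middleware takes care of rendering, not of clicking.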

For more information about the library, see: https://github.com/Tal-Leibman/scrapy-selenium-middleware
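
If you'd rather not take on a dependency, a minimal downloader middleware of your own is only a few lines. A sketch, using Scrapy's standard process_request hook and a shared headless Chrome (the class name and headless setup are assumptions, not the library's code):

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    """Fetch every request with one shared headless Chrome instance."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits Scrapy's own downloader.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

Enable it like any other middleware, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543} (the module path is hypothetical; use your project's).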
