My biggest challenge is scraping multiple pages with Selenium and Scrapy. I have searched many questions about how to do this, but I cannot find any solution; the problem I face is that the spider only scrapes one page.
Scraping multiple pages with Selenium alone worked for me, but Selenium is not fast at scraping many pages, so I moved to Scrapy, which is much faster by comparison. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx
import scrapy
from scrapy import Request  # needed for the yield Request(...) below
from selenium import webdriver

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self):
        # Raw string so the backslashes in the Windows path are not read as escapes.
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        for k in range(1, 10):
            # `response` is the single page Scrapy downloaded; the Selenium
            # clicks below never update it, so every iteration re-extracts
            # the same first page -- this is why only one page is scraped.
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)
            # The driver never loaded the page (there is no self.driver.get(...)),
            # and even if it had, Scrapy never reads its page source.
            next_link = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
            next_link.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get()
        d1 = d1.strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get()
        d2 = d2.strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get()
        d3 = d3.strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get()
        d4 = d4.strip()
        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
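Note on why the loop above stops at page one: the Scrapy response is a one-time snapshot, while the clicks happen in a separate Selenium browser whose page source Scrapy never reads. A minimal sketch of the usual workaround is to re-parse the driver's live page source after each click; this is a drop-in replacement for the parse method above, with names mirroring the question's code, and it assumes the site's next-button pagination behaves as the XPath suggests:

from scrapy import Request, Selector

def parse(self, response):
    # Load the listing in the real browser once, then page through it there.
    self.driver.get(response.url)
    for k in range(1, 10):
        # Re-parse the browser's current HTML on every iteration so each
        # click actually yields a new page of links.
        page = Selector(text=self.driver.page_source)
        for book in page.xpath("//div[@class='list-group']//@href").extract():
            url = response.urljoin(book)
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield Request(url, callback=self.parse_book)
        # A short explicit wait may be needed here for the next page to render.
        self.driver.find_element_by_xpath(
            "//a[@id='MainContent_PagerTop_NavNext']").click()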
1 Answer
You had better use or create a downloader middleware for your Scrapy project. You can find everything about Scrapy downloader middlewares in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
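For reference, the pattern such a middleware implements is small: fetch each request's URL with Selenium and hand the rendered HTML back to Scrapy as the response, so the spider never touches the driver. Below is a minimal sketch, assuming Chrome; the class name, module path, and priority value are illustrative choices, not any particular library's API:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloaderMiddleware:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Render the page in the browser, then return it as the response.
        # Returning a Response from process_request tells Scrapy to skip
        # its own downloader for this request.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

# Enabled in settings.py (path and priority are illustrative):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumDownloaderMiddleware': 543}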
I suggest using a ready-made library such as scrapy-selenium-middleware.
1. Install the library:
pip install scrapy-selenium-middleware
2. Set the following settings in your Scrapy project's settings file:
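The settings themselves did not survive in this copy of the answer. As a hedged sketch reconstructed from the library's README (linked below), the configuration looks roughly like this; treat every setting name here as an assumption to verify against the repository:

# settings.py -- sketch only; names are assumptions taken from the
# scrapy-selenium-middleware README, so check them against the repo.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium_middleware.SeleniumDownloader": 451,
}
CONCURRENT_REQUESTS = 1           # the middleware drives a single browser
SELENIUM_IS_HEADLESS = False      # True runs the browser without a window
SELENIUM_PAGE_LOAD_TIMEOUT = 120  # seconds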
For more information about the library, visit: https://github.com/Tal-Leibman/scrapy-selenium-middleware