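I want to scrape multiple pages, but the URL stays the same when I move to another page. How can I scrape several pages in this situation? If there is a solution, please share it. The page link is https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx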
import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        # Collect every link in the listing and follow it to its detail page.
        books = response.xpath("//div[@class='list-group']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            # Skip links that point back to the site root.
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        # default='' avoids calling .strip() on None when a node is missing.
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get(default='').strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get(default='').strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get(default='').strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get(default='').strip()
        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
1 answer
biswetbf #1
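The page content is loaded dynamically: to reach the next page (or any specific page) you have to click the navigation button or enter the page number in the form.
In this case, use Scrapy together with Selenium, or Selenium on its own.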
You can take a look at the scrapy-selenium middleware.
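For reference, enabling scrapy-selenium is mostly a matter of a few project settings. The fragment below follows the setup described in the scrapy-selenium README; the choice of Firefox with geckodriver is only an assumption for illustration, and the package may not work with every Selenium release, so check it against the versions you have installed.

```python
# settings.py (sketch based on the scrapy-selenium README)
from shutil import which

# Assumption: Firefox driven by geckodriver found on PATH.
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_EXECUTABLE_PATH = which("geckodriver")  # None if not installed
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]  # run the browser without a window

# Register the middleware so SeleniumRequest objects are rendered in a browser.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```

With this in place, the spider would yield `SeleniumRequest` (from `scrapy_selenium`) instead of a plain `Request` for pages that need a browser.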
You can perform the Selenium actions inside the parse_book method and keep using Scrapy to extract the data.
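A minimal sketch of the "click, then scrape" loop might look like the following. It is not taken from the answer: the helper name `iter_page_sources` and the `NEXT_XPATH` locator are hypothetical, and the real locator has to be found by inspecting the pager on the site. The function only relies on `page_source`, `find_elements` and `click`, which any Selenium WebDriver provides.

```python
import time
from typing import Iterator

# Hypothetical locator for the pager's "next" control. This is an assumption:
# inspect the real page and replace it with the actual XPath.
NEXT_XPATH = "//a[normalize-space()='Next']"


def iter_page_sources(driver, next_xpath: str = NEXT_XPATH,
                      max_pages: int = 5,
                      wait_seconds: float = 2.0) -> Iterator[str]:
    """Yield the HTML of each results page, clicking "next" in between.

    `driver` is a Selenium WebDriver that already has the listing page open.
    """
    for page_number in range(1, max_pages + 1):
        yield driver.page_source
        if page_number == max_pages:
            break
        # "xpath" is the string value of selenium's By.XPATH.
        buttons = driver.find_elements("xpath", next_xpath)
        if not buttons:
            break  # no "next" control found: assume this was the last page
        buttons[0].click()
        # Crude fixed pause so the new rows can load; an explicit
        # WebDriverWait on a page-specific element is more robust.
        time.sleep(wait_seconds)
```

Inside a Scrapy callback handled by scrapy-selenium, the driver is available as `response.request.meta['driver']`. Each HTML string the helper yields can be wrapped in `scrapy.Selector(text=html)` and queried with the same XPath expressions used in `parse`.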