Scrapy: scraping multiple pages when the URL stays the same during pagination

wi3ka0sx  posted on 2022-11-09  in Other

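I want to scrape multiple pages, but when I move to another page the URL stays the same. How can I scrape all of the pages? If there is a solution, please share it. The page is https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx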

import scrapy
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    }

    def parse(self, response):
        # Collect the detail-page links listed on the current page.
        books = response.xpath("//div[@class='list-group']//@href").getall()
        for book in books:
            url = response.urljoin(book)
            # Skip links that point at a bare site root instead of a detail page.
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # get(default='') keeps .strip() from failing when a node is missing.
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get(default='').strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get(default='').strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get(default='').strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get(default='').strip()

        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }

biswetbf  #1

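The page content is loaded dynamically: to reach the next page (or a specific page) you have to click a navigation button or enter the page number in the form, which is why the URL never changes.
In this case, use Scrapy together with Selenium, or Selenium on its own.
You can look at the scrapy-selenium middleware for this.
You can perform the Selenium actions inside the parse_book method and keep using Scrapy to extract the data.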
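
As a rough illustration of the "click the navigation button" idea, here is a minimal sketch of a pagination loop. It is not taken from the site or from scrapy-selenium: the helper name `iter_page_sources`, the `next_xpath` argument and the fixed `pause` are assumptions made for the example, and the real XPath of the site's "next" control has to be found by inspecting the page. The helper only assumes that `driver` behaves like a Selenium 4 WebDriver (it uses `page_source` and `find_elements(by, value)`).

```python
import time


def iter_page_sources(driver, next_xpath, max_pages=5, pause=1.0):
    """Yield the HTML of each page, clicking a 'next' control in between.

    `driver` is expected to behave like a Selenium 4 WebDriver.
    `next_xpath` is a placeholder: the real locator must be taken from the page.
    """
    for page_number in range(1, max_pages + 1):
        # The URL does not change, so the rendered DOM is the only source of data.
        yield driver.page_source
        if page_number == max_pages:
            break
        # "xpath" is the string value of selenium's By.XPATH constant.
        buttons = driver.find_elements("xpath", next_xpath)
        if not buttons:
            break  # no 'next' control found: assume this was the last page
        buttons[0].click()
        time.sleep(pause)  # crude wait for the new content to load


# Hypothetical usage with a real browser (needs selenium and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx")
#   for html in iter_page_sources(driver, "//a[@id='next']"):  # placeholder XPath
#       ...  # extract the links from html
#   driver.quit()
```

Each yielded HTML string can be wrapped in `scrapy.Selector(text=html)` so that the XPath expressions from the question can be reused on it. A fixed sleep is the simplest possible wait; an explicit wait on an element that changes between pages would likely be more reliable.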
