Scrapy: How can I loop through an infinite-scroll website and extract every page?

rggaifut  posted on 2023-08-05  in  Other

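**I don't want to use the API to extract the data; I'm doing this project to learn the scraping approach.** The next-page element is not visible, and the site uses infinite scrolling. I have scraped the first page, but I can't scrape further pages or build a loop that keeps extracting until the last page. The URL is https://www.futuretools.io/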

```python
import scrapy
from scrapy_playwright.page import PageMethod


class ToolsSpider(scrapy.Spider):
    name = "tools"

    def start_requests(self):
        yield scrapy.Request(
            "https://www.futuretools.io",
            meta=dict(
                playwright=True,
                playwright_page_methods=[
                    # Wait until the first batch of tool cards has rendered
                    PageMethod("wait_for_selector", "div.jetboost-list-wrapper-n5zn > div.w-dyn-items div.tool"),
                ],
            ),
        )

    async def parse(self, response):
        # Only the items loaded on the first page are present in this response
        for tool in response.css("div.jetboost-list-wrapper-n5zn > div.w-dyn-items div.tool"):
            yield {
                'title': tool.css('div.div-block-18 a.tool-item-link---new::text').get(),
                'description': tool.css('div.tool-item-description-box---new::text').get(),
                'total_votes': tool.css('div.list-upvote div.text-block-52::text').get(),
                'category': tool.css('div.collection-list-wrapper-9 div.text-block-53::text').get(),
            }
```


cunj1qz1


You can keep scrolling until div.tool.w-dyn-item.w-col.w-col-6:nth-child(number_of_items) appears, that is, until the item at position number_of_items has been loaded.
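A minimal sketch of that scrolling approach, not a tested solution for this site. The helper name `scroll_until_item` and the `pause_ms` and `max_rounds` parameters are made up for illustration; it assumes `page` is a Playwright async `Page` and that scrolling to the bottom is what triggers the next batch to load.

```python
async def scroll_until_item(page, item_selector, number_of_items,
                            pause_ms=1000, max_rounds=100):
    """Scroll down until the item at position `number_of_items` exists.

    Returns True if the item appeared, False if `max_rounds` scrolls
    were not enough (for example, the list has fewer items).
    """
    target = f"{item_selector}:nth-child({number_of_items})"
    for _ in range(max_rounds):
        # Stop as soon as the target item is in the DOM
        if await page.locator(target).count() > 0:
            return True
        # Jump to the bottom so the site loads the next batch
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give the new items a moment to render
        await page.wait_for_timeout(pause_ms)
    return False
```

With scrapy-playwright, one way to use this would be to set `playwright_include_page=True` in the request meta, take the page from `response.meta["playwright_page"]` inside an `async def parse`, await the helper, and then build a selector from `await page.content()` before closing the page.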

Edit:

Here I used the next_page button instead. You could take the scrolling approach, but this seems cleaner.

import scrapy


class ToolsSpider(scrapy.Spider):
    name = "tools"
    start_urls = ['https://www.futuretools.io/']

    def parse(self, response):
        # Extract every tool card on the current page
        for tool in response.css("div.jetboost-list-wrapper-vq3k.w-dyn-list div.tool.w-dyn-item.w-col.w-col-6"):
            yield {
                'title': tool.css('div.div-block-18 a.tool-item-link---new::text').get(),
                'description': tool.css('div.tool-item-description-box---new::text').get(),
                'total_votes': tool.css('div.list-upvote div.text-block-52::text').get(),
                'category': tool.css('div.collection-list-wrapper-9 div.text-block-53::text').get()
            }

        # Follow the pagination link until no next page is left
        next_page = response.css('.w-pagination-next.next')
        if next_page:
            yield response.follow(next_page.attrib['href'], callback=self.parse)
