Scrapy: unable to scrape multiple pages with the scrapy-playwright API

Posted by bq3bfh9z on 2022-12-23 in Other

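Background: I am a novice at web scraping. I am trying to scrape a local e-commerce site. It is a dynamic website, so I am using scrapy-playwright (chromium) with proxies.
Problem: It ran smoothly until I tried to scrape multiple pages. I used multiple URLs, each with its own page number, but instead of scraping different pages, the first page gets scraped several times. It looks like Playwright is at fault, but I cannot tell whether it is an error in my code or a bug. I have tried different flows and the result is always the same, with or without proxies, with or without user agents. I do not know why this happens.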

import logging
import scrapy
from scrapy_playwright.page import PageMethod
from helper import should_abort_request

class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):

        total= response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)   #total_pages = 4

        links = []

        for i in range(1, total_pages+1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            
            links.append(a)

        for link in links:
            res = scrapy.Request(url=link, meta={
                    "playwright": True,
                    "playwright_include_page": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector",
                                    '[class="box--ujueT"]'),
                    ]})

            yield res and {
                "link" : response.url 
            }

Output

[
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]
z8dt9xmd (answer #1)

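Instead of iterating over the pages in your start_requests method, you are extracting the number of pages in the parse method and generating further requests from there.
The problem with this strategy is that every request you generate in parse is itself handled by parse. So for each of those requests, you are telling Scrapy to generate a whole new set of requests, one per page detected from the page count, which is probably the same on every page.
Fortunately, Scrapy has a built-in duplicate filter, so it would most likely ignore these duplicates if you were yielding the requests properly.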
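To make the effect concrete, here is a toy simulation in plain Python. It is only a sketch: it is not Scrapy's scheduler or its actual duplicate filter, the names simulate_crawl, schedule, and seen are made up for illustration, and the filter is assumed to key on the page number alone.

```python
from collections import deque
from typing import Tuple


def simulate_crawl(total_pages: int) -> Tuple[int, int]:
    """Toy model of a crawl in which every response goes through the same
    callback, and that callback re-generates requests for pages 1..total_pages.

    Returns (requests generated, pages actually fetched).
    """
    seen = set()      # hypothetical stand-in for a duplicate filter
    queue = deque()   # requests waiting to be fetched
    generated = 0
    fetched = 0

    def schedule(page: int) -> None:
        nonlocal generated
        generated += 1
        if page in seen:
            return    # duplicate request: silently dropped
        seen.add(page)
        queue.append(page)

    schedule(1)       # the start request for page 1
    while queue:
        queue.popleft()
        fetched += 1
        # The single callback yields one request per page, on every response.
        for page in range(1, total_pages + 1):
            schedule(page)
    return generated, fetched


generated, fetched = simulate_crawl(4)
print(generated, fetched)  # 17 4
```

With four pages, the callback runs four times and generates sixteen follow-up requests, of which only three are new. Routing the follow-up requests to a separate callback, or generating them all in start_requests, avoids the redundant work instead of relying on the filter.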
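The next problem is your yield statement. The expression "a and b" does not return both a and b; it evaluates to b whenever a is truthy. Only if a is falsy does it evaluate to a instead.
So your yield expression...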

yield res and {
                "link" : response.url 
            }

will only ever actually yield {"link": response.url}, because res is a Request object and therefore truthy.
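The behaviour of "and" can be checked without Scrapy. In the sketch below, FakeRequest is a hypothetical stand-in for a request object (any plain object is truthy unless it defines otherwise), and parse_fixed is an illustrative name showing one way to emit both values: two separate yield statements.

```python
class FakeRequest:
    """Hypothetical stand-in for a request object; truthy like any plain object."""


res = FakeRequest()
item = {"link": "https://example.com/?page=1"}

# `and` returns its right operand when the left one is truthy,
# so the request is discarded and only the dict survives.
result = res and item
print(result is item)  # True

# When the left operand is falsy, `and` returns the left operand instead.
print(None and item)   # None
print(0 and item)      # 0


def parse_fixed(request, item):
    """Emit both objects by yielding them one after the other."""
    yield request
    yield item


outputs = list(parse_fixed(res, item))
print(len(outputs))    # 2
```

In a real spider, the same idea means yielding the scrapy.Request and the item dict in separate statements rather than joining them with "and".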
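Apart from what is described above, your code does nothing else. However, since you tell the page to wait until the element for each item on sale has rendered, I assume your ultimate goal is to extract data from every item on the page.
With that in mind, I would suggest not using scrapy_playwright at all and instead getting the data from the JSON API that the site uses in its AJAX requests.
For example: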

import scrapy

class ABCSpider(scrapy.Spider):
    name = "ABC"

    def start_requests(self):
        # Pages are numbered from 1; the question reports 4 pages in total.
        for i in range(1, 5):
            url = f"https://www.daraz.com.bd/xbox-games/?ajax=true&page={i}&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO"
            yield scrapy.Request(url)

    def parse(self, response):
        # With ajax=true the site responds with JSON instead of rendered HTML.
        data = response.json()
        items = data["mods"]["listItems"]
        for item in items:
            yield {"name": item['name'],
                   "brand": item['brandName'],
                   "price": item['price']}

Partial output:

{'name': 'Xbox 360 GamePad, Xbox 360 Controller for Windows', 'brand': 'Microsoft', 'price': '1400.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Pole Bugatt RT 360 12FIT Fishing Rod Hat Chip', 'brand': 'No Brand', 'price': '1020.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'Microsoft', 'price': '1250.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '【Seyijian】 1Set RB LB Bumpers Buttons for Microsoft XBox Series X Controller Button Holder RHA', 'brand': 'No Brand', 'price': '452.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'For Xbox One S Slim Internal Power Supply Adapter Replacement N115-120P1A 12V', 'brand': 'No Brand', 'price': '2591.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'DOU Lb Rb Lt Rt Front Bumper Buttons Set Replacement Accessory, Fits for X box Series S X Controllers', 'brand': 'No Brand', 'price': '602.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'IVYUEEN 2 Sets RB LB Bumpers Buttons for XBox Series X S Controller Trigger Button Middle Holder with Screwdriver Tool', 'brand': 'No Brand', 'price': '645.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Alloy Analog Controller Thumbsticks Replacement Parts Joysticks Analog Sticks for Xbox ONE / PS4 / Switch Controller 11 Pcs', 'brand': 'MOONEYE', 'price': '1544.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FIFA 21 – Xbox One & Xbox Series X', 'brand': 'No Brand', 'price': '1800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'No Brand', 'price': '1150.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Game Consoles Flight Stick Joystick USB Simulator Flight Controller Joystick', 'brand': 'No Brand', 'price': '15179.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Power Charger Adapter For Microsoft Surfa.6 RT  Charger US Plug', 'brand': 'No Brand', 'price': '964.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '684.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FORIDE Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '763.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '663.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '739.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2208.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'TP4-005 Smart Turbo Temperature Control 5-Fan For Playstation 4 For PS4', 'brand': 'No Brand', 'price': '1239.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Stencils Bga Reballing Kit for Xbox Ps3 Chip Reballing Repair Game Consoles Repair Tools Kit', 'brand': 'No Brand', 'price': '1331.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Preloved Game Kinect xbox 360 CD Cassette Xbox360', 'brand': 'No Brand', 'price': '2138.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '734'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Shadow of the Tomb Raider - Xbox One', 'brand': 'No Brand', 'price': '2800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2322.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2027.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motheoard Repair', 'brand': 'No Brand', 'price': '649'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'XBOX 360 GAMES - DANCE CENTRAL 3 (KINECT REQUIRED) (FOR MOD /JAILBREAK CONSOLE)', 'brand': 'No Brand', 'price': '1485.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Kontrol Freek Call Of Duty Black Ops 4 Xbox One Series S-X', 'brand': 'No Brand', 'price': '810.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Hitman 2 - Xbox One', 'brand': 'No Brand', 'price': '2500.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Red Dead Redemption 2 XBOX ONE', 'brand': 'No Brand', 'price': '3800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Wired Gaming Headphones Bass Stereo Headsets with Mic for PS4 for XBOX-ONE', 'brand': 'No Brand', 'price': '977.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '10X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motheoard', 'brand': 'No Brand', 'price': '3615'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '739.00'}
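The parse callback above indexes data["mods"]["listItems"] directly and returns prices as strings, some with decimals ('2500.00') and some without ('734'). A slightly more defensive version of that extraction might look like the sketch below. The function name extract_items and the sample payload are illustrative; only the keys mods, listItems, name, brandName, and price come from the answer, and the real response may contain more fields or a different shape.

```python
from typing import Any, Dict, List


def extract_items(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pull name, brand, and numeric price out of a listing payload,
    tolerating a missing "mods" or "listItems" key."""
    items = (data.get("mods") or {}).get("listItems") or []
    rows = []
    for item in items:
        raw_price = item.get("price")
        rows.append({
            "name": item.get("name"),
            "brand": item.get("brandName"),
            # Prices arrive as strings; convert when present.
            "price": float(raw_price) if raw_price else None,
        })
    return rows


# Hand-written sample shaped like the output shown above (not a real response).
sample = {
    "mods": {
        "listItems": [
            {"name": "Hitman 2 - Xbox One", "brandName": "No Brand", "price": "2500.00"},
            {"name": "Matrix Glitcher V1 Run Chip Board", "brandName": "No Brand", "price": "734"},
        ]
    }
}

rows = extract_items(sample)
print(rows[0]["price"], rows[1]["price"])  # 2500.0 734.0
print(extract_items({}))                   # []
```

Inside the spider, parse could then simply do "yield from extract_items(response.json())".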
