Extracting JSON data with a Scrapy spider?

nfs0ujit  posted on 2022-11-09 in Other

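I'm trying to scrape product data that happens to be delivered in an XHR request. If I reference the XHR URL in full, I can scrape the data I need. But the site I'm trying to scrape uses a different XHR request for each product page.
Here is a product: https://www.midwayusa.com/product/939287480?pid=598174 . I noticed that if you take each page's URL and add "data" after "product", as in https://www.midwayusa.com/productdata/939287480?pid=598174 , you get the URL of the XHR request. I don't know how to make a crawler do this, as this is only my second scraper and I'm new to Python.
So what would be the easiest way to get the JSON data from each crawled page?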

import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PwspiderSpider(CrawlSpider):
    name = 'pwspider'
    allowed_domains = ['midwayusa.com']
    start_urls = ['https://www.midwayusa.com/s?searchTerm=backpack']

    # Only extract links found inside product list items
    le_backpack_title = LinkExtractor(restrict_css='li.product')

    # Call parse_item for each extracted product link; do not follow links from those pages
    rule_Backpack_follow = Rule(le_backpack_title, callback='parse_item', follow=False)

    # Rules set so the bot can't leave the URL
    rules = (
        rule_Backpack_follow,
    )

    def start_requests(self):
        yield scrapy.Request('https://www.midwayusa.com/s?searchTerm=backpack',
            meta={'playwright': True})

    def parse_item(self, response):
        data = json.loads(response.body)
        yield from data['products']
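The URL pattern described in the question (a product page URL becomes its XHR URL when /product/ is changed to /productdata/) can be sketched as a small helper. This is only an illustration: the function name product_to_data_url is hypothetical, and the sketch assumes the pattern holds for every product page, which the thread does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit


def product_to_data_url(product_url):
    """Map a /product/<id> page URL to its /productdata/<id> counterpart.

    Hypothetical helper: it assumes the pattern from the question
    applies to every product page on the site.
    """
    parts = urlsplit(product_url)
    if not parts.path.startswith("/product/"):
        raise ValueError(f"not a product page URL: {product_url}")
    # Swap only the path prefix; scheme, host and query string stay as they are.
    data_path = "/productdata/" + parts.path[len("/product/"):]
    return urlunsplit(parts._replace(path=data_path))


print(product_to_data_url("https://www.midwayusa.com/product/939287480?pid=598174"))
# https://www.midwayusa.com/productdata/939287480?pid=598174
```

Inside a Scrapy callback, the returned URL could be passed to response.follow together with a second callback that parses the JSON body.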


beq87vna


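I tested the page. It uses JavaScript to generate the page with the search results, but it doesn't fetch the data from another URL: all of the information is stored directly in the HTML, as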

<script> 
    window.icvData = {...} 
</script>

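The same is true of the product pages: they also have their data stored directly in the HTML.
Sometimes they have an extra window.icvData.firstSaleItemId = ... line,
but I skipped that information.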

import scrapy
import json
from scrapy.spiders import Spider

class PwspiderSpider(Spider):

    name = 'pwspider'

    allowed_domains = ['midwayusa.com']

    start_urls = ['https://www.midwayusa.com/s?searchTerm=backpack']

    def parse(self, response):
        print('url:', response.url)

        script = response.xpath('//script[contains(text(), "window.icvData")]/text()').get()
        #print(script)

        text = script.split("window.icvData = ")[-1].split('\n')[0].strip()

        try:
            data = json.loads(text)
        except Exception as ex:
            print('Exception:', ex)
            print(text)
            return

        #print(data["searchResult"].keys())

        products = data["searchResult"]['products']

        for item in products:
            #print(item)
            colors = [color['name'] for color in item['swatches']]
            print(item['description'], colors)
            yield response.follow(item['link'], callback=self.parse_product, cb_kwargs={'colors': colors})

    def parse_product(self, response, colors):
        print('url:', response.url)

        script = response.xpath('//script[contains(text(), "window.icvData")]/text()').get()
        #print(script)

        # Use `.split('\n')[0]` because a second line with `window.icvData.firstSaleItemId = ...` sometimes follows
        text = script.split("window.icvData = ")[-1].split('\n')[0].strip()

        try:
            data = json.loads(text)
            data['colors'] = colors
        except Exception as ex:
            print('Exception:', ex)
            print(text)
            return

        yield data

# --- run without project and save in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.json': {'format': 'json'}},  # new in 2.1
})
c.crawl(PwspiderSpider)
c.start()
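The split('\n')[0] step in the spider relies on the JSON object sitting on a single line. As a hedged alternative, the standard library's json.JSONDecoder.raw_decode parses one JSON value and stops there, so trailing text such as a semicolon or a later assignment does not matter. The helper name extract_icv_data and the sample script text below are made up for illustration; the real page content may differ.

```python
import json

MARKER = "window.icvData = "


def extract_icv_data(script_text, marker=MARKER):
    """Return the object assigned to window.icvData, or None if the marker is absent.

    Hypothetical helper; assumes the assigned value is valid JSON.
    """
    start = script_text.find(marker)
    if start == -1:
        return None
    # raw_decode does not skip leading whitespace, so strip it first.
    payload = script_text[start + len(marker):].lstrip()
    # Parses a single JSON value and ignores anything that follows it.
    data, _end = json.JSONDecoder().raw_decode(payload)
    return data


# Made-up sample that mimics the shape described in the answer.
sample = """
    window.icvData = {"searchResult": {"products": [{"description": "Pack A", "link": "/product/1"}]}};
    window.icvData.firstSaleItemId = 123;
"""
products = extract_icv_data(sample)["searchResult"]["products"]
print(products[0]["link"])  # /product/1
```

A helper like this could stand in for the two split(...) lines in parse and parse_product. Malformed input raises json.JSONDecodeError, which the existing try/except would still catch if the call is placed inside it.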
