Scrapy: product data in a JSON script tag cannot be targeted

Posted by 7fyelxc5 on 2022-11-09 in Other


I can parse the JSON data in the script, but I cannot target a specific tag.
If it were a "normal" script type such as "application/ld+json", collecting what I need would be pretty easy. But I cannot address the script by name, because it has no name 😅 So I copied the XPath selector for the script from dev tools.
I then read the JSON data from the product link in the Scrapy shell:

scrapy shell 'https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html' ...
...
>>> response.xpath('//*[@id="root-wrapper"]/div/script[5]/text()').get()

Of course, this gives me all the data in the script. The result shows:
[' new E4uTrack("view_item", {"currency":"EUR","value":"49","page_type":"product","title":"Makita DMP180Z Akku-Kompressor","items":[{"item_name":"Makita DMP180Z Akku-Kompressor","item_id":"189153","id":"189153","item_mpn":"DMP180Z","item_gtin":"0088381898263","item_brand":"Makita","google_business_vertical":"retail","price":"49","currency":"EUR"
Normally I would target a single value, for example the value of "item_name", with:

scrapy shell 'https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html' ...
...
>>> script_tag = response.xpath('//*[@id="root-wrapper"]/div/script[5]/text()').get()
>>> import json
>>> json.loads(script_tag)["items"]["item_name"]

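But ...
The output shows:
File "", line 1
    json.loads(script_tag)["item_name"]
IndentationError: unexpected indent
I have two questions.
Am I targeting the script correctly with the XPath selector? And how can I target only the tag I need from this script?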

scrapy

Source: https://stackoverflow.com/questions/73838203/scrapy-product-datas-in-a-json-script-tags-cannot-be-targeted

2 Answers

tmb3ates1#

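The script text is a JavaScript call, not pure JSON, so first cut out the JSON argument. In the parsed result, "items" is a list, so you have to iterate over it and extract the data you need: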

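import json

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html']

    def parse(self, response):
        # Take the fifth <script> under #root-wrapper and keep only the JSON
        # argument passed to E4uTrack("view_item", ...).
        script_tag = response.xpath(
            '(//*[@id="root-wrapper"]/div/script)[5]/text()'
        ).re_first(r'E4uTrack\("view_item",(.+?)\)$')
        if script_tag is None:
            return  # the tracking call was not found on this page
        json_data = json.loads(script_tag)

        # "items" is a list of dicts, so iterate instead of indexing by key.
        for item in json_data["items"]:
            yield {'Name': item["item_name"]}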

Output:

{'Name': 'Makita DMP180Z Akku-Kompressor'}
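
The same extraction step can be tried without running a spider. This is only a minimal sketch: `script_text` is a shortened, hand-closed sample modelled on the excerpt in the question (the real script carries more fields), and `extract_view_item` is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Assumption: shortened, hand-closed sample of the script body from the question.
script_text = (
    ' new E4uTrack("view_item", {"value":"49",'
    '"items":[{"item_name":"Makita DMP180Z Akku-Kompressor"}]})'
)


def extract_view_item(text):
    """Return the parsed JSON argument of E4uTrack("view_item", ...), or None."""
    # Greedy match from the first "{" to the last "}" that precedes a ")".
    match = re.search(r'E4uTrack\("view_item",\s*(\{.*\})\s*\)', text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_view_item(script_text)
# "items" is a list of dicts, so collect the names by iterating over it.
names = [item["item_name"] for item in data["items"]]
print(names)
```

Indexing `data["items"]["item_name"]` directly, as attempted in the question, would raise a `TypeError` because a list cannot be indexed by a string key.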

2022-11-09

rkttyhzu2#

You need to use a regular expression to clean up the XPath result:

# XPath has no backslash escapes: quote the literal with single quotes instead.
json_raw = response.xpath("//script[contains(., 'E4uTrack(\"view_item\"')]/text()").re_first(r'E4uTrack\("view_item",(.+?)\)$')
data = json.loads(json_raw)
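
If the regular expression turns out to be brittle (for example when more JavaScript follows the call), one alternative is to let the JSON decoder find the end of the object itself. This is a different technique from the one in the answer, sketched with the standard library's `json.JSONDecoder.raw_decode`; `decode_first_object` and the `sample` string are hypothetical and only for illustration.

```python
import json


def decode_first_object(text, marker='E4uTrack("view_item",'):
    """Parse the first JSON object that follows `marker`; return None if absent."""
    start = text.find(marker)
    if start == -1:
        return None
    brace = text.find("{", start + len(marker))
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever JavaScript comes after it.
    obj, _end = json.JSONDecoder().raw_decode(text, brace)
    return obj


# Assumption: hand-written sample with trailing JavaScript after the call.
sample = (
    'new E4uTrack("view_item", '
    '{"items":[{"item_name":"Makita DMP180Z Akku-Kompressor"}]}); otherCall();'
)
result = decode_first_object(sample)
print(result["items"][0]["item_name"])
```

In a spider, the `text` argument would be the string returned by `.get()` on the script selector.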

2022-11-09
