Extract script element with embedded JSON from page

Posted by ej83mcc0 on 2023-06-25 in Other


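I am trying to get the store locations for a specific state from this web page, but I can only retrieve content_type: text/html. How can I get the JSON part? I know it is there from looking at the HTML file.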

import requests
import json
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9',
    'Content-type': 'application/json', 
    'Accept': 'text/plain'
}

response = requests.get('https://www.sephora.com/happening/storelist', headers=headers)
print(response.status_code)
print(response.headers['Content-Type'])

The result is as follows:

200
text/html; charset=UTF-8

Of course, I then tried response.json(), which throws an exception:

requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Inspecting the output of full = json.dumps(response.text) gives very messy text, part of which looks like a dictionary:
<script id=\"linkStore\" type=\"text/json\" data-comp=\"PageJSON \">{\"page\": ....etc
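The escaped quotes in that output are expected: json.dumps serializes a Python object into JSON text, so passing it a str returns one quoted string literal rather than parsed data. A minimal sketch of the difference, assuming a tiny stand-in HTML string (the {"page": 1} payload is hypothetical, not the real page content):

```python
import json

# Hypothetical stand-in for response.text; the real payload is elided in the question.
html = '<script id="linkStore" type="text/json">{"page": 1}</script>'

# json.dumps serializes a Python object INTO JSON text. Given a str, it
# returns a quoted JSON string literal with the inner quotes escaped.
dumped = json.dumps(html)
print(dumped)

# json.loads goes the other way: it parses JSON text into Python objects.
# It only works on the JSON payload itself, not on the surrounding HTML,
# so the payload is sliced out here by its outer braces.
payload = html[html.index("{"):html.rindex("}") + 1]
data = json.loads(payload)
print(data["page"])
```

Slicing by braces is only for illustration; on a real page an HTML parser is the more reliable way to isolate the script element.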

Edit: working code

It works by parsing the HTML with bs4 and then loading the dict from the script tag whose type is text/json.

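from bs4 import BeautifulSoup

# Parse the HTML and locate the <script type="text/json"> element
soup = BeautifulSoup(response.text, 'html.parser')
script_tag = soup.find('script', {'type': 'text/json'})
if script_tag:
    # The tag's text is the raw JSON payload
    specific_content = script_tag.text
    json_data = json.loads(specific_content)
else:
    print("Script tag not found.")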
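If bs4 is not available, the same idea can be sketched with only the standard library. This is an illustrative alternative, not the code from the question: the class name ScriptJSONExtractor, the helper extract_embedded_json, and the sample payload are all hypothetical, and the element is matched by the id linkStore seen in the output above rather than by its type.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text of the first <script> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False   # currently inside the matching <script>
        self._done = False     # matching <script> has been fully read
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._done and dict(attrs).get("id") == self.target_id:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self._done = True

    def result(self):
        # Return the parsed JSON, or None when no matching script was found.
        return json.loads("".join(self._chunks)) if self._done else None


def extract_embedded_json(html, target_id="linkStore"):
    parser = ScriptJSONExtractor(target_id)
    parser.feed(html)
    parser.close()
    return parser.result()


# Stand-in HTML; the real page's payload is much larger and differs in content.
sample = (
    '<html><body>'
    '<script id="linkStore" type="text/json" data-comp="PageJSON ">'
    '{"page": {"stores": ["A", "B"]}}'
    '</script></body></html>'
)
print(extract_embedded_json(sample))
```

Matching on the id is narrower than matching on type, which may help if a page carries more than one text/json script element.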


Source: https://stackoverflow.com/questions/76456475/extract-script-element-with-embedded-json-from-page

1 Answer


r1wp621o1#

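The URL you are requesting does not return a JSON response. Instead, it returns text/html, which is why response.json() raises the error.
You can access the raw body of the response using the .content attribute.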

response = requests.get('https://www.sephora.com/happening/storelist', headers=headers)
print(response.content)

You can refer to the documentation here: https://requests.readthedocs.io/en/latest/user/quickstart/#binary-response-content
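One way to avoid the JSONDecodeError up front is to look at the Content-Type header before deciding how to decode the body. The sketch below is an assumption-laden illustration: is_json_content_type and decode_body are hypothetical helper names, and the check only covers application/json and +json media types.

```python
import json


def is_json_content_type(content_type):
    """Return True if a Content-Type header value names a JSON media type."""
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def decode_body(content_type, text):
    """Parse the body as JSON only when the header says it is JSON."""
    if is_json_content_type(content_type):
        return json.loads(text)
    return text  # HTML or other text: hand it to an HTML parser instead


print(is_json_content_type("text/html; charset=UTF-8"))         # False
print(is_json_content_type("application/json; charset=utf-8"))  # True
```

With requests, the header value would come from response.headers.get("Content-Type", "") and the text from response.text.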

Answered 2023-06-25
