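So, I need to extract the reviews from a product URL on this site, more specifically the username, date, text and score. However, I'm having trouble because I keep getting this error: Failed to retrieve reviews for page 1. Error: "Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read). I tried adding a time delay, but it still doesn't work. How can I modify it?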
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.emag.ro/covor-antiderapant-negru-poliester-80-x-300-cm-c027-80x300/pd/DBY5YJMBM/?ref=sponsored_products_fill_a_b_5_3&provider=rec&recid=rec_73_c449bb3e50b63cc8f6da4a42a31af359f6cbfb3c547bc5748cb6d45501a29685_1684315709&scenario_ID=73&aid=034a897a-956c-11ed-9004-0ab644dfda7c&oid=89847310"
review_url = "https://www.emag.ro/review/get-review-listing-page?id={product_id}&page={page}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}
product_id = url.split("/pd/")[1].split("/")[0]
reviews = []
page = 1
while True:
    r_url = review_url.format(product_id=product_id, page=page)
    try:
        response = requests.get(r_url, headers=headers)
        response.raise_for_status()
        data = response.json()
    except (requests.RequestException, json.JSONDecodeError) as e:
        print(f"Failed to retrieve reviews for page {page}. Error: {str(e)}")
        break
    if not data['reviews']:
        break
    for r in data['reviews']:
        review_text = r['content']
        author = r['author']['name']
        date = r['date']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})
    page += 1
with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)
1 Answer
Your review URL is wrong.
To get the reviews, you need the following parts, for example:
https://www.emag.ro/product-feedback/
covor-pufos-moale-compatibil-multiple-spatii-si-stiluri-grosime-4cm-120cm-x-160cm-gri-ronyes18/pd/DBSKZPMBM
/reviews/list
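The three parts above can be assembled from an ordinary product URL. Here is a minimal sketch, assuming the product URL always follows the `<slug>/pd/<id>` layout seen in the examples; `build_reviews_url` and `FEEDBACK_BASE` are names made up for illustration, not part of any eMAG API.

```python
from urllib.parse import urlsplit

# Prefix taken from the example URLs in this answer.
FEEDBACK_BASE = "https://www.emag.ro/product-feedback"


def build_reviews_url(product_url: str) -> str:
    """Turn a product page URL into its reviews-list URL (hypothetical helper)."""
    # urlsplit separates the path from the query string, so tracking
    # parameters such as ?ref=... are dropped automatically.
    segments = [s for s in urlsplit(product_url).path.split("/") if s]
    # Assumed layout of the path: <slug>/pd/<product id>
    if "pd" not in segments:
        raise ValueError(f"no /pd/ segment in {product_url!r}")
    pd_index = segments.index("pd")
    if pd_index == 0 or pd_index + 1 >= len(segments):
        raise ValueError(f"unexpected path layout in {product_url!r}")
    slug = segments[pd_index - 1]
    product_id = segments[pd_index + 1]
    return f"{FEEDBACK_BASE}/{slug}/pd/{product_id}/reviews/list"
```

Calling `build_reviews_url(url)` with the `url` from the question should yield a link with the same shape as the ones shown in this answer.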
This gives you a JSON with everything you need. See for yourself:
https://www.emag.ro/product-feedback/covor-kring-meknes-1200-gsm-100-poliester-160x230-cm-maro-e2020-8b/pd/D605NYMBM/reviews/list
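One way to look at that JSON from Python, and to ride out an occasional dropped connection, is a small fetch helper with a timeout and a few retries. This is a sketch, not the answer's original code: it uses only the standard library, `fetch_json` and its `opener`, `retries` and `delay` parameters are hypothetical names, and the User-Agent is simply the one from the question.

```python
import json
import time
import urllib.request
from http.client import HTTPException
from urllib.error import URLError

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) "
                  "Gecko/20100101 Firefox/112.0"
}


def fetch_json(url, opener=urllib.request.urlopen, retries=3, delay=1.0):
    """Fetch `url` and decode the body as JSON, retrying on failures."""
    request = urllib.request.Request(url, headers=HEADERS)
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # The timeout keeps a stalled connection from hanging forever.
            with opener(request, timeout=30) as response:
                return json.loads(response.read().decode("utf-8"))
        except (URLError, HTTPException, OSError, json.JSONDecodeError) as exc:
            last_error = exc
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

The `opener` argument exists only so the helper can be exercised without network access. Retrying will not help if the URL itself is wrong, which is the actual problem in the question.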
Here's a complete working example:
This should print (shortened for brevity):
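This also dumps the JSON responses to files as it iterates over the URLs. The example JSON is too large to show in full, but running the following code, for example:
should output: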
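The dump-and-inspect step mentioned above could look roughly like this. The key names inside the eMAG response are not shown here, so the sketch makes no assumption about them and only lists whatever top-level keys the saved file contains; `dump_payload` and `top_level_keys` are illustrative names.

```python
import json
from pathlib import Path


def dump_payload(payload, product_id, out_dir="."):
    """Write one decoded JSON response to <out_dir>/<product_id>.json."""
    path = Path(out_dir) / f"{product_id}.json"
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Romanian diacritics readable in the file.
        json.dump(payload, f, indent=4, ensure_ascii=False)
    return path


def top_level_keys(path):
    """Reload a dumped file and return its top-level keys, sorted."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # A response that is not a JSON object has no keys to report.
    return sorted(data) if isinstance(data, dict) else []
```

Listing the keys first is a cheap way to find where the author, date, text and rating fields live before writing the extraction loop.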