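I have been collecting some information from a website, and part of that requires me to scrape the maximum page number so that my scraper knows when to stop (there are other ways to do this, but this is the direction I chose).
Recently they changed how the maximum page number is displayed: it is now a variable accessed via a POST request, so when I try to scrape it with my old code, it returns the string: Pagination.PagesCount.
My problem is that I am struggling to work out how to access this information.
See the code below to reproduce my issue: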
from scrapy import Selector
import requests
url = 'https://groceries.aldi.ie/en-GB/chilled-food/cheese?origin=dropdown&c1=shopgroceries&c2=chilled-food&c3=cheese&clickedon=cheese'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
'Accept-Language': 'en-GB,en;q=0.5',
'Referer': 'https://google.com',
'DNT': '1'}
html = requests.get(url, headers=headers).content
sel = Selector(text=html)
print(sel)
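I have accessed data from this website with POST requests in the past; for example, the code below returns a dictionary of product pricing data. However, after inspecting the page, capturing the XHR requests, and digging through a large number of page requests, I can't seem to find where exactly this Pagination.PagesCount data is being pulled from.
Maybe I'm just overlooking it completely, but I'd appreciate any help you can give me on this!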
import json
import requests
headers = {
"authority": "groceries.aldi.ie",
"pragma": "no-cache",
"cache-control": "no-cache",
"sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
"accept-language": "en-GB",
"sec-ch-ua-mobile": "?0",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
"websiteid": "a763fb4a-0224-4ca8-bdaa-a33a4b47a026",
"content-type": "application/json",
"accept": "application/json, text/javascript, */*; q=0.01",
"x-requested-with": "XMLHttpRequest",
"origin": "https://groceries.aldi.ie",
"sec-fetch-site": "same-origin",
"sec-fetch-mode": "cors",
"sec-fetch-dest": "empty",
"referer": "https://groceries.aldi.ie/en-GB/chilled-food/cheese?origin=dropdown&c1=shopgroceries&c2=chilled-food&c3=cheese&clickedon=cheese"
}
body = '{"products":["4088600284026","5391528370382","5391528372836","5391528372850","5391528372874","4088600298696","4088600103709","4088600388700","5035766046028","5000213021934","4088600012551","4088600325934","4088600300153","25389111","4072700001171","4088600012537","4088600012544","4088600013138","4088600013145","4088600103525","4088600103532","4088600103570","4088600103600","4088600135182","4088600141848","4088600142050","4088600158105","4088600217024","4088600217208","4088600241302","4088600249292","4088600249308","4088600280615","4088600281445","4088600283043","4088600284088","4088600295688","4088600295800","4088600295817","4088600303925"]}'
url = 'https://groceries.aldi.ie/api/product/calculatePrices'
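# Send the product IDs as a JSON body; the endpoint replies with pricing data.
response = requests.post(url=url, headers=headers, data=body)
response.raise_for_status()
# Parse the JSON as-is; swapping quote characters by hand can corrupt valid JSON.
d = json.loads(response.text)
print(d)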
1 Answer
You can get the total number of pages from the API endpoint:
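A minimal sketch of that lookup, under loud assumptions: the endpoint is assumed to return JSON with a nested Pagination.PagesCount field (the name is taken from the placeholder string in the question), and `SAMPLE_RESPONSE` and `get_pages_count` are hypothetical names used only for illustration. The real endpoint URL and response layout are not confirmed here.

```python
import json

# Hypothetical response body, for illustration only. The real endpoint and
# its JSON layout are assumptions; adjust the keys to what the API returns.
SAMPLE_RESPONSE = '{"Pagination": {"PagesCount": 3}, "Products": []}'


def get_pages_count(payload: dict) -> int:
    """Return the total page count from a parsed JSON payload.

    Assumes the count lives at payload["Pagination"]["PagesCount"].
    """
    try:
        return int(payload["Pagination"]["PagesCount"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("payload has no usable Pagination.PagesCount") from exc


pages_count = get_pages_count(json.loads(SAMPLE_RESPONSE))
print(pages_count)
```

In a live run, `payload` would be the parsed JSON of the API response (for example `response.json()` with requests) rather than the sample string.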
This should give you 3. Then use this value for the subsequent requests to fetch all the product information you need or want.
For example:
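One way the paging loop might look, sketched offline: `collect_products`, `fake_fetch_page`, and the "Products" key are hypothetical, and the stub stands in for the real per-page request, whose URL and response shape are not known here.

```python
from typing import Callable, Dict, List


def collect_products(fetch_page: Callable[[int], Dict], pages_count: int) -> List[Dict]:
    """Fetch pages 1..pages_count and return all products in one list.

    fetch_page is a hypothetical callable returning the parsed JSON for a page;
    the "Products" key is an assumption about the response layout.
    """
    products: List[Dict] = []
    for page in range(1, pages_count + 1):
        payload = fetch_page(page)
        products.extend(payload.get("Products", []))
    return products


def fake_fetch_page(page: int) -> Dict:
    """Offline stand-in for the real request; returns two made-up items per page."""
    return {"Products": [{"page": page, "name": f"item-{page}-{i}"} for i in range(2)]}


all_products = collect_products(fake_fetch_page, pages_count=3)
print(len(all_products))
```

The stub yields only a handful of made-up rows. With a real request function in place of `fake_fetch_page`, the resulting list of dictionaries could be passed to something like `pandas.DataFrame(all_products)` to view it as a table.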
You should get a table of 104 items, like this: