Scrapy: handling pagination

kuuvgm7e · posted 2023-04-21 in Other

I'm using Scrapy to collect data from habermeyer.de. While it's easy to iterate over the categories and products, I can't find the right way to handle the pagination. If we inspect the pagination mechanism in a web browser, we can see that every time we press the button to "view more items", we actually send a POST request with some form data and get back HTML containing the new products. Moreover, the required form data is injected into the button's data-search-params attribute, so it can easily be extracted and parsed as JSON.
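
For example, a minimal sketch of pulling that payload out of a category page response (the selector is an assumption based on the description above; adjust it to the actual markup):

import json

# Hypothetical selector: grab the JSON payload carried by the
# "view more items" button's data-search-params attribute.
params_json = response.css('[data-search-params]::attr(data-search-params)').get()
if params_json:
    search_params = json.loads(params_json)  # filters, page, hitsPerPage, ...
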
Suppose we have a category. To experiment, I copied the form data from Chrome's developer tools while manually interacting with the pagination, and pasted it into the script below, which I ran in the scrapy shell:

from scrapy.http import FormRequest

pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"
form_data = {
  'factFinderSearchParameters': {
    'filters': [
      {
        'name': 'CategoryPath',
        'substring': False,
        'values': [{'exclude': False, 'type': 'or', 'value': ['Rennbahnen, RC & Modellbau']}]
      }  
    ],
    'hitsPerPage': 24,
    'marketIds': ['400866330'],
    'page': 3,
    'query': '*'
  },
  'useAsn': '0'
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Origin": "https://www.habermeyer.de",
    "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}
r = FormRequest(pagination_api_url, formdata=form_data, headers=headers)
# fetch(r)

Note: to avoid TypeError: to_bytes must receive a str or bytes object, got int, I had to convert the value of useAsn to a str.
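
For reference, that constraint applies to every formdata value; a minimal standalone repro of the error, using nothing beyond Scrapy itself:

from scrapy.http import FormRequest

# Raises TypeError: to_bytes must receive a str or bytes object, got int,
# because formdata values are encoded to bytes and must already be strings.
FormRequest("https://example.com", formdata={"useAsn": 0})
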

Although fetching the form request returns HTTP 200, the returned HTML content indicates that the search produced no results.
As another experiment, I copied the already-encoded form data from Chrome's developer tools and passed it to a plain POST request (see the code below). This time I received the expected HTML output with the new products:

from scrapy import Request

encoded_form_data = "factFinderSearchParameters=%7B%22filters%22%3A%5B%7B%22name%22%3A%22CategoryPath%22%2C%22substring%22%3Afalse%2C%22values%22%3A%5B%7B%22exclude%22%3Afalse%2C%22type%22%3A%22or%22%2C%22value%22%3A%5B%22Rennbahnen%2C+RC+%26+Modellbau%22%5D%7D%5D%7D%5D%2C%22hitsPerPage%22%3A24%2C%22marketIds%22%3A%5B%22400866330%22%5D%2C%22page%22%3A3%2C%22query%22%3A%22*%22%7D&useAsn=0"
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
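
Decoding that working payload shows what the server actually expects: factFinderSearchParameters arrives as a single JSON string, not as nested form fields. A quick check, reusing encoded_form_data from the snippet above:

from urllib.parse import parse_qs
import json

decoded = parse_qs(encoded_form_data)
params = json.loads(decoded["factFinderSearchParameters"][0])
print(params["page"])     # 3
print(decoded["useAsn"])  # ['0']
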

URL-encoding the initial form data (the dict shown above) didn't help either, even though the request returns HTTP 200:

from urllib.parse import urlencode

encoded_form_data = urlencode(form_data)
r = Request(pagination_api_url, method="POST", body=encoded_form_data, headers=headers)
# fetch(r)
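
A plausible explanation for this failure (my reading, not confirmed against the site): urlencode stringifies the nested dict via Python's repr, so the server receives single-quoted, False-capitalized pseudo-JSON that it cannot parse:

from urllib.parse import urlencode, parse_qs

encoded = urlencode({"factFinderSearchParameters": {"page": 3, "substring": False}})
print(parse_qs(encoded)["factFinderSearchParameters"][0])
# {'page': 3, 'substring': False}  <- Python repr, not valid JSON
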

Python version: 3.10.6
Scrapy version: 2.8.0

sxpgvts3 · answer #1

This should do it:

from scrapy.crawler import CrawlerProcess
import scrapy
import json

class DemoSpider(scrapy.Spider):
    name = 'habermeyer'

    pagination_api_url = "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/search/navigationasn"

    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.habermeyer.de/spielwaren-habermeyer-ek-neuburgdonau/k/rennbahnen-rc-modellbau",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    }

    # Search parameters mirroring the payload captured in the browser;
    # booleans stay JSON booleans, matching the data-search-params attribute.
    querystring = {
        "filters": [
            {
                "name": "CategoryPath",
                "substring": False,
                "values": [{"exclude": False, "type": "or", "value": ["Rennbahnen, RC & Modellbau"]}],
            }
        ],
        "hitsPerPage": 24,
        "marketIds": ["400866330"],
        "page": 1,
        "query": "*",
    }

    def start_requests(self):
        # The key point: the nested parameters must be serialized to a JSON
        # string first, so that formdata only ever contains string values.
        yield scrapy.FormRequest(
            self.pagination_api_url,
            method="POST",
            formdata={
                "factFinderSearchParameters": json.dumps(self.querystring),
                "useAsn": "0",
            },
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        # An empty result page means we have paged past the last products.
        if not response.css(".searchResultInformation"):
            return

        for item in response.css(".searchResultInformation::text").getall():
            yield {"title": item.strip()}

        # Request the next page with the same endpoint and an incremented
        # page number in the JSON-serialized parameters.
        self.querystring["page"] += 1
        yield scrapy.FormRequest(
            self.pagination_api_url,
            method="POST",
            formdata={
                "factFinderSearchParameters": json.dumps(self.querystring),
                "useAsn": "0",
            },
            headers=self.headers,
            callback=self.parse,
            dont_filter=True,
        )

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(DemoSpider)
    process.start()
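
Because the script creates its own CrawlerProcess, it can be run directly with python instead of the scrapy CLI. Each page yields the product titles found in .searchResultInformation, and the crawl stops on the first page that comes back without any. The dont_filter=True on the follow-up request is a defensive measure to make sure the repeated POSTs to the same URL are never discarded by Scrapy's duplicate filter.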
