Need help scraping this page's content with Scrapy

qvsjd97n  posted on 2022-12-04  in  Other

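Can anyone tell me how to use Scrapy to scrape data (name and number) from this page? The data is loaded dynamically. If you check the Network tab, you will see a POST request to https://www.icab.es/rest/icab-api/collegiates. I copied it as cURL and sent the request through Postman, but I get an error. Can anyone help me? URL: https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales/?extraSearch=false&probono=false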

9jyewag0  #1

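**This is a pretty good question!** But next time, consider adding your code and formatting the post a bit better. See: How to ask
Solution:

You need to recreate the request. I inspected the requests with Burp Suite.
From it I took the headers for the URL in start_urls, plus the headers and body for json_url.
If you request json_url directly from start_requests, you get a 401 error, so we first visit the start_urls URL and only then request json_url.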

Full code:

import scrapy

class Temp(scrapy.Spider):
    name = "tempspider"

    allowed_domains = ['icab.es']
    start_urls = ['https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales']
    json_url = 'https://www.icab.es/rest/icab-api/collegiates'

    def start_requests(self):
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Origin": "https://www.icab.es",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "www.icab.es",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Sec-GPC": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        }

        yield scrapy.Request(url=self.start_urls[0], headers=headers, callback=self.parse)

    def parse(self, response):
        headers = {
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Pragma": "no-cache",
            "Sec-GPC": "1",
            'Accept': 'application/json',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9',
            'Content-Type': 'application/json',
            'Host': 'www.icab.es',
            'Sec-Ch-Ua': '"Chromium";v="91", " Not;A Brand";v="99"',
            'Sec-Ch-Ua-Mobile': '?0',
            'Origin': 'https://www.icab.es',
            'Referer': 'https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Dest': 'empty',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            "X-KL-Ajax-Request": "Ajax_Request",
        }
        body = '{"filters":{"keyword":"","name":"","surname":"","street":"","postalCode":"","collegiateNumber":"","dedication":"","language":"","paginationFirst":"1","paginationLast":"25","paginationOrder":"surname","paginationOrderAscDesc":"ASC"}}'

        yield scrapy.Request(url=self.json_url, headers=headers, body=body, method='POST', callback=self.parse_json)

    def parse_json(self, response):
        json_response = response.json()
        members = json_response['members']

        for member in members:
            yield {
                'randomPosition': member['randomPosition'],
                'collegiateNumber': member['collegiateNumber'],
                'surname': member['surname'],
                'name': member['name'],
                'gender': member['gender'],
            }

Output:

{'randomPosition': '27661107', 'collegiateNumber': '35080', 'surname': 'Abad Bamala', 'name': 'Ana', 'gender': 'M'}
{'randomPosition': '98668217', 'collegiateNumber': '14890', 'surname': 'Abad Calvo', 'name': 'Encarnacion', 'gender': 'M'}
{'randomPosition': '53180188', 'collegiateNumber': '29746', 'surname': 'Abad de Brocá', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '41073111', 'collegiateNumber': '31865', 'surname': 'Abad Esteve', 'name': 'Joan Domènec', 'gender': 'H'}
{'randomPosition': '63371735', 'collegiateNumber': '29647', 'surname': 'Abad Fernández', 'name': 'Dolors', 'gender': 'M'}
{'randomPosition': '30290704', 'collegiateNumber': '45016', 'surname': 'Abad Hernández', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '57510617', 'collegiateNumber': '16083', 'surname': 'Abad Mariné', 'name': 'Jose Antonio', 'gender': 'H'}
................
................
................
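
The request body in the spider above hard-codes `paginationFirst` and `paginationLast`, so it returns only the first 25 records. A minimal sketch of how further pages might be requested is shown below. It assumes the API treats the two values as 1-based, inclusive indices, which the answer does not confirm; `build_body` and `page_windows` are hypothetical helper names, not part of Scrapy or of the site's API.

```python
import json

# Filter keys copied from the request body used in the spider above.
EMPTY_FILTERS = {
    "keyword": "", "name": "", "surname": "", "street": "", "postalCode": "",
    "collegiateNumber": "", "dedication": "", "language": "",
}


def build_body(first, last, order="surname", direction="ASC"):
    """Serialize the POST body for one window of results (hypothetical helper)."""
    filters = dict(EMPTY_FILTERS)
    filters.update({
        # The original body sends these numbers as strings, so we do the same.
        "paginationFirst": str(first),
        "paginationLast": str(last),
        "paginationOrder": order,
        "paginationOrderAscDesc": direction,
    })
    return json.dumps({"filters": filters})


def page_windows(total, page_size=25):
    """Yield (first, last) pairs covering 1..total.

    Assumption: paginationLast is an inclusive index, not a page size.
    """
    for first in range(1, total + 1, page_size):
        yield first, min(first + page_size - 1, total)
```

Inside `parse`, one could then loop over `page_windows(total)` and yield one POST request per window with `body=build_body(first, last)`. Here `total` is however many records you intend to fetch; the output shown above does not reveal whether the response reports a total count.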
