How do I make a Scrapy POST request with a token in the payload?

Posted by nzkunb0c on 2022-11-09 in Other

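I'm trying to scrape all 22 jobs on this webpage, and then scrape more jobs from other companies that host their listings on the same system.
I can get the first 10 jobs on the page, but the rest have to be loaded 10 at a time by clicking the "Show more" button. The URL doesn't change when you do this; the only change I can see is a token added to the payload of the POST request.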
Image of Request Payload in Network tool
I've tried following the answers to this stackexchange question and this one, but I still can't get it to work.
Here is my current code:

def start_requests(self):
    url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
    headers = {'authority': 'https://apply.workable.com'}
    payload = {
      "token":"WzE2NjI2ODE2MDAwMDAsMjY0NTU4N10=",
      "query":"",
      "location":[],
      "department":[],
      "worktype":[],
      "remote":[]}
    yield scrapy.Request(url = url,
                          method='POST',
                          headers = headers,
                          body = json.dumps(payload),
                          callback = self.parse)

def parse(self, response):
    data = json.loads(response.body)
    print(data)

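This gives me the first 10 jobs, but no more. If I remove the payload entirely, I get exactly the same results.
Any ideas?
(I'm very new to programming and this is my first question here, so I apologize if I've missed something obvious, but I've been trying for hours. Thanks!)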

busg9geu #1

You need to take the nextPage value from the JSON response and send it as the token in the payload of the request for the next page.

from json import dumps
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    API_url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
    custom_settings = {'DOWNLOAD_DELAY': 0.6}
    payload = {
        "department": [],
        "location": [],
        "query": "",
        "remote": [],
        "worktype": []
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "Host": "apply.workable.com",
        "Origin": "https://apply.workable.com",
        "Pragma": "no-cache",
        "Referer": "https://apply.workable.com/so-energy/",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-GPC": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(self.payload), method="POST")

    def parse(self, response):
        # jobs
        data = response.json()
        for job in data['results']:
            yield {'job_details': job}

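        # next page: send the token from this response with the next request
        next_token = data.get('nextPage')
        if next_token:
            # build a fresh payload instead of mutating the class-level dict
            payload = {**self.payload, 'token': next_token}
            yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(payload), method="POST")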
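
The token hand-off can also be tried without Scrapy or network access. Below is a minimal sketch that assumes the token is an opaque cursor issued by the server; `fetch_page`, `encode_token`, `JOBS` and the single-integer cursor layout are hypothetical stand-ins, not the real Workable API. Only the `results` and `nextPage` keys come from the answer above.

```python
import base64
import json

JOBS = [f"job-{i}" for i in range(22)]  # 22 fake jobs, matching the count in the question
PAGE_SIZE = 10


def encode_token(offset):
    """Pack a cursor into base64-encoded JSON (assumed format, for illustration only)."""
    return base64.b64encode(json.dumps([offset]).encode()).decode()


def decode_token(token):
    """Unpack a base64-encoded JSON token into a Python list."""
    return json.loads(base64.b64decode(token))


def fetch_page(payload):
    """Hypothetical stand-in for the POST request; returns the parsed JSON reply."""
    offset = decode_token(payload["token"])[0] if "token" in payload else 0
    page = {"results": JOBS[offset:offset + PAGE_SIZE]}
    if offset + PAGE_SIZE < len(JOBS):
        # Only pages that have a successor carry a nextPage token.
        page["nextPage"] = encode_token(offset + PAGE_SIZE)
    return page


def fetch_all(base_payload):
    """Follow nextPage tokens until the server stops sending one."""
    jobs = []
    payload = dict(base_payload)
    while True:
        data = fetch_page(payload)
        jobs.extend(data["results"])
        token = data.get("nextPage")
        if not token:
            return jobs
        # Each request carries the token taken from the previous response.
        payload = {**base_payload, "token": token}


all_jobs = fetch_all({"query": "", "location": [], "department": [],
                      "worktype": [], "remote": []})
print(len(all_jobs))
print(decode_token("WzE2NjI2ODE2MDAwMDAsMjY0NTU4N10="))
```

The last line decodes the token from the question's payload, which appears to be a base64-encoded JSON array of two integers. That would be consistent with a server-issued cursor and would explain why each token has to come from the previous response instead of being hard-coded.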
