Web scraping an ASPX site to find PDFs

iq0todco  posted on 2023-05-08  in .NET

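This is my first attempt at scraping a site built on ASPX, and I want to practise finding and extracting PDFs (so that I can also practise parsing them).
However, the content I get back when I request the page does not match what I see when I open the page manually in a browser.
Here is my code, which I have been working on through several versions and pieced together from what I have read online: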

import requests

REQUEST_FORM_DATA_BOUNDARY = "REQUEST_FORM_DATA_BOUNDARY"
FORM_DATA_STARTING_PAYLOAD = '--{0}\r\nContent-Disposition: form-data; name="'.format(REQUEST_FORM_DATA_BOUNDARY)
FORM_DATA_MIDDLE_PAYLOAD = '\"\r\n\r\n'
FORM_DATA_ENDING_PAYLOAD = '--{0}--'.format(REQUEST_FORM_DATA_BOUNDARY)
REQUEST_CUSTOM_HEADER = {
'authority': 'investor.fastenal.com',
'path': '/Services/PressReleaseService.svc/GetPressReleaseList',
'Accept':'application/json, text/javascript, */*; q=0.01',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
'Content-Type': 'application/json; charset=UTF-8',
'Accept-Encoding': 'gzip,deflate,br',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

    
def generate_form_data_payload(kwargs):
    payload = ''
    for key, value in kwargs.items():
        payload += '{0}{1}{2}{3}\r\n'.format(FORM_DATA_STARTING_PAYLOAD, key, FORM_DATA_MIDDLE_PAYLOAD, value)
    payload += FORM_DATA_ENDING_PAYLOAD
    return payload

API_URL = "https://investor.fastenal.com/news-releases/default.aspx"

kwargs ={
  "pressReleaseBodyType": 0,
  "pressReleaseSelection": 3,
  "pressReleaseCategoryWorkflowId": "1cb807d2-208f-4bc3-9133-6a9ad45ac3b0",
  "excludeSelection": 1,
  "year": 2022
}

# generate payload to be sent as form-data
request_data = generate_form_data_payload(kwargs)
response = requests.post(API_URL, headers=REQUEST_CUSTOM_HEADER, data=request_data)

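If you look at response.content you will find "some" PDFs, but not the ones I am hoping to find and extract. Those do not appear anywhere in the content.
As you can see, I tried adding parameters, setting the year based on what I found in the request payload.
I am not tied to this package; I have also tried a version with BeautifulSoup.
So any pointers or suggestions would be much appreciated.
Thanks, everyone!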

osh3o9ms  answer #1

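I kept at it, and it looks like the problem was that I was using the URL of the page rather than the URL of the service behind it. I found this by looking at the requests the page sends and by testing.
The final working code is: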

import requests
import json

url = "https://investor.fastenal.com/Services/PressReleaseService.svc/GetPressReleaseList"
payload = {
    "serviceDto": {
        "ViewType": "2",
        "ViewDate": "",
        "RevisionNumber": "1",
        "LanguageId": "1",
        "Signature": "",
        "ItemCount": -1,
        "StartIndex": 0,
        "TagList": ["sales"],
        "IncludeTags": "true"
    },
    "pressReleaseBodyType": 0,
    "pressReleaseSelection": 3,
    "pressReleaseCategoryWorkflowId": "1cb807d2-208f-4bc3-9133-6a9ad45ac3b0",
    "excludeSelection": 1,
    "year": 2023
}
REQUEST_CUSTOM_HEADER = {
    'authority': 'investor.fastenal.com',
    'path': '/Services/PressReleaseService.svc/GetPressReleaseList',
    'accept':'application/json, text/javascript, */*; q=0.01',
    'referer': 'https://investor.fastenal.com/news-releases/default.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'x-requested-with': 'XMLHttpRequest'
}
# Send only the payload as the JSON body (the headers do not belong in it);
# json= serializes the dict and sets Content-Type: application/json
r = requests.post(url, json=payload, headers=REQUEST_CUSTOM_HEADER)
r.raise_for_status()
r.json()
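
Once the service returns JSON, the remaining step from the question is picking the PDF links out of it. The structure of that JSON is not shown here, so the following is only a rough sketch that avoids relying on any field names: it walks the decoded response recursively and keeps every string whose path ends in ".pdf". The helper name find_pdf_urls, the BASE_URL constant, and the sample data are all made up for illustration; the real response may be shaped differently, and links embedded inside HTML body text would not be caught by this approach.

```python
from urllib.parse import urljoin, urlparse

# Assumption: relative links in the response are relative to the site root.
BASE_URL = "https://investor.fastenal.com"


def find_pdf_urls(node, base_url=BASE_URL):
    """Recursively collect every string value that looks like a PDF link."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_pdf_urls(value, base_url))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_pdf_urls(item, base_url))
    elif isinstance(node, str):
        # Check the path component only, so a query string such as "?v=2" is ignored.
        if urlparse(node).path.lower().endswith(".pdf"):
            found.append(urljoin(base_url, node))
    return found


# Hypothetical sample shaped like a nested JSON response; real field names may differ.
sample = {
    "Items": [
        {"Headline": "Example", "Doc": {"Url": "/files/doc_news/2023/report.PDF?v=2"}},
        {"Headline": "No attachment", "Doc": None},
        {"Headline": "Absolute", "Link": "https://example.com/a.pdf"},
    ]
}
pdf_urls = find_pdf_urls(sample)
print(pdf_urls)
```

With the response from the code above, find_pdf_urls(r.json()) would give the list of candidate links, and each one could then be fetched with requests.get(link).content and written to a file for parsing.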
