Selenium: How do I scrape and extract data from a JSON file?

sshcrbum, posted on 2023-02-04 in Other

I am trying to extract all the data for every school listed on the following website:
https://schulfinder.kultus-bw.de/
My code is:

import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from requests import get
from selenium.webdriver.common.by import By
import json

url = "https://schulfinder.kultus-bw.de/api/school?uuid=81af189c-7bc0-44a3-8c9f-73e6d6e50fdb&_=1675072758525"

payload = {}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)

The output is:

{
  "outpost_number": "0",
  "name": "Gartenschule Grundschule Ebnat",
  "street": "Abt-Angehrn-Str.",
  "house_number": "5",
  "postcode": "73432",
  "city": "Aalen",
  "phone": "+49736796700",
  "fax": "+497367967016",
  "email": "poststelle@04125313.schule.bwl.de",
  "website": null,
  "tablet_tranche": null,
  "tablet_platform": null,
  "tablet_branches": null,
  "tablet_trades": null,
  "lat": 48.80094,
  "lng": 10.18761,
  "official": 0,
  "branches": [
    {
      "branch_id": 12110,
      "acronym": "GS",
      "description_long": "Grundschule"
    }
  ],
  "trades": []
}

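I found the request in the Network tab of the Chrome inspector and sent it to the URL with Postman, which generated the code above. My problem is that I only get the information for a single school, and I don't know how to request all of the schools.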

Answer #1 (2izufjch)

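In addition to the answer already given:
To find the search parameters accepted by the API's GET request, you can parse the home page with BeautifulSoup, which you have already imported: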

from bs4 import BeautifulSoup
import requests

search_page_url = "https://schulfinder.kultus-bw.de"
page_contents = requests.request("GET", search_page_url).text

parsed_html = BeautifulSoup(page_contents, features="html.parser")
input_elements = parsed_html.body.find_all('input')
search_params = list(map(lambda x: (x.get('name'), x.get('type'), x.get('value')), input_elements))

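search_params is a list of (name, type, value) tuples, one per input element. It should give you some insight into the available parameters and their possible values.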
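The tuples are easier to read once they are grouped by parameter name. The following is a minimal sketch of that step; the helper name `group_search_params` and the sample tuples are hypothetical and only mimic the shape of search_params, they are not taken from the real page.

```python
def group_search_params(search_params):
    """Map each input name to the list of distinct values seen for it."""
    grouped = {}
    for name, _input_type, value in search_params:
        if not name:
            # Inputs without a name attribute cannot become query parameters.
            continue
        values = grouped.setdefault(name, [])
        if value is not None and value not in values:
            values.append(value)
    return grouped


# Hypothetical sample in the same (name, type, value) shape as search_params.
sample_params = [
    ("term", "text", None),
    ("types", "checkbox", "1"),
    ("types", "checkbox", "2"),
    (None, "submit", None),
]

grouped_params = group_search_params(sample_params)
print(grouped_params)
```

A name that maps to an empty list is most likely a free-text field, while a name with several values is a candidate filter.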

Answer #2 (xyhw6mcr)

Just use the correct endpoint:

https://schulfinder.kultus-bw.de/api/schools?distance=1&outposts=1&owner=&school_kind=&term=&types=&work_schedule=&_=1675079497084

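This returns a list of schools. Each entry contains a uuid, which you can use to request the full record from the endpoint in your question (https://schulfinder.kultus-bw.de/api/school?...).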

[{"uuid":"50de01a4-503d-44d1-af4b-a6031a022b85","outpost_number":"0","name":"Grundschule Aach","city":"Aach","lat":47.84399,"lng":8.85067,"official":0,"marker_class":"marker green","marker_label":"G","website":null},{"uuid":"8818037f-9aed-4860-b42e-8a49b1403c02","outpost_number":"0","name":"Braunenbergschule Grundschule Wasseralfingen","city":"Aalen","lat":48.8612,"lng":10.11191,"official":0,"marker_class":"marker green","marker_label":"G","website":null},...]
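The query string of the list endpoint can also be assembled from a dict instead of being written by hand, which makes it easier to vary one filter at a time. This is only a sketch: the helper `build_schools_url` is hypothetical, the parameter names are copied from the URL above, and using `term` as a free-text search with the value "Aalen" is an assumption. The `_` parameter looks like a cache-busting timestamp and is left out here on the assumption that the server does not require it.

```python
from urllib.parse import urlencode

BASE_URL = "https://schulfinder.kultus-bw.de/api/schools"


def build_schools_url(**overrides):
    """Build the list-endpoint URL, starting from the defaults seen above."""
    params = {
        "distance": "1",
        "outposts": "1",
        "owner": "",
        "school_kind": "",
        "term": "",
        "types": "",
        "work_schedule": "",
    }
    params.update(overrides)  # replace any default with a caller-supplied value
    return f"{BASE_URL}?{urlencode(params)}"


# Assumption: "term" is a free-text search field.
example_url = build_schools_url(term="Aalen")
print(example_url)
```

With requests, the same dict can be passed directly as `requests.get(BASE_URL, params=params)`.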
  • Note that the result is limited to 500 entries. To get all schools you have to run several filtered searches and merge the results:

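When a search matches too many schools, the API answers with a German-language message, roughly: more than 500 hits were found, please refine your search.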
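One way to work around the 500-result limit is to run one search per filter value and merge the responses by uuid, so that a school matched by several searches is kept only once. The sketch below assumes nothing about which filter to use: `fetch_all_schools` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a real HTTP request so the merging logic can be tried offline.

```python
def fetch_all_schools(filter_values, fetch_page):
    """Run one search per filter value and merge the results by uuid."""
    schools_by_uuid = {}
    for value in filter_values:
        for school in fetch_page(value):
            # Keep the first copy of each school; later duplicates are ignored.
            schools_by_uuid.setdefault(school["uuid"], school)
    return list(schools_by_uuid.values())


def fake_fetch_page(value):
    """Stand-in for a real request; returns made-up, overlapping results."""
    pages = {
        "a": [{"uuid": "u1", "name": "School 1"}, {"uuid": "u2", "name": "School 2"}],
        "b": [{"uuid": "u2", "name": "School 2"}, {"uuid": "u3", "name": "School 3"}],
    }
    return pages[value]


merged = fetch_all_schools(["a", "b"], fake_fetch_page)
print([school["uuid"] for school in merged])
```

In real use, `fetch_page` would call the list endpoint with one filter parameter set to `value` and return the parsed JSON. Which parameter splits the schools into groups of fewer than 500 is something you would have to find out by experiment.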

Example
import requests

url = 'https://schulfinder.kultus-bw.de/api/schools?distance=1&outposts=1&owner=&school_kind=&term=&types=&work_schedule=&_=1675079497084'

data = []

# Fetch the list once, then request the full record for each school by uuid.
for uuid in [item['uuid'] for item in requests.get(url).json()]:
    detail_url = f'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}&_=1675072758525'
    data.append(
        requests.get(detail_url).json()
    )

data
Output
[{'outpost_number': '0', 'name': 'Grundschule Aach', 'street': 'Schulstr.', 'house_number': '5', 'postcode': '78267', 'city': 'Aach', 'phone': '+4977741442', 'fax': None, 'email': 'poststelle@04146900.schule.bwl.de', 'website': None, 'tablet_tranche': None, 'tablet_platform': None, 'tablet_branches': None, 'tablet_trades': None, 'lat': 47.84399, 'lng': 8.85067, 'official': 0, 'branches': [{'branch_id': 12110, 'acronym': 'GS', 'description_long': 'Grundschule'}], 'trades': []}, {'outpost_number': '0', 'name': 'Braunenbergschule Grundschule Wasseralfingen', 'street': 'Steinstr.', 'house_number': '38', 'postcode': '73433', 'city': 'Aalen', 'phone': '+49736197700', 'fax': '+497361977019', 'email': 'poststelle@04125362.schule.bwl.de', 'website': 'http://www.braunenbergschule.de', 'tablet_tranche': None, 'tablet_platform': None, 'tablet_branches': None, 'tablet_trades': None, 'lat': 48.8612, 'lng': 10.11191, 'official': 0, 'branches': [{'branch_id': 12110, 'acronym': 'GS', 'description_long': 'Grundschule'}], 'trades': []},...]
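Once the records are collected, they can be flattened into rows for further processing. The sketch below writes a few of the fields shown in the output to CSV text and joins the nested branches list into one column. The helper names, the choice of columns, and the `;` separator are assumptions made for illustration; the sample record is the first entry of the output above, with the email field omitted.

```python
import csv
import io

# Columns to export; "branches" is derived from the nested list.
FIELDS = ["name", "street", "house_number", "postcode", "city", "phone", "website", "branches"]


def school_to_row(school):
    """Flatten one school record into a dict with one value per column."""
    row = {field: school.get(field) for field in FIELDS if field != "branches"}
    # "branches" may be missing or null, so fall back to an empty list.
    row["branches"] = ";".join(b["acronym"] for b in (school.get("branches") or []))
    return row


def schools_to_csv(schools):
    """Return the schools as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(school_to_row(school) for school in schools)
    return buffer.getvalue()


sample = [{
    "name": "Grundschule Aach", "street": "Schulstr.", "house_number": "5",
    "postcode": "78267", "city": "Aach", "phone": "+4977741442", "website": None,
    "branches": [{"branch_id": 12110, "acronym": "GS", "description_long": "Grundschule"}],
}]

csv_text = schools_to_csv(sample)
print(csv_text)
```

Passing the `data` list from the example to `schools_to_csv` would work the same way; null values such as a missing website end up as empty cells.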
