How to scrape a table from a restricted website

Posted by axzmvihb on 2022-10-23 in Other


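I am trying to scrape data from a particular website. I later realized that the site does not allow scraping. After a lot of Googling I was able to get around that restriction, but I still cannot get the results I need from the page. My goal is to convert the first four pages of the table into a CSV file. With the code below I can get the table headers, but not the table data. I would like guidance on how to achieve this.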

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.timeshighereducation.com/world-university-rankings/2023/world-ranking#!/page/0/length/100/sort_by/rank/sort_order/asc/cols/stats"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

data = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(data.text,'html.parser')
table = soup.find_all("th", attrs={"class": ["stats","name","rank"]})

header = []
for i in table:
  header.append(i.text.strip())
df = pd.DataFrame(columns=header)
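
A likely reason the code above returns headers but no rows: everything after `#` in the URL is a fragment, which HTTP clients such as requests do not send to the server. The paging and sorting options in `#!/page/0/length/100/...` are therefore presumably handled by JavaScript in the browser, and the rows would not be present in the HTML that requests receives. The small sketch below only splits the URL to show which part actually reaches the server; the variable names are illustrative.

```python
from urllib.parse import urldefrag

page_url = (
    "https://www.timeshighereducation.com/world-university-rankings/2023/"
    "world-ranking#!/page/0/length/100/sort_by/rank/sort_order/asc/cols/stats"
)

# Split the URL into the part sent to the server and the client-side fragment.
requested_url, fragment = urldefrag(page_url)

print("sent to server:", requested_url)
print("kept in browser:", fragment)
```

Changing `page/0` to `page/1` in the fragment would therefore not change what requests downloads.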

pandas

Source: https://stackoverflow.com/questions/74163073/how-to-scrape-table-from-restricted-website

1 Answer


hgb9j2n6 #1

This site lets you fetch the data in JSON format, so the requests library is sufficient. Here is one possible solution:

import requests
import pandas as pd

url = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2023_0__83be12210294c582db8740ee29673120.json"

headers = {
    "user-agent": "Mozilla/5.0"
}

# get a list of universities

response = requests.get(url=url, headers=headers)
data = response.json()['data']

# list of unused keys

key_list = [
    "scores_citations", "scores_citations_rank", "cta_button", "apply_link", "url",
    "scores_industry_income", "scores_industry_income_rank", "subjects_offered",
    "aliases", "rank_order", "scores_overall", "scores_overall_rank", "scores_teaching",
    "closed", "unaccredited", "disabled", "record_type", "scores_teaching_rank",
    "scores_research", "scores_research_rank", "scores_international_outlook", 
    "scores_international_outlook_rank", "member_level", "nid"
    ]

# old and new column names

columns = {
    "rank": "Rank", "name": "Name", "location": "Country", 
    "stats_number_students": "No. of FTE Students", 
    "stats_student_staff_ratio": "No. of students per staff", 
    "stats_pc_intl_students": "International Students", "stats_female_male_ratio": "Female:Male Ratio"
    }

tdf = []

# take the first 400 universities (four pages of 100 rows each)
for university in data[:400]:
    # remove unused keys
    for key in key_list:
        university.pop(key, None)
    # create DataFrame and change its orientation
    df = pd.DataFrame(list(university.items())).set_index(0).T
    tdf.append(df)

# reset index and rename columns

df = pd.concat(tdf).reset_index(drop=True).rename(columns=columns)

# save to csv

df.to_csv("university_rankings_2023.csv")
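
As an alternative to building one DataFrame per university, the list of dicts could be passed to pandas in a single call and the wanted columns selected by name. This is a minimal sketch, not the answerer's method: `records_to_frame` is a hypothetical helper, and the two sample records are stand-ins shaped like `response.json()['data']`, with values taken from the output shown in this answer and a dummy `nid`.

```python
import pandas as pd

# Same mapping as in the answer: JSON key -> CSV column name.
columns = {
    "rank": "Rank", "name": "Name", "location": "Country",
    "stats_number_students": "No. of FTE Students",
    "stats_student_staff_ratio": "No. of students per staff",
    "stats_pc_intl_students": "International Students",
    "stats_female_male_ratio": "Female:Male Ratio",
}

def records_to_frame(records, limit=400):
    """Build the table in one step from a list of dicts (hypothetical helper)."""
    df = pd.DataFrame(records[:limit])
    # Keep only the wanted keys, in the order given by `columns`, then rename.
    return df[list(columns)].rename(columns=columns)

# Stand-in records; the "nid" values are dummies, not real data.
sample = [
    {"rank": "1", "name": "University of Oxford", "location": "United Kingdom",
     "stats_number_students": "20,967", "stats_student_staff_ratio": "10.6",
     "stats_pc_intl_students": "42%", "stats_female_male_ratio": "48 : 52",
     "nid": 0},
    {"rank": "2", "name": "Harvard University", "location": "United States",
     "stats_number_students": "21,887", "stats_student_staff_ratio": "9.6",
     "stats_pc_intl_students": "25%", "stats_female_male_ratio": "50 : 50",
     "nid": 0},
]

df = records_to_frame(sample)
print(df)
```

Selecting with `df[list(columns)]` raises a `KeyError` if an expected key is absent from every record, which can serve as an early warning that the JSON layout has changed. The resulting table should have the same seven columns as the loop version, whose output follows.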

Output DataFrame:

0       Rank                                   Name         Country No. of FTE Students No. of students per staff International Students Female:Male Ratio
0          1                   University of Oxford  United Kingdom              20,967                      10.6                    42%           48 : 52
1          2                     Harvard University   United States              21,887                       9.6                    25%           50 : 50
2         =3                University of Cambridge  United Kingdom              20,185                      11.3                    39%           47 : 53
3         =3                    Stanford University   United States              16,164                       7.1                    24%           46 : 54
4          5  Massachusetts Institute of Technology   United States              11,415                       8.2                    33%           40 : 60
..       ...                                    ...             ...                 ...                       ...                    ...               ...
395  351–400                    University of Vaasa         Finland               3,873                      20.0                     4%           53 : 47
396  351–400                      Verona University           Italy              18,621                      23.8                     4%           64 : 36
397  351–400                 Wake Forest University   United States               8,122                       4.0                    10%           54 : 46
398  351–400            Washington State University   United States              29,463                      19.5                     7%           54 : 46
399  351–400             Wroclaw Medical University          Poland               6,769                       9.4                    14%           71 : 29

and the file university_rankings_2023.csv
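
The values in the output above appear to be formatted strings ("20,967", "42%"), so they would need converting before any numeric work. The sketch below shows one way to do that, assuming the columns really are strings in that format; the `clean` frame and its column names are illustrative, and the two rows are copied from the output above.

```python
import pandas as pd

# Stand-in rows copied from the output table (Oxford and Harvard).
df = pd.DataFrame({
    "No. of FTE Students": ["20,967", "21,887"],
    "No. of students per staff": ["10.6", "9.6"],
    "International Students": ["42%", "25%"],
})

clean = pd.DataFrame({
    # "20,967" -> 20967: drop the thousands separator, then parse.
    "students": pd.to_numeric(
        df["No. of FTE Students"].str.replace(",", "", regex=False),
        errors="coerce",
    ),
    # "10.6" -> 10.6
    "students_per_staff": pd.to_numeric(
        df["No. of students per staff"], errors="coerce"
    ),
    # "42%" -> 0.42: strip the percent sign and scale to a fraction.
    "intl_share": pd.to_numeric(
        df["International Students"].str.rstrip("%"), errors="coerce"
    ) / 100,
})

print(clean)
```

With `errors="coerce"`, any value that cannot be parsed becomes NaN instead of raising. The Rank column is best left as text, since it contains entries such as "=3" and "351–400".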

Answered on 2022-10-23
