I am trying to scrape data from a particular website. I later realized that the site does not allow scraping, and after a lot of Googling I managed to work around that restriction. However, I cannot get the results I need from the page. My goal is to convert the first four pages of the table into a CSV file. I am able to get the table headers (with the code attached below), but not the table data. I would like some guidance on how to achieve this.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.timeshighereducation.com/world-university-rankings/2023/world-ranking#!/page/0/length/100/sort_by/rank/sort_order/asc/cols/stats"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

data = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(data.text, 'html.parser')

# Collect the header cells of the rankings table and use them as column names.
table = soup.find_all("th", attrs={"class": ["stats", "name", "rank"]})
header = []
for i in table:
    header.append(i.text.strip())

df = pd.DataFrame(columns=header)
1 Answer
This site lets you fetch the data in JSON format, so the requests library alone is enough; there is no need to parse the HTML table. A possible solution is sketched below: it builds the output DataFrame and writes it to university_rankings_2023.csv.
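A minimal sketch of that JSON approach, assuming the rankings page loads its table data from a JSON endpoint that you can copy from the browser's developer tools (Network tab) while the page loads. The JSON_URL value below is a placeholder, and the "data" key and 100-rows-per-page figure are assumptions taken from the question's page URL, not confirmed details of the site:

import requests
import pandas as pd

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "application/json",
}

# Placeholder: replace with the JSON request URL observed in the browser's
# Network tab while the rankings page is loading.
JSON_URL = "https://www.timeshighereducation.com/.../world_university_rankings_2023.json"

resp = requests.get(JSON_URL, headers=HEADERS)
resp.raise_for_status()
payload = resp.json()

# Assumption: the payload keeps one record per university under a "data" key;
# inspect the actual response and adjust the key if it differs.
rows = payload["data"]
df = pd.DataFrame(rows)

# The table shows 100 rows per page (see length/100 in the page URL),
# so the first four pages correspond to the first 400 rows.
df_first_four_pages = df.head(400)

df_first_four_pages.to_csv("university_rankings_2023.csv", index=False)
print(df_first_four_pages.head())

The column names in the resulting CSV depend on the field names in the JSON payload; after inspecting df.columns, select or rename columns if you only need the rank, name, and score fields.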