Scraping a JSON table in R using XPath

mqkwyuun  posted on 2023-03-27  in  Other

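I am trying to use R, specifically the rvest package, to scrape the table at https://www.topuniversities.com/university-rankings/university-subject-rankings/2023/arts-humanities?&tab=indicators. The table is dynamic. From what I have read, I can use this package together with XPath to scrape it. I got the XPath using Chrome's developer tools.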

library(rvest)

webpage <- read_html("https://www.topuniversities.com/university-rankings/university-subject-rankings/2023/arts-humanities?&page=34&tab=indicators")

links <- html_nodes(webpage, xpath = "/html/body/div[1]/div/div/div[1]/div[2]/main/section/div/section/section/div/div/article/div/div/div[3]/div/div[1]/div/section/div[4]/div")

But this does not work. Another approach I tried:

# include the installed library rvest
library(rvest)

# call the url
url <- "https://www.topuniversities.com/university-rankings/university-subject-rankings/2023/arts-humanities?&tab=indicators"

# get the data
page <- read_html(url)

# filter the required data using xpath
rows <- html_nodes(page, xpath = "/html/body/div[1]/div/div/div[1]/div[2]/main/section/div/section/section/div/div/article/div/div/div[3]/div/div[1]/div/section/div[4]/div") %>% 
  html_text()

# print
rows

Thanks for your help.

hc8w905p  #1

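You seem to be aware that the table content comes from JSON. That usually means it is fetched by JavaScript and is not included in the page source, so rvest cannot extract the table content with the same selectors and XPath expressions that work in a JavaScript-enabled browser.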
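A minimal offline sketch of this behaviour, assuming a hypothetical page skeleton (the `ranking-data` id and the markup are invented for illustration, not taken from the site): the container that JavaScript would later fill is empty in the source, so the XPath query matches nothing.

```r
library(rvest)

# Hypothetical stand-in for the page source as rvest sees it:
# the container is empty because the rows are added later by JavaScript.
static_source <- paste0(
  "<html><body>",
  "<div id='ranking-data'></div>",
  "<script>/* would fetch JSON and build the rows in a browser */</script>",
  "</body></html>"
)

page <- read_html(static_source)

# Look for any child elements of the (hypothetical) table container
rows <- html_elements(page, xpath = "//div[@id='ranking-data']/*")

# No rows are found, because rvest does not execute the script
length(rows)
```

The same thing happens with the long absolute XPath copied from Chrome: it describes the DOM after the scripts have run, not the HTML that `read_html()` downloads.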
Rearranging the JSON into something that resembles the table rendered on the page can be done with something like this:

library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

"https://www.topuniversities.com/rankings/endpoint?nid=3846211&page=0&items_per_page=15&tab=indicators&region=&countries=&cities=&search=&star=&sort_by=&order_by=" %>% 
  fromJSON() %>% 
  pluck("score_nodes") %>% 
  # fromJSON() transforms scores into list of 5x4 data.frames
  # selecting relevant columns and pivoting wider turns those into 1x5 data.frames,
  # single row per parent data.frame row, `scores` will remain a nested column for now
  mutate(scores = map(scores, 
                      \(x) select(x, indicator_name, score) %>% 
                        pivot_wider(names_from = indicator_name, values_from = score))) %>% 
  select(title, region, country, city, overall_score, rank_display, scores) %>% 
  # extract all 5 score columns
  unnest(scores)

Result:

#> # A tibble: 15 × 11
#>    title   region country city  overall_score rank_display `Academic Reputation`
#>    <chr>   <chr>  <chr>   <chr> <chr>         <chr>        <chr>                
#>  1 "Harva… North… United… Camb… 98.2          1            100                  
#>  2 "Unive… Europe United… Camb… 96.9          2            99.2                 
#>  3 "Unive… Europe United… Oxfo… 96.8          3            99.4                 
#>  4 "Stanf… North… United… Stan… 91.6          4            93.7                 
#>  5 "Unive… North… United… Berk… 91.5          5            95.4                 
#>  6 "Yale … North… United… New … 88.9          6            94.5                 
#>  7 "Colum… North… United… New … 88.6          =7           92.5                 
#>  8 "New Y… North… United… New … 88.6          =7           93.8                 
#>  9 "Unive… North… United… Los … 88.3          9            92.5                 
#> 10 "The U… Europe United… Edin… 87.8          10           92.2                 
#> 11 "UCL"   Europe United… Lond… 87            11           90.3                 
#> 12 "Massa… North… United… Camb… 86.6          12           86.3                 
#> 13 "Princ… North… United… Prin… 86            13           92.4                 
#> 14 "Unive… North… United… Chic… 85.9          14           93.3                 
#> 15 "Unive… North… Canada  Toro… 85.2          15           89                   
#> # ℹ 4 more variables: `Employer Reputation` <chr>, `Citations per Paper` <chr>,
#> #   `H-index Citations` <chr>, `International Research Network` <chr>

Created on 2023-03-23 with reprex v2.0.2
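All columns in the result are character (`<chr>`), including the scores, and tied ranks carry a leading `=`. A small sketch of one way to convert them, using two rows copied from the output above and only three of the columns; the new column names `tied` and `rank` are made up for this example. It assumes every value is either a plain number or a rank with a leading `=`; any other form in the live data would become `NA` with a warning.

```r
library(dplyr)

# Two rows (7 and 11) copied from the printed result, reduced to three columns
ranking <- tibble(
  rank_display = c("=7", "11"),
  overall_score = c("88.6", "87"),
  `Academic Reputation` = c("92.5", "90.3")
)

ranking_numeric <- ranking %>%
  mutate(
    # flag ties before dropping the leading "="
    tied = startsWith(rank_display, "="),
    rank = as.integer(sub("^=", "", rank_display)),
    # convert the score columns from character to numeric
    across(c(overall_score, `Academic Reputation`), as.numeric)
  )

ranking_numeric
```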
Note that the API call returns only the first 15 records; feel free to adjust the items_per_page parameter in the URL.
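One way to make that adjustment without editing the string by hand is a small URL builder. `ranking_endpoint()` is a hypothetical helper written for this sketch, not part of any package; it only reproduces the query string used above. Whether the server accepts larger page sizes, and how `page` is numbered beyond the `page=0` seen in the original URL, is an assumption to verify.

```r
# Hypothetical helper: rebuilds the endpoint URL used above with
# configurable paging parameters. Defaults reproduce the original URL.
ranking_endpoint <- function(nid = 3846211, page = 0, items_per_page = 15) {
  sprintf(
    paste0(
      "https://www.topuniversities.com/rankings/endpoint",
      "?nid=%d&page=%d&items_per_page=%d&tab=indicators",
      "&region=&countries=&cities=&search=&star=&sort_by=&order_by="
    ),
    as.integer(nid), as.integer(page), as.integer(items_per_page)
  )
}

# Request 100 records instead of 15 (assumes the server honours this value)
url_100 <- ranking_endpoint(items_per_page = 100)
url_100
```

The resulting string can be piped into `fromJSON()` in place of the hard-coded URL in the pipeline above.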
