How can I scrape information from multiple URLs on a dynamic website?

but5z9lq · asked on 2023-04-09

I am trying to scrape data from a United Nations website to do some network analysis. Each entity/partner will be a node, and each SDG will be a link. To build the .csv I need, I first have to collect all of the project URLs to crawl, but so far I have not succeeded. After that, I need code that visits each project page and extracts its entity and the associated SDGs.
So far I have tried the following code:

library(xml2)
library(rvest)
library(tidyverse)

httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"))

PARTNERSHIPS <- read_html("https://sdgs.un.org/partnerships/browse?page=0")

PROJECTNAMES <- PARTNERSHIPS %>% html_elements("span a")

PROJECT_URLS <- PARTNERSHIPS %>% 
  html_elements("span a") %>%
  html_attr("href")

PROJECT_NAMES <- PARTNERSHIPS %>%
  html_elements("span a") %>%
  html_text2()

PARTNERSHIP_ANALYSIS <- data.frame(PROJECT_NAMES,PROJECT_URLS)

This gives me a list of only the 18 projects shown on the first page, as in the screenshot:
[screenshot: list of URLs]
I tried using "while" and "for" loops to collect the links from all 423 pages, but I never got more than 10.
What can I do to get all the links, and then all the information from each URL?

ntjbwcob1#

You can use:

library(xml2)
library(rvest)
library(tidyverse)

httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"))

# loop over the listing pages; use 0:422 for all 423 pages instead of just the first two
for (i in 0:1) {
  url <- paste("https://sdgs.un.org/partnerships/browse?page=",i, sep = "")
  PARTNERSHIPS <- read_html(url)
  
  PROJECTNAMES <- PARTNERSHIPS %>% html_elements("span a")
  
  PROJECT_URLS <- PARTNERSHIPS %>% 
    html_elements("span a") %>%
    html_attr("href")
  
  PROJECT_NAMES <- PARTNERSHIPS %>%
    html_elements("span a") %>%
    html_text2()
  
  # Each project occupies two "div.field-content" blocks; the odd-numbered
  # ones contain the goals. Collapse each project's goal labels into one
  # comma-separated string.
  n_projects <- length(PARTNERSHIPS %>% html_elements("div.field-content")) / 2
  GOALS <- sapply(seq_len(n_projects), function(j) {
    PARTNERSHIPS %>%
      html_elements("div.field-content") %>%
      .[[2 * j - 1]] %>%
      html_elements("div.layout.layout--onecol") %>%
      html_elements("div.layout__region.layout__region--content") %>%
      html_elements("div.goals-lists") %>%
      html_elements("a") %>%
      html_text2() %>%
      paste(collapse = ",")
  })
  
  if (i == 0) PARTNERSHIP_ANALYSIS <- data.frame(PROJECT_NAMES,PROJECT_URLS,GOALS)
  else PARTNERSHIP_ANALYSIS <- rbind.data.frame(PARTNERSHIP_ANALYSIS,
                                                data.frame(PROJECT_NAMES,PROJECT_URLS,GOALS))
}

nrow(PARTNERSHIP_ANALYSIS)

#> [1] 36

Here I only read the first 2 pages of the site; each page lists 18 projects, so 2 pages give 36 observations in the data frame.
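For the second part of the question (getting the details from each project page), you can then loop over the collected URLs and read each page in turn. A minimal sketch, with two assumptions you should verify in the browser's inspector: the listing `href`s are site-relative (drop the `paste0` if they are already absolute), and the detail pages expose their goals in the same `div.goals-lists a` elements as the listing:

```r
# Sketch: visit each project page collected above and pull its SDGs.
# Assumes PARTNERSHIP_ANALYSIS from the loop above; the detail-page
# selector "div.goals-lists a" is an assumption to verify, not confirmed.
DETAILS <- lapply(PARTNERSHIP_ANALYSIS$PROJECT_URLS, function(u) {
  page <- read_html(paste0("https://sdgs.un.org", u))  # hrefs are relative
  Sys.sleep(1)  # be polite: at most one request per second
  data.frame(
    PROJECT_URLS = u,
    SDGS = page %>%
      html_elements("div.goals-lists a") %>%
      html_text2() %>%
      paste(collapse = ",")
  )
})
DETAILS <- do.call(rbind.data.frame, DETAILS)
PARTNERSHIP_ANALYSIS <- merge(PARTNERSHIP_ANALYSIS, DETAILS,
                              by = "PROJECT_URLS")
```

With 423 pages and thousands of project pages, you may also want to wrap each `read_html()` in `tryCatch()` so one failed request does not abort the whole run.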
