Web scraping numbers in R

bejyjqdl, posted on 2023-07-31 in Other

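In R, I am trying to scrape all of the working paper numbers (e.g., 31424, 31481, etc.) from the following web page:
https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date
I tried running the code below to get them: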

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
page <- read_html(url)
name <- page %>% html_nodes(".paper-card__paper_number") %>% html_text()

However, this code returns character(0) and does not give me the working paper numbers. Is there a way to modify this code to get them?

jaql4c8m  #1

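To scrape dynamically generated content, you can use a browser automation tool such as RSelenium, which lets you control a real web browser programmatically. Here is how to modify the code to do that:
1. First, make sure RSelenium and rvest are installed: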

install.packages("RSelenium")
install.packages("rvest")

2. Load the required libraries:

library(RSelenium)
library(rvest)


3. Start the Selenium server and open a browser:

driver <- rsDriver(browser="chrome", chromever="latest", port=4567L)
remDr <- driver[["client"]]


4. Navigate to the desired URL:

url <- "https://www.nber.org/papers?facet=topics%3AFinancial%20Economics&page=1&perPage=50&sortBy=public_date"
remDr$navigate(url)


5. Get the working paper numbers:

page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)
name <- page %>% html_nodes(".paper-card__paper_number") %>% html_text()


6. Close the browser and stop the Selenium server:

remDr$close()
driver$server$stop()
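
One caveat with step 5, offered as a sketch rather than as part of the original answer: `getPageSource()` returns whatever the browser has rendered at that moment, so if the paper cards have not loaded yet, the selector can still come back empty. A small polling helper can be called between steps 4 and 5. The name `wait_for` and its `timeout` and `interval` arguments are hypothetical; the commented usage assumes the `remDr` client from step 3 and RSelenium's `findElements()` method.

```r
# Hypothetical helper: poll `condition` until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_for <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Usage with the objects from the steps above (needs a live browser, not run here):
# loaded <- wait_for(function() {
#   length(remDr$findElements(using = "css selector",
#                             value = ".paper-card__paper_number")) > 0
# })
# if (!loaded) warning("paper cards did not appear before the timeout")
```

If `loaded` is TRUE, step 5 can proceed unchanged.
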
jhdbpxl9  #2

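An alternative to Selenium is to query NBER's RESTful API, which returns fairly simple JSON containing a data.frame-like object with not only the working paper numbers but also other useful information such as authors, title, date, and so on. Accessing the API is much faster than using Selenium because the server sends far less data to the client.
The API supports pagination and returns at most 100 results per query. You can find the API's URL by inspecting the network traffic of your web browser session.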

library(dplyr)
library(jsonlite)
    
url_to_json <- "https://www.nber.org/api/v1/working_page_listing/contentType/working_paper/_/_/search?facet=topics%3AFinancial%20Economics&page=1&perPage=100&sortBy=public_date"
json_p01    <- fromJSON(txt = url_to_json) 

df_p01 <- as_tibble(json_p01$results) |> 
          mutate(wp_id = sub(pattern = "^.*papers[/]w", replacement = "", url))

df_p01 |> select(displaydate, title, wp_id, abstract)
# A tibble: 100 × 4
   displaydate title                                      wp_id abstract
   <chr>       <chr>                                      <chr> <chr>   
 1 July 2023   The Impact of Money in Politics on Labor … 31481 We exam…
 2 July 2023   Aggregate Lending and Modern Financial In… 31484 Existin…
 3 July 2023   Financial Machine Learning                 31502 We surv…
 4 July 2023   Housing, Household Debt, and the Business… 31489 China a…
 5 July 2023   Selection-Neglect in the NFT Bubble        31498 Using t…
 6 July 2023   Social Security Claiming Intentions: Psyc… 31499 For man…
 7 July 2023   Sparse Modeling Under Grouped Heterogenei… 31424 Sparse …
 8 July 2023   Firms with Benefits? Nonwage Compensation… 31463 Using a…
 9 July 2023   The Credit Supply Channel of Monetary Pol… 31464 This pa…
10 July 2023   Bank Branch Density and Bank Runs          31462 Bank br…
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows
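
The `wp_id` column above comes from stripping everything up to and including `papers/w` from each `url` value. A minimal illustration, assuming the `url` field looks like `/papers/w31481` (the sample strings below are made up for the example and are not API output):

```r
# Assumed shape of the `url` field; these two strings are illustrative only.
sample_urls <- c("/papers/w31481", "/papers/w31424")

# Same regular expression as in the mutate() call: drop the prefix, keep the digits.
wp_ids <- sub(pattern = "^.*papers[/]w", replacement = "", sample_urls)
print(wp_ids)  # "31481" "31424"
```

The result is a character vector; wrap it in `as.integer()` if numeric IDs are needed.
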

If you need the second page, paginate by modifying the URL.

library(urltools)
url_to_json_page_2 <- urltools::param_set(urls = url_to_json,  key = "page", value = 2)
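
To collect several pages at once, the same idea can be wrapped in a loop. This is a sketch under assumptions: `build_page_url()` and `fetch_pages()` are hypothetical helper names, the URL template is copied from the query above, and the code assumes every page's `results` has compatible columns so that `bind_rows()` can combine them. It requires the dplyr and jsonlite packages to be installed.

```r
# Hypothetical helper: build the API URL for a given page number.
# "%%" is sprintf's escape for a literal "%" in the encoded query string.
build_page_url <- function(page, per_page = 100) {
  sprintf(
    paste0(
      "https://www.nber.org/api/v1/working_page_listing/contentType/",
      "working_paper/_/_/search?facet=topics%%3AFinancial%%20Economics",
      "&page=%d&perPage=%d&sortBy=public_date"
    ),
    as.integer(page), as.integer(per_page)
  )
}

# Hypothetical helper: download the requested pages and stack them into one tibble.
fetch_pages <- function(pages, pause = 1) {
  results <- lapply(pages, function(p) {
    Sys.sleep(pause)  # pause between requests to go easy on the server
    dplyr::as_tibble(jsonlite::fromJSON(txt = build_page_url(p))$results)
  })
  dplyr::bind_rows(results)
}

# Usage (performs network requests, so not run here):
# df_all <- fetch_pages(1:3)
# df_all$wp_id <- sub(pattern = "^.*papers[/]w", replacement = "", df_all$url)
```

Building the URL with `sprintf()` avoids the extra urltools dependency; `urltools::param_set()` as shown above works equally well for changing a single parameter.
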
