R: find corresponding authors from subpages

Posted by xyhw6mcr on 2023-11-14 in Other

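I have been working toward a step-by-step solution for finding the corresponding authors in collection_html_sub_pages.
I inspected the site and saw that the author is marked up as <a id="corresp-c1" href="mailto:[email protected]">FName LName</a>.
I built the following code. It works like this: it starts from the index page and extracts the href of each article. It should then use html_node to find that tag within each article. Using lapply and html_text, I should then be able to extract all the corresponding authors, usually one per article. However, I am not even getting the tag, and I do not know where the error in the code is.
Both correspondence_authors and t1 come back empty. Any suggestions on how to improve the code to get the desired result would be welcome.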

library(httr)  # will be used to make HTTP GET and POST requests
library(rvest) # will be used to parse HTML
library(xml2)
library(tidyr) #will be used to remove NA
library(tidyverse)
article_year <- function(year){
  
}
str_1 <- "https://molecularbrain.biomedcentral.com/articles"
prefix_str_1 <- "https://molecularbrain.biomedcentral.com/"
doc <- httr::GET(str_1)
html <- read_html(content(doc, "text"))
#################### Title ####################
c_listing_title <- html_elements(html,"h3.c-listing__title")
a_element <- html_node(c_listing_title,"a")
a_href <- as.list(html_attr(a_element,"href"))
a_text <- lapply(a_element,html_text)

##################### 2 Page Depth #######################
merge_strings <- function(x){
  paste0(prefix_str_1,x)
}
sub_pages <- lapply(a_href,merge_strings)

########################Function_Read_Sub_Pages#####################

read_page_1 <- function(x){
  webpages <- httr::GET(x)
  html <- rvest::read_html(httr::content(webpages, "text"))
  return(html)
}

collection_html_sub_pages <- lapply(sub_pages,read_page_1)

##########################Correspondence_Author###################

correspondence_search <- function(x){
  rvest::html_node(x,"a#corresp-c1")
}
collection_html_sub_pages[[1]]
t1 <- rvest::html_element(collection_html_sub_pages[[1]],paste0('#corresp-c1'))
t2 <- rvest::html_elements(t1,"p")
correspondence_authors <- lapply(collection_html_sub_pages, correspondence_search)

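I have used helper functions to structure my code and will keep doing so, both to keep it organized and to make troubleshooting easier. I have tried the code above, and everything works except the part that retrieves the corresponding authors.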
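
One way to troubleshoot the failing step in isolation is to run the selector against a small hand-written HTML snippet instead of the live site. The sketch below is only an illustration: the two snippets, the example.org address, and the variable names are assumptions, not content from the journal pages. It shows what rvest returns when the anchor exists and when it does not.

```r
library(rvest)

# Hand-written stand-ins for an article page (assumed markup, placeholder address).
page_with_author <- read_html(
  '<p><a id="corresp-c1" href="mailto:someone@example.org">FName LName</a></p>'
)
page_without_author <- read_html("<p>Page not found</p>")

found     <- html_element(page_with_author, "a#corresp-c1")
not_found <- html_element(page_without_author, "a#corresp-c1")

html_text(found)           # text of the matched anchor
html_attr(found, "href")   # its mailto: link
html_text(not_found)       # NA, because html_element() gave a missing node

# html_elements() (plural) returns an empty node set instead of a missing node.
length(html_elements(page_without_author, "a#corresp-c1"))
```

If the selector works on the snippet but not on the downloaded pages, the problem most likely lies in what was downloaded rather than in the selector.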

Source: https://stackoverflow.com/questions/77404013/r-find-corresponding-authors-from-subpages

1 Answer

csga3l581#

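The article URLs you build are not valid paths on that web server. When you paste0() prefix_str_1 and a_href, the first ends with / and the second starts with /, so the resulting URL looks like https://molecularbrain.biomedcentral.com//articles/10.1186/s13041-023-01014-0; the correct URL is https://molecularbrain.biomedcentral.com/articles/10.1186/s13041-023-01014-0 (no double / before articles).
The simplest fix is to define prefix_str_1 without the trailing /.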

prefix_str_1 <- "https://molecularbrain.biomedcentral.com"
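
To see the slash problem without touching the network, the two strings can be pasted together directly. This is a minimal sketch: join_url() and has_double_slash() are hypothetical helpers that are not part of the original code, and the href value is assumed to have the form the site returns (starting with /articles/).

```r
# Hypothetical helper: join a base URL and a path with exactly one slash
# between them, regardless of how either side is written.
join_url <- function(base, path) {
  paste0(sub("/+$", "", base), "/", sub("^/+", "", path))
}

# Hypothetical helper: TRUE if a double slash appears after the scheme.
has_double_slash <- function(url) {
  grepl("//", sub("^https?://", "", url), fixed = TRUE)
}

prefix_with_slash <- "https://molecularbrain.biomedcentral.com/"
href <- "/articles/10.1186/s13041-023-01014-0"  # assumed form of a_href

naive_url  <- paste0(prefix_with_slash, href)   # what the question's code builds
joined_url <- join_url(prefix_with_slash, href)

has_double_slash(naive_url)
has_double_slash(joined_url)
```

A helper like this keeps the URL valid whether or not the prefix is later edited to include a trailing slash.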

You can also simplify the code considerably.

library(rvest) 

base_url <- "https://molecularbrain.biomedcentral.com"

index_html <- read_html(file.path(base_url, "articles"))

# Title and Links ---------------------------------------------------------

a_elements <- html_elements(index_html, "h3.c-listing__title a")
a_href <- html_attr(a_elements, "href")
a_text <- html_text(a_elements)

# subpages ----------------------------------------------------------------

html_sub_pages <- 
  lapply(paste0(base_url, a_href),
       read_html)

# Correspondence Author ---------------------------------------------------

lapply(html_sub_pages,
       html_elements,
       "#corresp-c1") |> 
  lapply(html_text)
#> [[1]]
#> [1] "Chao Qin"
#> 
#> [[2]]
#> [1] "Won Do Heo"
#> 
#> [[3]]
#> [1] "Seung-Jae Lee"
#> ...
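
If a table is more convenient than a list, the names can be lined up with the article titles. The sketch below is one possible way to do that, not part of the original answer; it replaces html_sub_pages and a_text with small offline stand-ins, and the titles, names, and addresses are invented placeholders.

```r
library(rvest)

# Offline stand-ins for html_sub_pages and a_text (placeholder content).
html_sub_pages <- list(
  read_html('<p><a id="corresp-c1" href="mailto:a@example.org">Author One</a></p>'),
  read_html("<p>An article page without a corresponding-author link</p>")
)
a_text <- c("Article A", "Article B")

# html_element() (singular) returns exactly one result per page, so
# html_text() yields one string or NA per page; vapply() enforces that shape.
corresponding_author <- vapply(
  html_sub_pages,
  function(page) html_text(html_element(page, "#corresp-c1")),
  character(1)
)

authors_df <- data.frame(
  title = a_text,
  corresponding_author = corresponding_author
)
authors_df
```

Using html_element() here keeps one row per article even when a page has no matching anchor. It only captures the first match, so html_elements() remains the better choice when all corresponding authors on a page are needed.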


Answered 2023-11-14
