Web scraping with rvest: getting a table and span text

wi3ka0sx  posted on 2023-06-19  in  Other

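I am looking to get the table at this link (https://clinicaltrials.gov/ct2/history/NCT04658186) along with the hover text that appears on some of its rows.

My desired result is a data frame in which each hover text sits on the same row as it does on the web page. With the code below I can get the table and the span text separately, but I can't figure out how to merge them.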

library(dplyr)
library(rvest)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

# Use html_table() to extract the table data
table_data <- page %>%
  html_table(fill = TRUE) %>%
  .[[1]] # Select the first table on the page

# Use html_nodes() and html_attr() to extract the title (hover text) of every span on the page
span_text <- page %>% html_nodes("span") %>% 
  html_attr("title") %>% data.frame()

Thanks in advance for any help.


7xzttuei1#

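In this case we can loop through a list of elements (i.e. the table rows) and extract certain bits from each item. With this approach we end up with a correctly aligned list or vector that can be bound to the previously extracted table: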

library(dplyr)
library(rvest)
library(purrr)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

table_data <- page %>%
  # selecting the target table first to get a single table from html_table()
  html_element("table") %>% 
  html_table(fill = TRUE)

# select all table rows and cycle through them with map_chr();
# map_chr() returns a character vector of the same length as the
# input list (the number of <tr> elements), with NA where a row has no match
recr_stat <- page %>% html_elements("tbody tr") %>% 
  map_chr(\(tr) html_element(tr, "span.recruitmentStatus") %>% html_attr("title"))

# bind to table:
bind_cols(table_data, `Recruitment Status` = recr_stat) %>% 
  relocate(`Recruitment Status`, .before = Changes)
#> # A tibble: 58 × 6
#>    Version A     B     `Submitted Date` `Recruitment Status`             Changes
#>      <int> <lgl> <lgl> <chr>            <chr>                            <chr>  
#>  1       1 NA    NA    December 1, 2020 <NA>                             None (…
#>  2       2 NA    NA    January 12, 2021 Not yet recruiting --> Recruiti… Recrui…
#>  3       3 NA    NA    January 29, 2021 <NA>                             Contac…
#>  4       4 NA    NA    February 4, 2021 <NA>                             Study …
#>  5       5 NA    NA    March 4, 2021    <NA>                             Study …
#>  6       6 NA    NA    March 18, 2021   <NA>                             Contac…
#>  7       7 NA    NA    April 15, 2021   <NA>                             Study …
#>  8       8 NA    NA    May 14, 2021     <NA>                             Study …
#>  9       9 NA    NA    May 27, 2021     <NA>                             Contac…
#> 10      10 NA    NA    June 10, 2021    <NA>                             Study …
#> # ℹ 48 more rows
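
The alignment above depends on calling html_element() (singular) on each row. The sketch below illustrates the difference on a small inline page built with rvest's minimal_html(), so it runs without network access. The markup is a made-up stand-in, not the actual clinicaltrials.gov HTML; only the span.recruitmentStatus selector is taken from the code above.

```r
library(rvest)

# Hypothetical stand-in for the real page: three rows, only row 2 has a titled span
html <- minimal_html('
  <table>
    <thead><tr><th>Version</th><th>Changes</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus" title="Not yet recruiting --> Recruiting">Recruitment Status</span></td></tr>
      <tr><td>3</td><td>Contacts/Locations</td></tr>
    </tbody>
  </table>')

# html_elements() on the whole page returns only the spans that exist,
# so the information about which row they came from is lost
unaligned <- html %>%
  html_elements("span.recruitmentStatus") %>%
  html_attr("title")

# html_element() applied to the set of rows returns one result per row,
# with NA for rows that have no matching span
aligned <- html %>%
  html_elements("tbody tr") %>%
  html_element("span.recruitmentStatus") %>%
  html_attr("title")

length(unaligned)
length(aligned)
```

Because aligned has exactly one entry per <tr>, it can be added as a column to the result of html_table() without any further matching.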

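For a more robust approach, we can skip html_table() and extract all the required details from each element (here: tr) ourselves. This also works for table-less designs, where tabular data is rendered through lists or divs.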

results <- page %>% html_elements("tbody tr") %>% 
  map(\(tr) list(
    version  = html_element(tr, "td[headers='VersionNumber']") %>% html_text(),
    date     = html_element(tr, "td[headers='VersionDate']") %>% html_text(),
    recrstat = html_element(tr, "td[headers='Changes'] span.recruitmentStatus") %>% html_attr("title"),
    changes  = html_element(tr, "td[headers='Changes']") %>% html_text()
    )) %>% 
  bind_rows()

results %>% 
  mutate(version = as.integer(version),
         date = lubridate::mdy(date))
#> # A tibble: 58 × 4
#>    version date       recrstat                          changes                 
#>      <int> <date>     <chr>                             <chr>                   
#>  1       1 2020-12-01 <NA>                              None (earliest Version …
#>  2       2 2021-01-12 Not yet recruiting --> Recruiting Recruitment Status, Stu…
#>  3       3 2021-01-29 <NA>                              Contacts/Locations and …
#>  4       4 2021-02-04 <NA>                              Study Status and Contac…
#>  5       5 2021-03-04 <NA>                              Study Status and Contac…
#>  6       6 2021-03-18 <NA>                              Contacts/Locations and …
#>  7       7 2021-04-15 <NA>                              Study Status and Contac…
#>  8       8 2021-05-14 <NA>                              Study Status and Contac…
#>  9       9 2021-05-27 <NA>                              Contacts/Locations and …
#> 10      10 2021-06-10 <NA>                              Study Status and Contac…
#> # ℹ 48 more rows

Created on 2023-06-15 with reprex v2.0.2
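
As a rough illustration of the table-less case, here is the same per-record extraction applied to div-based markup. The HTML, the class names (release, ver, date, status) and the column names are all hypothetical and chosen only for this sketch; minimal_html() is used so the example runs offline.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical table-less markup: every record is a <div class="release">
html <- minimal_html('
  <div class="release">
    <span class="ver">1</span>
    <span class="date">December 1, 2020</span>
  </div>
  <div class="release">
    <span class="ver">2</span>
    <span class="date">January 12, 2021</span>
    <span class="status" title="Not yet recruiting --> Recruiting">Recruitment Status</span>
  </div>')

# One list per record; a missing element yields NA instead of dropping the record
results <- html %>%
  html_elements("div.release") %>%
  map(\(rec) list(
    version = html_element(rec, "span.ver") %>% html_text(),
    date    = html_element(rec, "span.date") %>% html_text(),
    status  = html_element(rec, "span.status") %>% html_attr("title")
  )) %>%
  bind_rows() %>%
  mutate(version = as.integer(version))

results
```

Only the selectors change compared with the table version; the record-by-record structure stays the same, which is what keeps the columns aligned.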


zsohkypk2#

library(tidyverse)
library(rvest)

page <- "https://clinicaltrials.gov/ct2/history/NCT04658186" %>%
  read_html()

page %>% 
  html_table() %>%
  pluck(1) %>% 
  mutate(status = page %>%
           html_elements(".w3-bordered.releases") %>%
           pluck(1) %>%
           html_elements("tbody tr") %>%
           map_chr(.,
                   ~ .x %>%
                     html_element(".recruitmentStatus") %>%
                     html_attr("title")))

# A tibble: 58 × 6
   Version A     B     `Submitted Date` Changes                                                 status                           
     <int> <lgl> <lgl> <chr>            <chr>                                                   <chr>                            
 1       1 NA    NA    December 1, 2020 None (earliest Version on record)                       NA                               
 2       2 NA    NA    January 12, 2021 Recruitment Status, Study Status and Contacts/Locations Not yet recruiting --> Recruiting
 3       3 NA    NA    January 29, 2021 Contacts/Locations and Study Status                     NA                               
 4       4 NA    NA    February 4, 2021 Study Status and Contacts/Locations                     NA                               
 5       5 NA    NA    March 4, 2021    Study Status and Contacts/Locations                     NA                               
 6       6 NA    NA    March 18, 2021   Contacts/Locations and Study Status                     NA                               
 7       7 NA    NA    April 15, 2021   Study Status and Contacts/Locations                     NA                               
 8       8 NA    NA    May 14, 2021     Study Status and Contacts/Locations                     NA                               
 9       9 NA    NA    May 27, 2021     Contacts/Locations and Study Status                     NA                               
10      10 NA    NA    June 10, 2021    Study Status and Contacts/Locations                     NA                               
# ℹ 48 more rows
# ℹ Use `print(n = ...)` to see more rows
