Can I scrape a single list that spans multiple pages when web scraping with R?

Posted by hts6caw3 on 2022-12-06 in Other


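I am trying to scrape a table of dog temperament statistics from this website: https://atts.org/breed-statistics/statistics-page1/
However, the table spans 8 pages in total, so there are 8 unique URLs.
So far, for page 1 of the table, I have written the following code: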

library(rvest)  # provides read_html(), html_nodes() and html_text()

url <- "https://atts.org/breed-statistics/statistics-page1/"

webpage <- read_html(url)

bn_data_html <- html_nodes(webpage, "td:nth-child(1)")
bn_data <- html_text(bn_data_html)

nt_data_html <- html_nodes(webpage, "td:nth-child(2)")
nt_data <- html_text(nt_data_html)

passed_data_html <- html_nodes(webpage, "td:nth-child(3)")
passed_data <- html_text(passed_data_html)

failed_data_html <- html_nodes(webpage, "td:nth-child(4)")
failed_data <- html_text(failed_data_html)

percent_data_html <- html_nodes(webpage, "td:nth-child(5)")
percent_data <- html_text(percent_data_html)

breeds <- data.frame(Breed = bn_data, Number_tested = nt_data, Passed = passed_data, Failed = failed_data, Percent = percent_data)

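This works well for scraping the data from the first page. However, the only way I can think of to scrape the whole table is to replace the original URL and rerun the code block once for each page, eight times in total. Is there a way to do this without rerunning it eight times? Suppose the table spanned 100 pages, so that rerunning the code that many times was simply not feasible.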

r

Source: https://stackoverflow.com/questions/74620291/can-i-scrape-a-single-list-that-spans-across-multiple-pages-when-webscraping-wit

1 Answer


mnemlml81#

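This is how you could get the dogs into a data frame, scraping pages 1 to 8. Note the use of html_table(), which reads a whole table at once instead of one column at a time.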

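library(tidyverse)
library(rvest)
# The janitor package must also be installed; it is called via janitor:: below.

# Scrape one page of the table, given its page number
get_dogs <- function(page) {
  str_c("https://atts.org/breed-statistics/statistics-page", page) %>%
    read_html() %>%
    html_table() %>%              # list of all tables on the page
    getElement(1) %>%             # keep the first table
    janitor::row_to_names(1) %>%  # promote the first row to column names
    janitor::clean_names()        # convert the names to snake_case
}

# Run get_dogs() for pages 1 to 8 and row-bind the results
dogs_df <- map_dfr(1:8, get_dogs)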

# A tibble: 250 x 5
   breed_name                 tested passed failed percent
   <chr>                      <chr>  <chr>  <chr>  <chr>  
 1 AFGHAN HOUND               165    120    45     72.7%  
 2 AIREDALE TERRIER           110    86     24     78.2%  
 3 AKBASH DOG                 16     14     2      87.5%  
 4 AKITA                      598    465    133    77.8%  
 5 ALAPAHA BLUE BLOOD BULLDOG 12     9      3      75.0%  
 6 ALASKAN KLEE KAI           2      1      1      50.0%  
 7 ALASKAN MALAMUTE           244    207    37     84.8%  
 8 AMERICAN BANDAGGE          1      1      0      100.0% 
 9 AMERICAN BULLDOG           214    186    28     86.9%  
10 AMERICAN ESKIMO            86     71     15     82.6%  
# ... with 240 more rows
# i Use `print(n = ...)` to see more rows
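
Every column in the output above is character (`<chr>`), so the counts and percentages would need converting before any arithmetic. The following is a minimal base R sketch of one way to do that, not part of the original answer. The helper name `convert_dog_columns` is hypothetical, the sample rows are copied from the printed output, and it assumes the count columns contain plain digits with no thousands separators.

```r
# Sample rows copied from the printed output above; every column is character.
dogs_sample <- data.frame(
  breed_name = c("AFGHAN HOUND", "AIREDALE TERRIER", "AKBASH DOG"),
  tested     = c("165", "110", "16"),
  passed     = c("120", "86", "14"),
  failed     = c("45", "24", "2"),
  percent    = c("72.7%", "78.2%", "87.5%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an illustration, not from the original answer):
# converts the count columns to integer and the percent column to numeric.
convert_dog_columns <- function(df) {
  count_cols <- c("tested", "passed", "failed")
  # Assumes plain digit strings; values such as "1,234" would become NA.
  df[count_cols] <- lapply(df[count_cols], as.integer)
  # Drop the literal "%" sign, then parse the remainder as a number.
  df$percent <- as.numeric(sub("%", "", df$percent, fixed = TRUE))
  df
}

dogs_clean <- convert_dog_columns(dogs_sample)
str(dogs_clean)
```

The same helper should also work on the full result, for example `convert_dog_columns(dogs_df)`, provided the column names match those shown in the output.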

Answered on 2022-12-06
