When web scraping with R, can I scrape a single table that spans multiple pages?

hts6caw3, posted 2022-12-06 in Other

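I am trying to scrape the table of dog temperament statistics from this website: https://atts.org/breed-statistics/statistics-page1/
However, the table spans 8 pages in total, so there are 8 unique URLs.
So far, I have written the following code for page 1 of the table: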

library(rvest)

url <- "https://atts.org/breed-statistics/statistics-page1/"

webpage <- read_html(url)

bn_data_html <- html_nodes(webpage, "td:nth-child(1)")
bn_data <- html_text(bn_data_html)

nt_data_html <- html_nodes(webpage, "td:nth-child(2)")
nt_data <- html_text(nt_data_html)

passed_data_html <- html_nodes(webpage, "td:nth-child(3)")
passed_data <- html_text(passed_data_html)

failed_data_html <- html_nodes(webpage, "td:nth-child(4)")
failed_data <- html_text(failed_data_html)

percent_data_html <- html_nodes(webpage, "td:nth-child(5)")
percent_data <- html_text(percent_data_html)

breeds <- data.frame(Breed = bn_data, Number_tested = nt_data, Passed = passed_data, Failed = failed_data, Percent = percent_data)

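This works well for scraping the first page. However, the only way I can think of to scrape the whole table is to replace the original URL and rerun the code block eight times, once per page. Is there a way to do this without rerunning it eight times? If the table spanned 100 pages, rerunning the code that many times would not be feasible.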

mnemlml8 (answer 1)

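Here is how to get the dogs into a data frame by scraping pages 1 to 8. Note the use of html_table(), which parses a whole table at once instead of selecting each column separately.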

library(tidyverse)
library(rvest)

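# Scrape one page of the breed statistics table and return it as a tibble
get_dogs <- function(page) {
  str_c("https://atts.org/breed-statistics/statistics-page", page) %>%
    read_html() %>%
    html_table() %>%              # parse every <table> on the page into a list
    getElement(1) %>%             # keep the first table
    janitor::row_to_names(1) %>%  # promote the first row to column names
    janitor::clean_names()        # convert the names to snake_case
}

# Apply get_dogs() to pages 1 to 8 and row-bind the results
dogs_df <- map_dfr(1:8, get_dogs)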

# A tibble: 250 x 5
   breed_name                 tested passed failed percent
   <chr>                      <chr>  <chr>  <chr>  <chr>  
 1 AFGHAN HOUND               165    120    45     72.7%  
 2 AIREDALE TERRIER           110    86     24     78.2%  
 3 AKBASH DOG                 16     14     2      87.5%  
 4 AKITA                      598    465    133    77.8%  
 5 ALAPAHA BLUE BLOOD BULLDOG 12     9      3      75.0%  
 6 ALASKAN KLEE KAI           2      1      1      50.0%  
 7 ALASKAN MALAMUTE           244    207    37     84.8%  
 8 AMERICAN BANDAGGE          1      1      0      100.0% 
 9 AMERICAN BULLDOG           214    186    28     86.9%  
10 AMERICAN ESKIMO            86     71     15     82.6%  
# ... with 240 more rows
# i Use `print(n = ...)` to see more rows
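
The printed output shows every column as <chr>, so the counts and percentages would probably need converting before any arithmetic. The following is a minimal base-R sketch of one way to do that, not part of the original answer. The helper name convert_dog_columns() is hypothetical, and the sample rows are copied from the output above so the block runs without network access. Stripping non-digit characters from the counts is a defensive assumption; the rows shown contain none.

```r
# Three rows copied from the printed output above; all columns are character.
dogs_df <- data.frame(
  breed_name = c("AFGHAN HOUND", "AIREDALE TERRIER", "AKBASH DOG"),
  tested     = c("165", "110", "16"),
  passed     = c("120", "86", "14"),
  failed     = c("45", "24", "2"),
  percent    = c("72.7%", "78.2%", "87.5%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert the count columns to integer and the
# percent column to numeric (for example "72.7%" becomes 72.7).
convert_dog_columns <- function(df) {
  count_cols <- c("tested", "passed", "failed")
  # Assumption: drop anything that is not a digit before converting.
  df[count_cols] <- lapply(
    df[count_cols],
    function(x) as.integer(gsub("[^0-9]", "", x))
  )
  # Remove the literal percent sign, then convert to numeric.
  df$percent <- as.numeric(sub("%", "", df$percent, fixed = TRUE))
  df
}

dogs_num <- convert_dog_columns(dogs_df)
str(dogs_num)
```

The same helper should also work on the full tibble returned by map_dfr(), since it only relies on column assignment by name, but that is untested here.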
