R scraping for image details from several thousand pages

Posted by fdbelqdn on 2022-12-20 in Other


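I am trying to scrape a website with an R script in order to gather details about its pictures.
What I need is:

Image name (1.jpg)
Image caption ("A recruit demonstrates the proper use of a CO2 portable fire extinguisher to put out a small outside fire.")
Photo credit ("Photo courtesy of: James Fortner")

There are over 16,000 files. Thankfully the URL has the form "... asp?photo=1, 2, 3, 4", so the base URL never changes and only the image number at the end does. I would like the script either to loop over a set range of numbers (I tell it where to start) or simply to stop when it reaches a page that does not exist.
With the code below I can get the photo caption, but only that one line. I would also like to get the photo credit, which is on a separate line; there are three line breaks between the main caption and the photo credit. I would be fine if the resulting table had two or three blank columns to account for those line breaks, since I can delete them later.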

library(rvest)
library(dplyr)

link = "http://fallschurchvfd.org/photovideo.asp?photo=1"
page = read_html(link)

caption = page %>% html_nodes(".text7 i") %>% html_text()

info = data.frame(caption, stringsAsFactors = FALSE)
write.csv(info, "photos.csv")

Source: https://stackoverflow.com/questions/74779657/r-scraping-for-image-details-from-several-thousand-pages

2 Answers


toe950271#

Scraping with rvest and tidyverse:

library(tidyverse)
library(rvest)

get_picture <- function(page) {
  cat("Scraping page", page, "\n")
  
  page <- str_c("http://fallschurchvfd.org/photovideo.asp?photo=", page) %>%
    read_html()
  
  tibble(
    image_name = page %>%  
      html_element(".text7 img") %>%
      html_attr("src"),
    caption = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist %>% 
      nth(1),
    credit = page %>%
      html_element(".text7") %>%
      html_text() %>%
      str_split(pattern = "\r\n\t\t\t\t") %>%
      unlist %>% 
      nth(3)
  )
}

# Get the first 1:50 
df <- map_dfr(1:50, possibly(get_picture, otherwise = tibble()))

# A tibble: 42 × 3
   image_name     caption                                   credit
   <chr>          <chr>                                     <chr> 
 1 /photos/1.jpg  Recruit Clay Hamric demonstrates the use… James…
 2 /photos/2.jpg  A recruit demonstrates the proper use of… James…
 3 /photos/3.jpg  Recruit Paul Melnick demonstrates the pr… James…
 4 /photos/4.jpg  Rescue 104                                James…
 5 /photos/5.jpg  Rescue 104                                James…
 6 /photos/6.jpg  Rescue 104                                James…
 7 /photos/15.jpg Truck 106 operates a ladder pipe from Wi… Jim O…
 8 /photos/16.jpg Truck 106 operates a ladder pipe as heav… Jim O…
 9 /photos/17.jpg Heavy fire vents from the roof area of t… Jim O…
10 /photos/18.jpg Arlington County Fire and Rescue Associa… James…
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
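
The output above jumps from 6.jpg to 15.jpg, which suggests that some photo numbers have no page, so stopping at the very first missing page would end the run too early. The following is a minimal sketch of one possible stopping rule: keep going until a chosen number of consecutive pages fail. The names `scrape_until_gap`, `fetch_one`, `max_misses` and `fake_fetch` are hypothetical and not part of rvest; `fake_fetch` is an offline stand-in so the example runs without touching the website.

```r
# Hypothetical driver: `fetch_one` is any function that takes a photo number and
# returns a one-row data frame, or raises an error / returns NULL for a missing page.
scrape_until_gap <- function(fetch_one, start = 1, max_misses = 20, max_pages = Inf) {
  rows <- list()
  misses <- 0
  i <- start
  while (misses < max_misses && (i - start) < max_pages) {
    row <- tryCatch(fetch_one(i), error = function(e) NULL)
    if (is.null(row) || nrow(row) == 0) {
      misses <- misses + 1                # count consecutive missing pages
    } else {
      misses <- 0                         # reset the counter after a success
      rows[[length(rows) + 1]] <- row     # collect rows in a list
    }
    i <- i + 1
  }
  if (length(rows) == 0) return(data.frame())
  do.call(rbind, rows)                    # bind everything once at the end
}

# Offline stand-in for a real fetcher (assumption for illustration only):
# pages 7 to 14 and everything above 20 are treated as missing.
fake_fetch <- function(i) {
  if ((i >= 7 && i <= 14) || i > 20) stop("page not found")
  data.frame(id = i, image = paste0("/photos/", i, ".jpg"), stringsAsFactors = FALSE)
}

result <- scrape_until_gap(fake_fetch, start = 1, max_misses = 10)
nrow(result)  # 12 rows: pages 1-6 and 15-20
```

In real use, `get_picture` from the answer above could be passed as `fetch_one`. The value of `max_misses` is a judgment call: it has to be larger than the longest gap in the numbering, which is not known in advance.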


qmb5sa222#

For the images themselves, you can use the command-line tool curl. For example, to download images 1.jpg through 100.jpg:

curl -O "http://fallschurchvfd.org/photos/[1-100].jpg"

For the R code, if you grab the whole .text7 section, you can then split it into the caption and the photo credit:

extractedtext <- page %>% html_nodes(".text7") %>% html_text()
caption <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1]
credit <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]

As a loop:

library(rvest)
library(tidyverse)

df<-data.frame(id=1:20, 
               image=NA,
               caption=NA,
               credit=NA)
for (i in 1:20){
  cat(i, " ") # to monitor progress and debug
  link <- paste0("http://fallschurchvfd.org/photovideo.asp?photo=", i)
  tryCatch({ # This is to avoid stopping on an error message for missing pages
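            # link is a character string, not a connection, so there is nothing to close;
            # calling close(link) would raise an error on every iteration
            page <- read_html(link)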
            df$image[i] <- page %>% html_nodes(".text7 img") %>% html_attr("src")
            extractedtext <- page %>% html_nodes(".text7") %>% html_text()
            df$caption[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][1] # This is an awkward way of saying "list 1, element 1"
            df$credit[i] <- str_split(extractedtext, "\r\n\t\t\t\t")[[1]][3]
            }, 
           error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

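I get inconsistent results with the current code; for example, page 15 has more line breaks than page 1.
TODO: make the string extraction more robust; switch to appending rows to the data frame (as opposed to preallocating and inserting).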
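
Since the number of line breaks differs between pages, splitting on the fixed string "\r\n\t\t\t\t" and picking element 3 is fragile. One possible way to make the extraction more tolerant, sketched in base R: split on any run of line breaks, trim the surrounding tabs and spaces, and drop the empty pieces. `split_caption_credit` is a hypothetical helper name, and treating the last non-empty line as the credit is an assumption about the page layout that has not been checked against every page.

```r
# Hypothetical helper: split the text of the .text7 block into caption and credit.
split_caption_credit <- function(text) {
  # Split on any run of carriage returns / newlines, whatever indentation follows
  pieces <- strsplit(text, "[\r\n]+")[[1]]
  pieces <- trimws(pieces)              # strip leading/trailing tabs and spaces
  pieces <- pieces[nzchar(pieces)]      # discard lines that are now empty
  if (length(pieces) == 0) {
    return(data.frame(caption = NA_character_, credit = NA_character_,
                      stringsAsFactors = FALSE))
  }
  data.frame(
    caption = pieces[1],
    # Assumption: when there is more than one line, the last one is the credit
    credit = if (length(pieces) > 1) pieces[length(pieces)] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Made-up inputs with different numbers of line breaks (not copied from the site)
three_breaks <- "Rescue 104\r\n\t\t\t\t\r\n\t\t\t\t\r\n\t\t\t\tPhoto courtesy of: James Fortner"
five_breaks  <- "Rescue 104\r\n\t\t\t\t\r\n\r\n\t\t\r\n\t\t\t\t\r\n\t\t\t\tPhoto courtesy of: James Fortner"

split_caption_credit(three_breaks)
split_caption_credit(five_breaks)
```

Inside the loop, the helper would be called on `extractedtext`, and the resulting one-row data frames could be collected in a list and combined once with `do.call(rbind, ...)`, which covers the append-style approach mentioned in the TODO.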
