How to force an R for loop to retry when it gets a 502 Bad Gateway while trying to download Google Maps shapefiles from a website

jfewjypa asked on 2023-07-31 in Other
Follow (0) | Answers (1) | Views (77)

I have a 343-row dataset that contains nothing but URLs to (343) websites hosting Google Earth KMZ shapefiles, which I have to download. I have a for loop written in R that works, except that the websites are somewhat unreliable and frequently return a "502 Bad Gateway" error; when the R loop hits that error it crashes and stops. Because the error is so frequent, I have not managed to download more than 29 of the 343 shapefiles I need. Do you know whether there is a way to force R to keep refreshing the website until it works (the error no longer occurs) and the desired shapefile downloads successfully, without skipping the link?
I attach the code here, including the packages I use (a generic retry sketch follows the listing):

loadandinstall <- function(mypkg) {
  if (!is.element(mypkg, installed.packages()[, 1])) {
    install.packages(mypkg, repos = "http://cran.r-project.org")
  }
  library(mypkg, character.only = TRUE)
}

loadandinstall("stringr")
loadandinstall("rvest")
loadandinstall("XML")
loadandinstall("maptools")
loadandinstall("rgeos")
loadandinstall("rgdal")
loadandinstall("foreign")
loadandinstall("raster")
loadandinstall("sp")
loadandinstall("parallel")
loadandinstall("snow")

#Read in the CSV with my 343 websites (embedded Google Maps spatial location data);
#basedir is set elsewhere to the folder holding the CSV
locdata <- read.csv(str_c(basedir, "AllPlantingLocationData_WebsiteSource_2019.csv"),
                    header = TRUE, colClasses = c("character", "character", "character"))

#Column 1: the website URLs to scrape
urlstring <- locdata[, 1]

#Column 2: placeholder character vector for the extracted Google Maps URLs
gurls <- locdata[, 2]

#Column 3: placeholder vector for the file names of the downloaded KMLs
names <- locdata[, 3]

for(i in 1:length(urlstring)){

  #Get the website source html code and convert to searchable list
  t<-as.list(readLines(urlstring[i]))
  
  #Locate the Google Maps URL within the source code
  m<-unlist(t[which(!is.na(str_locate(t,"<iframe src=")[,1]))])
  
  #Remove extra characters surrounding the Google Maps URL
  gurl<-unlist(strsplit(substring(str_c(m),14,nchar(m))," "))[1]
  gurl<-substring(gurl,1,nchar(gurl)-1)
  
  #Replace the "embed" command within the Google Maps URL with a "kml" command, to create a download-trigger URL
  gurls[i]<-str_c(gsub("embed","kml",gurl))
  
  #Download file
  browseURL(gurls[i])
  
  #Extract relevant file name into the new database
  Sys.sleep(10) ### Leave the newest file enough time to download ###
  tmpshot <- fileSnapshot("/Users/badiskhiari/Downloads/")
  names[i]<-rownames(tmpshot$info[which.max(tmpshot$info$mtime),])
  
  print(str_c("URL ",i," complete!!"))
  
}
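
A generic way to get the retry behaviour asked about here is to wrap the fragile call in tryCatch() inside a bounded loop, sleeping between attempts. A minimal sketch follows; the helper name retry_readLines and its defaults are illustrative, not part of the original code:

retry_readLines <- function(url, max_tries = 5, wait = 5) {
  for (attempt in seq_len(max_tries)) {
    # catch the error instead of letting it stop the loop
    result <- tryCatch(readLines(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: return the page source
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(wait)  # back off before retrying
  }
  stop("Giving up on ", url, " after ", max_tries, " attempts")
}

The readLines(urlstring[i]) call in the loop above could then be replaced with retry_readLines(urlstring[i]); the same wrapper pattern works around any step that throws on a 502.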


yh2wf1be

When downloading with httr2, the request retry policy can be controlled with httr2::req_retry(). The rest is just putting together a working example:

library(rvest)
library(httr2)
library(stringr)
library(purrr)
library(dplyr)
library(cli)

# test urls
urls <- c("https://ngp.denr.gov.ph/index.php?option=com_content&view=article&id=1721&catid=10&Itemid=101",
          "https://ngp.denr.gov.ph/index.php?option=com_content&view=article&id=370&catid=10&Itemid=101",
          "https://ngp.denr.gov.ph/index.php?option=com_content&view=article&id=704&catid=10&Itemid=101")

# scrape a list of urls with purrr::map & rvest,
# build a dataframe of Google Maps KMZ links and local file names
kmz_df <- map(urls, read_html, .progress = TRUE) %>% 
  map(\(html) html_element(html, '[itemprop = "articleBody"]') %>%
        {
          list( 
            site = html_element(., "h3") %>% html_text() %>% trimws(),
            gurl = html_element(., "iframe") %>% html_attr("src") %>% str_replace(fixed("/embed?"), "/kml?")
          )
        }
  ) %>% bind_rows() %>% 
  mutate(kmz = NA_character_)
#> ■■■■■■■■■■■■■■■■■■■■■ 67% | ETA: 2s

# resulting dataframe / tibble:
kmz_df
#> # A tibble: 3 × 3
#>   site                                      gurl                           kmz  
#>   <chr>                                     <chr>                          <chr>
#> 1 2017 NGP PLANTING SITES IN CENRO LIANGA   https://www.google.com/maps/d… <NA> 
#> 2 2011 NGP PLANTING SITES IN PENRO LA UNION https://www.google.com/maps/d… <NA> 
#> 3 2012 NGP PLANTING SITES IN CENRO SARA     https://www.google.com/maps/d… <NA>

# initial values for progress
idx <- 1
n <- nrow(kmz_df)

cli_progress_step("Donloading {idx}/{n} ...")
for (idx in seq_along(kmz_df$site)){
  cli_progress_update()
  
  # download only missing files - an interrupted loop can be resumed later and only
  # missing files will be downloaded (you can store / restore the kmz_df dataframe
  # to enable resuming between sessions)
  if (is.na(kmz_df$kmz[idx])){
    # download with httr2, retry max 5 times
    resp <- request(kmz_df$gurl[idx]) %>% 
      req_retry(max_tries = 5) %>% 
      req_perform(path = "out.kmz") 
    
    # after successful request, extract filename from response header,
    # rename the downloaded file and store the name in dataframe
    if (!resp_is_error(resp)){
      fname <- resp_header(resp, "Content-Disposition") %>% str_extract('(?<=filename\\=")[^"]+')
      file.rename("out.kmz", fname)
      kmz_df$kmz[idx] <- fname
    }else{
      cli_alert_warning("Failed to fetch KMZ for \"{kmz_df$site[idx]}\" ( {kmz_df$gurl[idx]} ): {resp_status_desc(resp)}")
    }
  }
}

cli_progress_message("Done")
#> ✔ Downloading 3/3 ... [8.2s]
#> Done
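
One caveat about the retry call above: httr2's default is_transient predicate only treats 429 and 503 responses as transient, so a plain req_retry(max_tries = 5) will not retry the 502 Bad Gateway the question is about. Passing a custom is_transient covers it; a sketch, where the exact status list is an assumption about what this server returns:

resp <- request(kmz_df$gurl[idx]) %>% 
  # treat 502 (plus the default 429/503) as transient so it gets retried
  req_retry(max_tries = 5,
            is_transient = function(resp) resp_status(resp) %in% c(429, 502, 503)) %>% 
  req_perform(path = "out.kmz")

Note also that req_perform() still raises an R error once the retries are exhausted, so to keep a long loop running over all 343 rows you may additionally want a tryCatch() around the request, as in the sketch after the question's code.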

# updated dataframe, with names of downloaded files:
kmz_df[,c("site", "kmz")]
#> # A tibble: 3 × 2
#>   site                                      kmz                  
#>   <chr>                                     <chr>                
#> 1 2017 NGP PLANTING SITES IN CENRO LIANGA   CENRO_LIANGA_2017.kmz
#> 2 2011 NGP PLANTING SITES IN PENRO LA UNION LAUNION2011N.kmz     
#> 3 2012 NGP PLANTING SITES IN CENRO SARA     CENROSARA2012.kmz

# downloaded files:
fs::dir_info(glob = "*.kmz")[1:3]
#> # A tibble: 3 × 3
#>   path                  type         size
#>   <fs::path>            <fct> <fs::bytes>
#> 1 CENROSARA2012.kmz     file        3.27M
#> 2 CENRO_LIANGA_2017.kmz file        2.13M
#> 3 LAUNION2011N.kmz      file       14.82K
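
The resume-between-sessions idea from the loop comment can be implemented by persisting kmz_df between runs. A minimal sketch; the file name kmz_progress.rds is an assumption:

# persist progress so a later session can pick up where this one stopped
saveRDS(kmz_df, "kmz_progress.rds")

# in a fresh session: restore the dataframe and rerun the download loop;
# rows whose kmz column is already filled in are skipped by the is.na() check
kmz_df <- readRDS("kmz_progress.rds")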

Created on 2023-07-21 with reprex v2.0.2
