RSelenium -> clicking checkboxes to download files

oewdyzsn  posted on 2023-05-04  in Other

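Basically, I want to automate downloading multiple files at once from this web page: http://alertario.rio.rj.gov.br/download/dados-pluviometricos/
I am currently following this tutorial: https://www.youtube.com/watch?v=BK_JBk_l5uQ, with the accompanying code here: https://github.com/ggSamoora/TutorialsBySamoora/blob/main/R_downloader_Tutorial.R
However, before I can download anything, I need to specify (select) several fields on the page.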

Can anyone help me automate this?
Here is what I have so far:

#install.packages("RSelenium")
#install.packages("netstat")
#install.packages("binman")

# load the necessary packages
library(tidyverse)
library(RSelenium)
library(netstat)

binman::list_versions("geckodriver")
# "0.32.1" "0.32.2" "0.33.0"

# connecting to selenium server
rs_driver_object <- rsDriver(browser = 'firefox',
                             port = free_port())

# access the client object
remDr <- rs_driver_object$client

# open a web browser
remDr$open()

# navigate to the website containing the database
remDr$navigate("http://alertario.rio.rj.gov.br/download/dados-pluviometricos/")

I am hoping to download all the data available on this web page for a research project.

Answer #1 (pw9qyyiw)

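This particular problem doesn't actually require RSelenium, so if you are open to a more conventional approach, this answer may work for you. The website uses a POST request to pull the data down as a zip file, so all we need to do is issue our own POST request. I prefer the httr package, but you can use whichever you like.
We get the information we need (body and headers) by submitting a single request manually and inspecting its contents in Chrome's devtools.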

In this case, the important pieces are the request URL:
http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/
the request headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 949
Content-Type: application/x-www-form-urlencoded
Cookie: _ga=GA1.4.856843378.1683144629; _gid=GA1.4.1868063308.1683144629; BIGipServer~interno~pool_websempre_http=rd1o00000000000000000000ffff0a02df72o80; _gat=1; TS01a4bab6=01a427213d9189188aaff0fbe3a73727c18f2fc4dc5b10c78d93f8867a481703e3840be508d0f33440460c6f7de39c5d1e4e830651541ff39ba0a1913d3ce11fc2a21fb05d; TS97dc297c027=087c8a1c25ab2000cfabcd20d6f9ccacc7398aab844381c3d40417e34a9d0935ff715cc2f2b63ac208608fab44113000be24740de9c4c96989c4111d3b3ee12ea1b6b438e414a7536af359832b14805819bed161727885be6f511f76c8015cd3
DNT: 1
Host: websempre.rio.rj.gov.br
Origin: http://websempre.rio.rj.gov.br
Referer: http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36

and the request body (from the "Payload" tab):

csrfmiddlewaretoken=t0QnQFV4xRrh3eXmIHaxGRAXCDeDBM7F&1-check=on&1-choice=1997&2-check=on&2-choice=1997&3-check=on&3-choice=1997&4-check=on&4-choice=1997&5-check=on&5-choice=1997&6-check=on&6-choice=1997&7-check=on&7-choice=1997&8-check=on&8-choice=1997&9-check=on&9-choice=1997&10-check=on&10-choice=1997&11-check=on&11-choice=1997&12-check=on&12-choice=1997&13-check=on&13-choice=1997&14-check=on&14-choice=1997&15-check=on&15-choice=1997&16-check=on&16-choice=1997&17-check=on&17-choice=1997&18-check=on&18-choice=1997&19-check=on&19-choice=1997&20-check=on&20-choice=1997&21-check=on&21-choice=1997&22-check=on&22-choice=1997&23-check=on&23-choice=1997&24-check=on&24-choice=1997&25-check=on&25-choice=1997&26-check=on&26-choice=1997&27-check=on&27-choice=1997&28-check=on&28-choice=1997&29-check=on&29-choice=1997&30-check=on&30-choice=1997&31-check=on&31-choice=1997&32-check=on&32-choice=1997&33-check=on&33-choice=1997&all-chek=on&choice=1997

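All that's left is to encode these in an R-friendly way. We need the request body as a named character vector, which we then convert to a list, so we use strsplit(), separate(), and pull() (the latter two from the tidyr and dplyr packages, respectively):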

chromebody <- "csrfmiddlewaretoken=pZOjhFqzBVeajAXAWhuNOctqSJ1GU04t&1-check=on&1-choice=1997&2-check=on&2-choice=1997&3-check=on&3-choice=1997&4-check=on&4-choice=1997&5-check=on&5-choice=1997&6-check=on&6-choice=1997&7-check=on&7-choice=1997&8-check=on&8-choice=1997&9-check=on&9-choice=1997&10-check=on&10-choice=1997&11-check=on&11-choice=1997&12-check=on&12-choice=1997&13-check=on&13-choice=1997&14-check=on&14-choice=1997&15-check=on&15-choice=1997&16-check=on&16-choice=1997&17-check=on&17-choice=1997&18-check=on&18-choice=1997&19-check=on&19-choice=1997&20-check=on&20-choice=1997&21-check=on&21-choice=1997&22-check=on&22-choice=1997&23-check=on&23-choice=1997&24-check=on&24-choice=1997&25-check=on&25-choice=1997&26-check=on&26-choice=1997&27-check=on&27-choice=1997&28-check=on&28-choice=1997&29-check=on&29-choice=1997&30-check=on&30-choice=1997&31-check=on&31-choice=1997&32-check=on&32-choice=1997&33-check=on&33-choice=1997&all-chek=on&choice=1997"
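library(tidyr)  # separate()
library(dplyr)  # pull() and the %>% pipe

# Split the payload into "name=value" pairs, then into a named list
body <- strsplit(chromebody, "&")[[1]] %>%
  data.frame(init = .) %>%
  separate(init, into = c("name", "value"), sep = "=") %>%
  pull(value, name) %>%
  as.list()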
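
Because the captured payload follows a regular pattern (an `N-check`/`N-choice` pair for each of 33 entries, followed by `all-chek` and `choice`), the same named list could also be built programmatically instead of pasted from devtools. This is a minimal sketch that assumes the pattern holds for every year; `build_body()` and its arguments are hypothetical names, not part of httr or of the site.

```r
# Hypothetical helper: rebuild the form fields seen in the captured payload.
# Assumes 33 numbered entries, as in the payload above.
build_body <- function(token, year, n = 33) {
  year <- as.character(year)
  idx <- seq_len(n)
  # Interleave "i-check" = "on" and "i-choice" = year for i = 1..n
  nms <- as.vector(rbind(paste0(idx, "-check"), paste0(idx, "-choice")))
  vals <- as.vector(rbind(rep("on", n), rep(year, n)))
  c(
    list(csrfmiddlewaretoken = token),
    setNames(as.list(vals), nms),
    # "all-chek" is spelled this way in the captured payload
    list(`all-chek` = "on", choice = year)
  )
}

# Placeholder token; a real one has to come from the site
body_1997 <- build_body("your-token-here", 1997)
names(body_1997)[1:3]
```

The result has the same field names, in the same order, as the list produced by the strsplit() approach, so either can be passed as the body of the request.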

The middleware token (csrfmiddlewaretoken) also appears to change with every request, so you will probably need to supply your own.
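
One possible way to get a fresh token is to read it from the form page before posting. The sketch below assumes the token is embedded in the page as a hidden `csrfmiddlewaretoken` input whose `name` attribute comes before `value` (a common layout for such forms, but not verified for this site); `extract_token()` is a hypothetical helper.

```r
# Hypothetical helper: pull the CSRF token out of the form page's HTML.
# Assumes markup like <input type="hidden" name="csrfmiddlewaretoken" value="...">
extract_token <- function(html) {
  pattern <- "name=[\"']csrfmiddlewaretoken[\"'][^>]*value=[\"']([^\"']+)[\"']"
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) stop("csrfmiddlewaretoken not found in page")
  m[2]
}

# Offline example with made-up markup:
sample_html <- "<input type='hidden' name='csrfmiddlewaretoken' value='abc123' />"
extract_token(sample_html)

# Against the live page one might try (untested):
# page  <- httr::GET("http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/")
# token <- extract_token(httr::content(page, as = "text", encoding = "UTF-8"))
```

A token obtained this way would presumably only be accepted together with the cookies from the same session, so the Cookie header may need to be refreshed at the same time.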
Next, add the necessary headers as a named vector passed to add_headers():

heads <- add_headers(c(
  Accept="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
  `Accept-Encoding`="gzip, deflate",
  `Accept-Language`="en-US,en;q=0.9",
  `Cache-Control`="max-age=0",
  Connection="keep-alive",
  # `Content-Length`="561",
  # `Content-Type`="application/x-www-form-urlencoded",
  Cookie="_ga=GA1.4.856843378.1683144629; _gid=GA1.4.1868063308.1683144629; _gat=1; BIGipServer~interno~pool_websempre_http=rd1o00000000000000000000ffff0a02df72o80; TS01a4bab6=01a427213d59269d7c5c5786c4e31eb85e255c954942cbdc35eef8018262ac13b6852857ef0c654e412681b104aa44a4962091e7352e34338e28f74cd80e4856cf86705e54; TS97dc297c027=087c8a1c25ab2000d80e586f48bcd94dd8b861f67ab2f6791bcc83cfd93a01b01d36fddcc19b973a08ae4bf3f211300008ca4399aa2b8c0ad52ff748804cba793ea0af1daa5e9b0b374cba61997313ecab8ca3cd8268dd0a9172e0dc4788c29e",
  Host="websempre.rio.rj.gov.br",
  Origin="http://websempre.rio.rj.gov.br",
  Referer="http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/",
  `User-Agent`="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
))

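I commented out Content-Length and Content-Type because they caused problems, but everything else is essentially copied verbatim from the devtools panel. The cookie here may also change; you may need to substitute your own values after submitting a request yourself.
Then all we need to do is issue the POST request with these parameters. I use write_disk() here because I don't know how to decode/unzip the response in memory. I simply write the file to my Downloads folder, but you will probably want to change the path to suit your working directory.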

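base_url <- "http://websempre.rio.rj.gov.br/dados/pluviometricos/plv/"  # the request URL captured above
post_response <- POST(base_url, body = body, config = heads, write_disk(path = "~/../Downloads/tempfile.zip", overwrite = TRUE))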
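
If the token or cookies are rejected, the server may return an HTML error page instead of an archive, and write_disk() will save it under the .zip name regardless. A small sketch of a sanity check follows, assuming only that a valid zip file starts with the bytes `PK`; `looks_like_zip()` is a hypothetical helper.

```r
# Hypothetical helper: TRUE if the file starts with the zip signature "PK" (0x50 0x4B).
looks_like_zip <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2)
  identical(magic, as.raw(c(0x50, 0x4b)))
}

# Offline demonstration with two throwaway files:
good <- tempfile(fileext = ".zip")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), good)
bad <- tempfile(fileext = ".zip")
writeBin(charToRaw("<html>Forbidden</html>"), bad)

looks_like_zip(good)
looks_like_zip(bad)
```

Once a downloaded file passes this check, `utils::unzip(path, exdir = ...)` can extract it.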

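Note, however, that this only pulls down the files for one year. You will have to write a quick loop that pulls the files for each year by replacing "1997" in the current request body with 1998, 1999, and so on.
Finally, bear in mind that you are hitting the server that hosts this data with a lot of requests, so be mindful of your request rate.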
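
The loop and the pause between requests could be sketched as follows. This is an untested outline: `swap_year()` and `download_years()` are hypothetical helpers, the output file names and the two-second delay are arbitrary choices, and the token and cookies are assumed to stay valid across requests, which may not hold.

```r
# Hypothetical helper: set every "choice" field ("1-choice", ..., "choice") to `year`.
swap_year <- function(body, year) {
  is_choice <- grepl("choice$", names(body))
  body[is_choice] <- as.character(year)
  body
}

# Hypothetical helper: one POST per year, with a pause between requests.
# Requires the httr package when it is actually called.
download_years <- function(base_url, body, heads, years, out_dir, pause = 2) {
  for (yr in years) {
    dest <- file.path(out_dir, paste0("pluviometricos_", yr, ".zip"))
    httr::POST(base_url, body = swap_year(body, yr), config = heads,
               httr::write_disk(path = dest, overwrite = TRUE))
    Sys.sleep(pause)  # keep the request rate low
  }
  invisible(NULL)
}

# Small offline check of the substitution:
demo <- list(csrfmiddlewaretoken = "x", `1-check` = "on",
             `1-choice` = "1997", choice = "1997")
swap_year(demo, 1998)
```

With the request URL stored in a variable and `body` and `heads` defined as above, a call might look like `download_years(base_url, body, heads, years = 1997:1999, out_dir = tempdir())`.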
