Clicking a submit link in rvest

cnh2zyt3  posted on 2023-03-27  in Other

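I am trying to scrape data from a website using rvest. I read in the page's HTML and extract the form. I then modify the form with rvest::html_form_set and submit it. After inspecting the form, I realized it has no submit button. The button available on the site is an anchor tag whose href is a script. I tried rvest::session_follow_link(), but could not get the data. This is the code that does not work: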

library(rvest)  # also provides the %>% pipe used below

trademark_search_page <- rvest::session('https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx')
search_form <- rvest::html_form(trademark_search_page)[[1]]

# fill in the word mark and class fields
search_form <- search_form %>%
  rvest::html_form_set(
    `ctl00$ContentPlaceHolder1$TBWordmark` = 'Bull',
    `ctl00$ContentPlaceHolder1$TBClass` = 32
  )

# submit the form, then try to follow the search link
resp <- trademark_search_page %>%
  rvest::session_submit(search_form) %>%
  rvest::session_follow_link(xpath = '//a[@id = "ContentPlaceHolder1_BtnSearch"]')

Any suggestions on what I should do?

svujldwt #1

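I think this will be tricky with rvest, because the button's href refers to a JavaScript script rather than a URL, and rvest does not execute JavaScript. Driving a real browser with RSelenium avoids the problem.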
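To see why the link cannot be followed, it may help to inspect the anchor's href. The following is a minimal sketch on an inline HTML snippet, not the live site: the `__doPostBack` markup is an assumption based on how ASP.NET link buttons are commonly rendered, and the real page may differ.

```r
library(rvest)

# Assumed markup for illustration only (typical ASP.NET link button);
# the actual page source may look different.
page <- minimal_html('
  <a id="ContentPlaceHolder1_BtnSearch"
     href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$BtnSearch&#39;,&#39;&#39;)">Search</a>
')

# Pull the href of the search button
href <- page %>%
  html_element("#ContentPlaceHolder1_BtnSearch") %>%
  html_attr("href")

# TRUE means the link runs a script instead of pointing at a URL,
# so there is no address for session_follow_link() to request
is_js_link <- startsWith(href, "javascript:")
print(is_js_link)
```

If the check comes back TRUE on the real page, something has to execute that script, which is what the browser driven by RSelenium in the code that follows does.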

# load libraries
library(RSelenium)

# define url ---------------------------------------------------------
url <- "https://ipindiaonline.gov.in/tmrpublicsearch/frmmain.aspx"

# define search terms ------------------------------
word_mark <- "Bull"
class_search_term <- "32"

# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4548L, chromever = NULL)
remDr <- rD[["client"]]

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)

# fill in the form ------------------------------------------------
# this finds the html element for each part of the form
# and fills it in with the value we want

# Wordmark
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBWordmark")$sendKeysToElement(list(word_mark))

# Class
remDr$findElement(using = "id", value = "ContentPlaceHolder1_TBClass")$sendKeysToElement(list(class_search_term))

# click submit button ---------------------------------------

remDr$findElements("id", "ContentPlaceHolder1_BtnSearch")[[1]]$clickElement()

Here is what the page this leads to looks like:

Once you are on this page, you can use rvest to get the list of "more details" links:

library(rvest)
library(magrittr)

# pull html from page
html <- remDr$getPageSource()[[1]]

# find all the html elements with the .LnkshowDetails class

more_details_buttons <- html %>% read_html() %>% 
  html_elements(".LnkshowDetails") %>%
  html_attr("id")

You can then loop over all the buttons and either click each one or pull the data directly.
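The loop could look roughly like the sketch below. `collect_details()` is a hypothetical helper, not part of RSelenium, and it assumes that each click leaves the other buttons on the page and that the details appear within a fixed pause. If a click navigates away from the results instead, you would likely need to call `remDr$goBack()` before the next iteration.

```r
# Hypothetical helper (not an RSelenium function): click each element id in
# turn and keep the page source captured after every click.
collect_details <- function(client, ids, pause = 1) {
  pages <- vector("list", length(ids))
  names(pages) <- ids
  for (id in ids) {
    # locate the button by id and click it
    client$findElement(using = "id", value = id)$clickElement()
    # crude wait; assumes the details load within `pause` seconds
    Sys.sleep(pause)
    pages[[id]] <- client$getPageSource()[[1]]
  }
  pages
}

# Usage, assuming `remDr` is the client from above and `ids` is the
# character vector of button ids extracted with rvest:
# detail_pages <- collect_details(remDr, ids)
# parsed <- lapply(detail_pages, rvest::read_html)
#
# When finished, close the browser and stop the Selenium server:
# remDr$close()
# rD$server$stop()
```

Returning the raw page sources keeps the browser work separate from the parsing, so the rvest selectors can be adjusted afterwards without clicking through the site again.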
