Web scraping with Selenium: page fails to load even with a timeout

kgqe7b3p  posted on 2023-04-18 in Other

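I'm having trouble getting a dynamic web page to render before I scrape its CSS elements.
I'm trying RSelenium on Airbnb, attempting to scrape a sample listing in San Francisco. On Airbnb, clicking a listing opens a new window that shows the listing details. I can't get this details page to display.
I'm hosting a Selenium server through Docker, using the standalone-firefox:2.53.0 image.
R script: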

library(RSelenium)
url<- "https://www.airbnb.com/s/san-francisco/homes?adults=1"

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L
)
remDr$open()
remDr$setTimeout(type = "page load", milliseconds = 30000)
remDr$setImplicitWaitTimeout(milliseconds = 10000) 

#otherwise too fast need to wait for the page to load.
remDr$navigate(url)
#remDr$navigate(paste0(urls[[1]]))

#listings <- remDr$findElements(using = "css selector",'._8s3ctt')

remDr$screenshot(display=T)

remDr$findElements(using = "css selector",'._8s3ctt')[[1]]$clickElement()
id <- remDr$getWindowHandles()
remDr$switchToWindow(id[[2]][1])
price_night <- remDr$findElements(using="css selector","._tyxjp1")
descrpt <- remDr$findElements(using="css selector","._tqmy57")
parking <- remDr$findElements(using="css selector","._6c4wvw")

No matter how many milliseconds I set in remDr$setTimeout, the details page never displays. Calling remDr$screenshot(display=TRUE) produces the following image:

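This seems to suggest that the page had not finished loading before I started looking for the CSS elements I want to scrape.
Here is an excerpt from the log on the Selenium server: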

19:30:05.490 INFO - Executing: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]])
19:30:05.491 INFO - Creating a new session for Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]
19:30:06.810 INFO - Done: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]]
19:30:18.650 INFO - Executing: [page load wait: 30000])
19:30:18.655 INFO - Done: [page load wait: 30000]
19:30:20.084 INFO - Executing: [implicitly wait: 10000])
19:30:20.089 INFO - Done: [implicitly wait: 10000]
19:30:26.267 INFO - Executing: [delete session: 5da351c5-bd0e-4a95-a357-c049b71ed680])
19:30:26.377 INFO - Done: [delete session: 5da351c5-bd0e-4a95-a357-c049b71ed680]
19:30:58.763 INFO - Executing: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]])
19:30:58.764 INFO - Creating a new session for Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]
19:30:59.913 INFO - Done: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]]
19:31:00.922 INFO - Executing: [page load wait: 30000])
19:31:00.927 INFO - Done: [page load wait: 30000]
19:31:01.675 INFO - Executing: [implicitly wait: 10000])
19:31:01.680 INFO - Done: [implicitly wait: 10000]
19:31:03.800 INFO - Executing: [get: https://www.airbnb.com/s/san-francisco/homes?adults=1])
19:31:05.301 INFO - Done: [get: https://www.airbnb.com/s/san-francisco/homes?adults=1]
19:31:10.700 INFO - Executing: [find elements: By.cssSelector: ._8s3ctt])
19:31:10.741 INFO - Done: [find elements: By.cssSelector: ._8s3ctt]
19:31:10.806 INFO - Executing: [click: 0 [[FirefoxDriver: firefox on LINUX (1741f648-be7b-48e0-96c9-0c1d2e14a498)] -> css selector: ._8s3ctt]])
19:31:10.962 INFO - Done: [click: 0 [[FirefoxDriver: firefox on LINUX (1741f648-be7b-48e0-96c9-0c1d2e14a498)] -> css selector: ._8s3ctt]]
19:31:14.947 INFO - Executing: [get window handles])
19:31:14.950 INFO - Done: [get window handles]
19:31:15.860 INFO - Executing: [switch to window: {679b54f5-ec42-4ba6-8939-cb7b0d40a7b9}])
19:31:15.864 INFO - Done: [switch to window: {679b54f5-ec42-4ba6-8939-cb7b0d40a7b9}]
19:31:20.580 INFO - Executing: [find elements: By.cssSelector: ._tyxjp1])
19:31:30.592 INFO - Done: [find elements: By.cssSelector: ._tyxjp1]
19:31:44.099 INFO - Executing: [take screenshot])
19:31:44.145 INFO - Done: [take screenshot]

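I don't see anything wrong on the server side, but I may be mistaken.
Are the timeouts actually being applied? If not, is there another way to make sure the page has fully loaded before scraping?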

Answer #1 (oaxa6hgo)

I was able to extract information from the page with the following code:

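library(RSelenium)
# Start a Selenium Firefox container, mapping host port 4446 to the container's port 4444.
# Note: shell() exists only on Windows; on Linux or macOS use system() instead.
shell('docker run -d -p 4446:4444 selenium/standalone-firefox')
# Give the container a few seconds to start before connecting.
Sys.sleep(5)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4446L, browserName = "firefox")

remDr$open()
remDr$navigate("https://www.airbnb.com/s/san-francisco/homes?adults=1")

# Locate the first listing title by its absolute XPath (brittle: it breaks if the page layout changes).
web_Obj <- remDr$findElement("xpath", '/html/body/div[5]/div/div/div[1]/div/div[2]/div/div/div/div/div/div[2]/main/div[2]/div[2]/div/div/div/div/div/div[2]/div/div[2]/div/div/div/div[1]/div/div[1]/div[1]')
web_Obj$getElementText()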

[[1]]
[1] "Hotel room in Nob Hill"
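
On the question of whether the timeouts are applied: in the log excerpt from the question, the find elements call for ._tyxjp1 starts at 19:31:20.580 and finishes at 19:31:30.592, roughly ten seconds later. That matches the 10000 ms implicit wait, so the timeout does appear to be honoured; the elements simply never showed up in that window. If you want to keep the click-through approach, one option is to poll explicitly for each step instead of relying on the page load timeout. The following is only a rough sketch: wait_until and open_first_listing are hypothetical helper names I made up, the class selectors are the ones from the question (they are generated names and may have changed since), and I have not run this against the live site.

```r
# Poll `condition` (a zero-argument function) until it returns TRUE or
# `timeout` seconds have elapsed. Errors raised by `condition` are treated
# as "not ready yet". Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Hypothetical wrapper around the steps from the question: wait for the
# listing cards, click the first one, wait for the new window, switch to it,
# then wait for the price element. Selectors are assumptions taken from the
# question and may no longer match the site.
open_first_listing <- function(remDr, timeout = 30) {
  cards_ready <- wait_until(function() {
    length(remDr$findElements(using = "css selector", "._8s3ctt")) > 0
  }, timeout)
  if (!cards_ready) stop("listing cards did not appear")

  remDr$findElements(using = "css selector", "._8s3ctt")[[1]]$clickElement()

  # The details page opens in a second window; wait until its handle exists.
  window_ready <- wait_until(function() {
    length(remDr$getWindowHandles()) >= 2
  }, timeout)
  if (!window_ready) stop("details window did not open")

  handles <- remDr$getWindowHandles()
  remDr$switchToWindow(handles[[2]])

  # Returns TRUE if the price element showed up in time, FALSE otherwise.
  wait_until(function() {
    length(remDr$findElements(using = "css selector", "._tyxjp1")) > 0
  }, timeout)
}
```

With an open session you would call open_first_listing(remDr) after remDr$navigate(url) and only read price_night, descrpt and parking if it returns TRUE. If it returns FALSE, taking a screenshot at that point should show what the browser was actually displaying when the wait gave up.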
