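I'm having trouble getting some dynamic web pages to display before I scrape their CSS elements.
I'm trying RSelenium on AirBNB, attempting to scrape a sample listing in San Francisco. On AirBNB, clicking a listing opens a new window showing the listing's details. I can't get this details page to display.
I'm hosting a Selenium server through Docker, using the standalone-firefox:2.53.0 image.
R script: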
library(RSelenium)
url<- "https://www.airbnb.com/s/san-francisco/homes?adults=1"
remDr <- remoteDriver(
remoteServerAddr = "localhost",
port = 4445L
)
remDr$open()
remDr$setTimeout(type = "page load", milliseconds = 30000)
remDr$setImplicitWaitTimeout(milliseconds = 10000)
#otherwise too fast need to wait for the page to load.
remDr$navigate(url)
#remDr$navigate(paste0(urls[[1]]))
#listings <- remDr$findElements(using = "css selector",'._8s3ctt')
remDr$screenshot(display = TRUE)
remDr$findElements(using = "css selector",'._8s3ctt')[[1]]$clickElement()
id <- remDr$getWindowHandles()
remDr$switchToWindow(id[[2]][1])
price_night <- remDr$findElements(using="css selector","._tyxjp1")
descrpt <- remDr$findElements(using="css selector","._tqmy57")
parking <- remDr$findElements(using="css selector","._6c4wvw")
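No matter how many milliseconds I set in remDr$setTimeout, the details page never displays. Calling remDr$screenshot(display=TRUE) produces the following image:
This seems to indicate that I start looking for the CSS elements I want to scrape before the page has fully loaded.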
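One way to check this is to poll for a condition instead of relying only on the session timeouts. The following is a minimal sketch, not a confirmed fix: `wait_until` is a hypothetical helper (it is not part of RSelenium), and the commented usage assumes the live `remDr` session and the selectors from the script above.

```r
# Hypothetical helper, not part of RSelenium: call `condition()` repeatedly
# until it returns TRUE or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout; errors count as "not yet".
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Sketch of use with the session from the script (needs a live `remDr`):
# remDr$findElements(using = "css selector", "._8s3ctt")[[1]]$clickElement()
# opened <- wait_until(function() length(remDr$getWindowHandles()) >= 2)
# if (opened) remDr$switchToWindow(remDr$getWindowHandles()[[2]])
# loaded <- wait_until(
#   function() length(remDr$findElements(using = "css selector", "._tyxjp1")) > 0,
#   timeout = 30
# )
```

Waiting for a second window handle before switching would distinguish "the new window never opened" from "the details page opened but its elements are not there yet".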
Attached is an excerpt from the Selenium server log:
19:30:05.490 INFO - Executing: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]])
19:30:05.491 INFO - Creating a new session for Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]
19:30:06.810 INFO - Done: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]]
19:30:18.650 INFO - Executing: [page load wait: 30000])
19:30:18.655 INFO - Done: [page load wait: 30000]
19:30:20.084 INFO - Executing: [implicitly wait: 10000])
19:30:20.089 INFO - Done: [implicitly wait: 10000]
19:30:26.267 INFO - Executing: [delete session: 5da351c5-bd0e-4a95-a357-c049b71ed680])
19:30:26.377 INFO - Done: [delete session: 5da351c5-bd0e-4a95-a357-c049b71ed680]
19:30:58.763 INFO - Executing: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]])
19:30:58.764 INFO - Creating a new session for Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]
19:30:59.913 INFO - Done: [new session: Capabilities [{nativeEvents=true, browserName=firefox, javascriptEnabled=true, version=, platform=ANY}]]
19:31:00.922 INFO - Executing: [page load wait: 30000])
19:31:00.927 INFO - Done: [page load wait: 30000]
19:31:01.675 INFO - Executing: [implicitly wait: 10000])
19:31:01.680 INFO - Done: [implicitly wait: 10000]
19:31:03.800 INFO - Executing: [get: https://www.airbnb.com/s/san-francisco/homes?adults=1])
19:31:05.301 INFO - Done: [get: https://www.airbnb.com/s/san-francisco/homes?adults=1]
19:31:10.700 INFO - Executing: [find elements: By.cssSelector: ._8s3ctt])
19:31:10.741 INFO - Done: [find elements: By.cssSelector: ._8s3ctt]
19:31:10.806 INFO - Executing: [click: 0 [[FirefoxDriver: firefox on LINUX (1741f648-be7b-48e0-96c9-0c1d2e14a498)] -> css selector: ._8s3ctt]])
19:31:10.962 INFO - Done: [click: 0 [[FirefoxDriver: firefox on LINUX (1741f648-be7b-48e0-96c9-0c1d2e14a498)] -> css selector: ._8s3ctt]]
19:31:14.947 INFO - Executing: [get window handles])
19:31:14.950 INFO - Done: [get window handles]
19:31:15.860 INFO - Executing: [switch to window: {679b54f5-ec42-4ba6-8939-cb7b0d40a7b9}])
19:31:15.864 INFO - Done: [switch to window: {679b54f5-ec42-4ba6-8939-cb7b0d40a7b9}]
19:31:20.580 INFO - Executing: [find elements: By.cssSelector: ._tyxjp1])
19:31:30.592 INFO - Done: [find elements: By.cssSelector: ._tyxjp1]
19:31:44.099 INFO - Executing: [take screenshot])
19:31:44.145 INFO - Done: [take screenshot]
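The timestamps in this excerpt can be used to see how long each command took. The sketch below pairs two Executing/Done entries copied from the log and subtracts their timestamps; `to_seconds` is an illustrative helper name, and the pairing assumes the lines strictly alternate, as they do here.

```r
# Two Executing/Done pairs copied from the log excerpt above.
log_lines <- c(
  "19:31:10.700 INFO - Executing: [find elements: By.cssSelector: ._8s3ctt])",
  "19:31:10.741 INFO - Done: [find elements: By.cssSelector: ._8s3ctt]",
  "19:31:20.580 INFO - Executing: [find elements: By.cssSelector: ._tyxjp1])",
  "19:31:30.592 INFO - Done: [find elements: By.cssSelector: ._tyxjp1]"
)

# Convert a leading "HH:MM:SS.mmm" stamp to seconds since midnight.
to_seconds <- function(line) {
  parts <- as.numeric(strsplit(substr(line, 1, 12), ":", fixed = TRUE)[[1]])
  parts[1] * 3600 + parts[2] * 60 + parts[3]
}

stamps <- vapply(log_lines, to_seconds, numeric(1), USE.NAMES = FALSE)
is_start <- grepl("Executing:", log_lines, fixed = TRUE)

# Assumes each Executing line is immediately followed by its Done line.
durations <- stamps[!is_start] - stamps[is_start]
print(round(durations, 3))
```

The lookup for ._8s3ctt returns in about 0.04 seconds, while the lookup for ._tyxjp1 takes about 10 seconds. That is consistent with the 10000 ms implicit wait being applied and expiring without the element being found.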
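I don't see anything wrong on the server side, but I could be mistaken.
Is the timeout actually being applied? If not, is there another way to make sure the page has fully loaded before scraping?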
1 answer
I was able to extract the information from the page using the following code: