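I want to scrape flashscore.com, a site whose content is rendered entirely by JavaScript when the page is visited. I am using HtmlUnit to render it, and I have already hit the first problem: the page cannot be scraped at all.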
@PostMapping("/startScraping")
public ResponseEntity<FlashScraper> startScraping(@NonNull @RequestBody FlashScraper flashScraper)
        throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    logger.info("startScraping request incoming");
    logger.info("Call URL: " + flashScraper.getScrapeUrl());
    final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
    HtmlPage scrapePage = webClient.getPage(flashScraper.getScrapeUrl());
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.waitForBackgroundJavaScript(3000);
    System.out.println(scrapePage.getByXPath("//*[@id=\"g_25_rwPxTVj1\"]"));
    return new ResponseEntity<>(flashScraper, HttpStatus.OK);
}
After sending a POST request to the startScraping endpoint, I get the following exception:
2021-07-04 14:43:57.569 WARN 14872 --- [nio-8080-exec-2] c.g.htmlunit.DefaultCssErrorHandler : CSS warning: 'https://www.flashscore.com/res/_fs/build/livetableresponsive.2da0223.css' [1:8910] Ignoring the whole rule.
2021-07-04 14:43:58.035 WARN 14872 --- [nio-8080-exec-2] c.g.htmlunit.IncorrectnessListenerImpl : Obsolete content type encountered: 'text/javascript'.
2021-07-04 14:43:58.175 ERROR 14872 --- [nio-8080-exec-2] c.g.h.j.DefaultJavaScriptErrorListener : Error during JavaScript execution com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function entries in object function Object() { [native code] }. (script in https://www.flashscore.com/unsupported/ from (31, 9) to (53, 10)#35)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:949) ~[htmlunit-2.50.0.jar:2.50.0]
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:598) ~[htmlunit-core-js-2.50.0.jar:na]
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:487) ~[htmlunit-core-js-2.50.0.jar:na]
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:353) ~[htmlunit-2.50.0.jar:2.50.0]
Do you have any idea where the problem might be?
Thanks.
3 Answers
kb5ga3dv1#
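Please set the options before fetching the page.
I expect the error will still occur, but because you set setThrowExceptionOnScriptError(false), execution will no longer stop when it does.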
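As a minimal sketch of that ordering, assuming HtmlUnit 2.50 (the version in the stack trace) is on the classpath: the class name OptionsFirstScraper and its two methods are made up for illustration, while the HtmlUnit calls are the same ones used in the question. It only shows the options being applied before getPage(); it is not claimed to get past the site's browser check.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

// Hypothetical helper class, named here only for illustration.
class OptionsFirstScraper {

    // Creates a client with all options set up front, before any request is made.
    static WebClient newConfiguredClient() {
        WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        webClient.getOptions().setJavaScriptEnabled(true);
        // Script errors are still logged, but no longer abort the page load.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        return webClient;
    }

    // Loads the URL, gives background scripts up to waitMillis to finish,
    // and returns the rendered DOM as XML text.
    static String loadRenderedXml(String url, long waitMillis) throws IOException {
        // try-with-resources closes the client and its windows when done.
        try (WebClient webClient = newConfiguredClient()) {
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        }
    }
}
```

In the controller from the question, this would be called as OptionsFirstScraper.loadRenderedXml(flashScraper.getScrapeUrl(), 3000).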
wmtdaxz32#
After changing the order of the calls, I still get the error, and the HtmlUnit browser is also redirected to flashscore.com/unsupported.
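One way to confirm that redirect in code rather than from the log is to inspect the URL of the page that was actually loaded. The sketch below uses only the JDK; the class and method names are hypothetical, and the "/unsupported" path prefix is taken from the stack trace and the redirect described here, so it is an assumption about how the site currently behaves.

```java
import java.net.URI;

// Hypothetical helper (not part of HtmlUnit): reports whether the URL a
// client ended up on is the site's "unsupported browser" page.
class UnsupportedRedirectCheck {

    static boolean isUnsupportedRedirect(String finalUrl) {
        String path = URI.create(finalUrl).getPath();
        // getPath() can be null for opaque URIs, so guard before matching.
        return path != null && path.startsWith("/unsupported");
    }

    public static void main(String[] args) {
        System.out.println(isUnsupportedRedirect("https://www.flashscore.com/unsupported/"));
        System.out.println(isUnsupportedRedirect("https://www.flashscore.com/"));
    }
}
```

With HtmlUnit, the string to pass in would be scrapePage.getUrl().toString(), read after getPage() has returned.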
dgenwo3n3#
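This code works here without any exception, so my guess is that you are not using the latest version.
But I also end up on this page.
I don't know how the page performs its browser check. Maybe you can take this sample code and watch the network traffic (using Charles or Fiddler) to find out what is going wrong. Alternatively, the check may be done on the client in plain JavaScript. If you think you have found the cause, feel free to open an issue on GitHub and I will try to make HtmlUnit more compatible.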