HtmlUnit (Java) - Quick Start - Headless Browser

Reposted by x33g5p2x on 2022-02-12 in HTML5

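Note: HtmlUnit still cannot scrape results from pages such as Baidu Translate, Baidu Search, or Tencent Translate, and it is largely ineffective against encrypted (obfuscated) JS files. For pages driven by complex or encrypted JS, Selenium is the recommended tool.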

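1. Overview

Official documentation: https://htmlunit.sourceforge.io/

A guide with concrete demos (works best alongside the official documentation): https://www.scrapingbee.com/java-webscraping-book/

Purpose: a "GUI-less browser for Java programs". It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser.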

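2. Notes

2.0 JavaScript parsing limitations

According to the official documentation, HtmlUnit's JavaScript support is tested against these libraries: htmx, jQuery, MochiKit, GWT, Sarissa, MooTools, Prototype, Ext, Dojo, and YUI. Pages that rely on encrypted (obfuscated) JS files or on other libraries are therefore quite likely to fail. This is why the encrypted JS behind Baidu Translate, Tencent Translate, and Youdao Translate cannot be scraped this way. Selenium (Java) is the suggested alternative: it is a heavier tool, but it works very well, because you simply crawl the rendered page without having to analyze the browser's requests.

2.1 Turning off HtmlUnit logging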

java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
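One caveat with the line above: java.util.logging holds loggers only weakly, so a level set on a logger that nothing references can be lost if that logger is garbage-collected. A minimal sketch that keeps a strong reference follows. The class name `QuietHtmlUnitLogging` is made up for illustration, and the sketch assumes HtmlUnit's output actually reaches java.util.logging (HtmlUnit logs through Apache Commons Logging, which may be bound to a different backend in your project).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // Strong reference so the configured logger cannot be garbage-collected.
    private static final Logger HTMLUNIT_LOGGER = Logger.getLogger("com.gargoylesoftware");

    /** Turns off java.util.logging output for everything below com.gargoylesoftware. */
    public static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers inherit the level from the parent namespace.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.WebClient");
        System.out.println("SEVERE loggable: " + child.isLoggable(Level.SEVERE));
    }
}
```

Calling `QuietHtmlUnitLogging.silence()` once before creating the first `WebClient` should be enough under these assumptions.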

3. Usage

Dependency: https://search.maven.org/artifact/net.sourceforge.htmlunit/htmlunit

<dependency>
  <groupId>net.sourceforge.htmlunit</groupId>
  <artifactId>htmlunit</artifactId>
  <version>2.58.0</version>
</dependency>

3.1 Scraping the ITHome weekly ranking - single page

Scrape the contents of the ITHome weekly ranking.

    /**
     * ITHome
     */
    @Test
    @SneakyThrows
    public void test10() {

        // Browser settings
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        // Open the page
        HtmlPage page = webClient.getPage("https://www.ithome.com/");

        // Hover the mouse over the weekly-ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();
        DomElement ulElement = page.getFirstByXPath("//div[@id='rank']//ul[@id='d-2']");

        // Weekly-ranking content
        System.out.println(ulElement.asNormalizedText());

    }

The scrape succeeds.
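The two XPath expressions used above can be tried in isolation. The sketch below evaluates them with the JDK's built-in `javax.xml.xpath` against a tiny hand-written fragment. The fragment is an invented stand-in for the real ITHome markup, which may differ or change, and the class name `RankXPathDemo` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RankXPathDemo {

    // Invented, well-formed stand-in for the ranking widget (not the real page).
    static final String SAMPLE =
        "<html><body><div id='rank'>"
        + "<ul><li data-id='1'>Daily</li><li data-id='2'>Weekly</li></ul>"
        + "<ul id='d-1'><li><a href='/a'>Daily story</a></li></ul>"
        + "<ul id='d-2'><li><a href='/b'>Weekly story</a></li></ul>"
        + "</div></body></html>";

    /** Returns the text of the first node matching the expression, or null if nothing matches. */
    public static String firstText(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText("//div[@id='rank']//li[@data-id='2']"));
        System.out.println(firstText("//div[@id='rank']//ul[@id='d-2']"));
        System.out.println(firstText("//div[@id='rank']//ul[@id='missing']"));
    }
}
```

HtmlUnit's `getFirstByXPath` likewise returns null when nothing matches, so a null check before `asNormalizedText()` avoids a `NullPointerException` if the site layout changes.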

3.2 Scraping the ninth article in the ITHome weekly ranking - two pages

    /**
     * Ninth article in the ITHome weekly ranking
     */
    @Test
    @SneakyThrows
    public void test11() {

        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        HtmlPage page = webClient.getPage("https://www.ithome.com/");

        // Hover the mouse over the weekly-ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();

        // Collect the article links
        List<DomElement> articleLinkElems = page.getByXPath("//div[@id='rank']//ul[@id='d-2']//a");
        // Index 8 needs at least nine links, so check the size rather than mere non-emptiness
        if (articleLinkElems.size() >= 9) {
            // Ninth article
            page = articleLinkElems.get(8).click();
            DomElement articleDivElem = page.getFirstByXPath("//div[@id='dt']//div[@class='fl content']");
            System.out.println(articleDivElem.asNormalizedText());
        }

    }

The scrape succeeds.
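Picking the ninth link by index assumes the list holds at least nine entries; otherwise `get(8)` throws `IndexOutOfBoundsException`. One way to make that assumption explicit is a small helper such as the one sketched below. The names `ListAccess` and `nth` are made up for this example and are not part of HtmlUnit or Hutool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ListAccess {

    /** Returns the element at the 1-based position, or an empty Optional if the list is null or too short. */
    public static <T> Optional<T> nth(List<T> items, int position) {
        if (items == null || position < 1 || position > items.size()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(position - 1));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("first", "second", "third");
        // Position 3 exists, position 9 does not.
        System.out.println(nth(links, 3).orElse("no such article"));
        System.out.println(nth(links, 9).orElse("no such article"));
    }
}
```

With such a helper the test could click the link only when `nth(articleLinkElems, 9)` is present and report a clear message otherwise.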

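3.3 Simulating user actions (in my opinion this feature is of very limited use: it only copes with very simple JS, whereas actions on a typical site trigger a chain of complex JS, so Selenium remains the recommendation for crawling)

Sample page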

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>

<body>

    <form id="form" onclick="return false;">
        <div class="container">

            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Username</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>

            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw"  id="psw" required>

            <button id="loginBtn" type="button">Log in</button>

        </div>

    </form>

    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Username 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>

            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw"  id="psw2" required>

            <button id="loginBtn2" type="submit">Log in 2</button>

        </div>

    </form>

</body>

<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function () {

        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login",$("#form").serialize(),responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            },"json")

            return false;
        }

        $("#loginBtn").click(loginOperation);

    })
</script>

</html>

Login endpoint code (Spring Boot). Note that the code below belongs to two separate files.

@Configuration
public class SystemConfig {
    // Allow cross-origin requests
    @Bean
    public CorsFilter corsFilter() {
        CorsConfiguration corsConfiguration = new CorsConfiguration();
        corsConfiguration.addAllowedOriginPattern("*");
        corsConfiguration.setAllowCredentials(true);
        corsConfiguration.addAllowedMethod("*");
        corsConfiguration.addAllowedHeader("*");
        UrlBasedCorsConfigurationSource configSource = new UrlBasedCorsConfigurationSource();
        configSource.registerCorsConfiguration("/**", corsConfiguration);
        return new CorsFilter(configSource);
    }
}


@Controller
@RequestMapping
@ResponseBody
public class LoginController {
    @PostMapping("login")
    public Map<String, Object> login(HttpServletRequest request) {
        // Echo the submitted parameters back, plus one extra field
        Map<String, Object> parameterMap = new HashMap<>(request.getParameterMap());
        parameterMap.put("name", "uh-huh*");
        return parameterMap;
    }

}

Simulating user form input

    /**
     * Simulate user input
     */
    @Test
    @SneakyThrows
    public void test12() {

        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        // Request submitted manually via ajax
        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");

        DomElement loginNameElem = page.getElementById("uname");
        loginNameElem.setAttribute("value", "root");
        DomElement passwordElem = page.getElementById("psw");
        passwordElem.setAttribute("value", "pswroot");

        // Submit the first form (id "form")
        DomElement startLoginBtnElem = page.getElementById("loginBtn");
        page = startLoginBtnElem.click();

        DomElement userInfoDivElem = page.getFirstByXPath("//h1");
        System.out.println(userInfoDivElem.asNormalizedText());

        //==================================================

        // Plain form submission: the response is JSON rather than an HTML page, so it arrives as an UnexpectedPage
        page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        HtmlInput inputloginNameElem = (HtmlInput) page.getElementById("uname2");
        inputloginNameElem.setAttribute("value", "root2");
        HtmlInput inputpasswordElem = (HtmlInput) page.getElementById("psw2");
        inputpasswordElem.setAttribute("value", "pswroot2");

        // Submit the second form (id "form2")
        HtmlForm enclosingForm = inputloginNameElem.getEnclosingForm();
        UnexpectedPage page2 = webClient.getPage(enclosingForm.getWebRequest(null));

        // Read the response body
        System.out.println(page2.getWebResponse().getContentAsString(UTF_8));
    }
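Both submissions above send the form fields as an `application/x-www-form-urlencoded` body, which is what `request.getParameterMap()` decodes on the server. As a rough illustration of that wire format, independent of HtmlUnit, the sketch below encodes a field map the same way. The class and method names are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBody {

    /** Joins the fields as name=value pairs separated by '&', URL-encoding names and values. */
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order, like the form does.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("uname", "root");
        fields.put("psw", "pswroot");
        System.out.println(encode(fields));
    }
}
```

Comparing such a string with what the endpoint echoes back can help when a scripted submission does not behave like the browser's.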

3.4 File download

<!DOCTYPE html>
<html lang="en">

	<head>
		<meta charset="UTF-8">
		<meta http-equiv="X-UA-Compatible" content="IE=edge">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>HtmlUnit test</title>
	</head>

	<body>

		<form id="form" onclick="return false;">
			<div class="container">

				<input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
				<label for="uname"><b>Username</b></label>
				<input type="text" placeholder="Enter Username" name="uname" id="uname" required>

				<label for="psw"><b>Password</b></label>
				<input type="password" placeholder="Enter Password" name="psw" id="psw" required>

				<button id="loginBtn" type="button">Log in</button>

			</div>

		</form>

		<form id="form2" method="post" action="http://127.0.0.1:8080/login">
			<div class="container">
				<input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
				<label for="uname2"><b>Username 2</b></label>
				<input type="text" placeholder="Enter Username" name="uname" id="uname2" required>

				<label for="psw2"><b>Password 2</b></label>
				<input type="password" placeholder="Enter Password" name="psw" id="psw2" required>

				<button id="loginBtn2" type="submit">Log in 2</button>
			</div>
		</form>
		
		
		
		<a href="http://127.0.0.1:8080/download" id="downloadBtn">Download link (current page)</a>
		<br/>
		<a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download link 2 (new page)</a>

	</body>

	<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
	<script>
		$(function() {

			// Log in
			function loginOperation() {
				$.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
					$("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
					$("form").hide();
				}, "json")

				return false;
			}

			$("#loginBtn").click(loginOperation);
			

		})
	</script>

</html>

File download endpoint

package work.linruchang.qq.htmlunitweb.controller;

import cn.hutool.core.util.StrUtil;
import lombok.SneakyThrows;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

/**
 * Purpose: download endpoint used by the HtmlUnit tests
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/09
 * @since 1.8
 **/
@Controller
@RequestMapping
@ResponseBody
public class HtmlUnitController {

    /**
     * File download test
     * http://127.0.0.1:8080/download
     * @param request
     * @param httpServletResponse
     * @return
     */
    @GetMapping("download")
    @SneakyThrows
    public ResponseEntity<FileSystemResource> download(HttpServletRequest request, HttpServletResponse httpServletResponse) {

        System.out.println(request.getSession().getId() + ": download started");

        FileSystemResource fileSystemResource = new FileSystemResource("E:\\微信\\文件\\WeChat Files\\wxid_n7xzf77wr3wv22\\FileStorage\\File\\2022-02\\房东符金瑞名下楼栋需要批量处理.xlsx");

        HttpHeaders headers = new HttpHeaders();
        headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
        // URL-encode the non-ASCII file name; the client decodes it again with URLDecoder
        headers.add("Content-Disposition", StrUtil.format("attachment; filename={}", URLEncoder.encode(fileSystemResource.getFilename(), "UTF-8")));
        headers.add("Pragma", "no-cache");
        headers.add("Expires", "0");

        return ResponseEntity.ok()
                .headers(headers)
                .contentLength(fileSystemResource.contentLength())
                .contentType(MediaType.parseMediaType("application/octet-stream"))
                .body(fileSystemResource);
    }

}
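One detail of the header built above: `URLEncoder` produces form encoding, so a space in the file name becomes `+` rather than `%20`. The test client in this section reverses that with `URLDecoder`, so the round trip works, but a browser would typically show the `+` literally. A possible variant, given here only as a sketch with invented names, percent-encodes the name and also emits the `filename*` parameter defined by RFC 6266.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ContentDisposition {

    /** Builds an attachment header value carrying both filename and filename* (UTF-8). */
    public static String attachment(String fileName) {
        String encoded = percentEncode(fileName);
        return "attachment; filename=" + encoded + "; filename*=UTF-8''" + encoded;
    }

    /** URL-encodes as UTF-8, then fixes the two characters URLEncoder treats differently. */
    static String percentEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8")
                    .replace("+", "%20")   // URLEncoder writes spaces as '+'
                    .replace("*", "%2A");  // '*' is left as-is by URLEncoder
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attachment("my report.xlsx"));
    }
}
```

Because the plain `filename=` parameter stays first and unquoted, a client that takes the second `;`-separated part and URL-decodes it should still recover the same name.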

Testing HtmlUnit's download feature

package work.linruchang.qq;

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.IoUtil;
import cn.hutool.core.lang.Console;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.host.event.KeyboardEvent;
import lombok.SneakyThrows;
import org.junit.Test;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URLDecoder;
import java.util.List;
import java.util.logging.Level;

import static java.nio.charset.StandardCharsets.UTF_8;

/**
 * Purpose: HtmlUnit download test
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/08
 * @since 1.8
 **/
public class HtmlUnitTest {

    @Test
    @SneakyThrows
    public void test13() {

        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        //DomElement downloadBtn = page.getElementById("downloadBtn");
        DomElement downloadBtn = page.getElementById("downloadBtn2");
        // Trigger the download link
        Page clickPage = downloadBtn.click();

        // The two statements below are equivalent
        //Page enclosedPage = webClient.getWebWindows().get(webClient.getWebWindows().size() - 1).getEnclosedPage();
        Page enclosedPage = clickPage.getEnclosingWindow().getEnclosedPage();

        // Extract the file name from the Content-Disposition header
        String responseHeaderValue = enclosedPage.getWebResponse().getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION);
        String documentName = responseHeaderValue.split(";")[1].split("=")[1].trim();
        documentName = URLDecoder.decode(documentName, "UTF-8");

        // Write the content to a local file; try-with-resources closes both streams
        try (InputStream contentAsStream = enclosedPage.getWebResponse().getContentAsStream();
             FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\Desktop\\图片\\" + documentName)) {
            IoUtil.copy(contentAsStream, out);
        }
        Console.log("File downloaded: {}", documentName);

    }

}
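The file-name extraction above relies on the header having exactly the shape `attachment; filename=<name>`; it would throw if the header were missing or the parameters came in a different order. A slightly more defensive version might look like the sketch below. The class name and the fallback parameter are invented, and the sketch handles only the plain `filename=` parameter, not every form RFC 6266 allows (for instance, it splits naively on `;`).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DispositionParser {

    private static final String KEY = "filename=";

    /** Extracts and URL-decodes the filename parameter, or returns fallback when it is absent or malformed. */
    public static String fileName(String contentDisposition, String fallback) {
        if (contentDisposition == null) {
            return fallback;
        }
        for (String part : contentDisposition.split(";")) {
            String trimmed = part.trim();
            // Case-insensitive prefix match; "filename*=" does not match and is skipped.
            if (!trimmed.regionMatches(true, 0, KEY, 0, KEY.length())) {
                continue;
            }
            String raw = trimmed.substring(KEY.length()).trim();
            // Strip optional surrounding double quotes.
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                raw = raw.substring(1, raw.length() - 1);
            }
            if (raw.isEmpty()) {
                return fallback;
            }
            try {
                return URLDecoder.decode(raw, "UTF-8");
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                // Malformed percent escapes: fall back instead of failing the download.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fileName("attachment; filename=my+report.xlsx", "download.bin"));
        System.out.println(fileName(null, "download.bin"));
    }
}
```

Since the name comes from the server, it is also worth stripping any path separators from it before appending it to a local directory.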

3.5 Handling dialogs

Sample page

<!DOCTYPE html>
<html lang="en">

	<head>
		<meta charset="UTF-8">
		<meta http-equiv="X-UA-Compatible" content="IE=edge">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>HtmlUnit test</title>
	</head>

	<body>

		<form id="form" onclick="return false;">
			<div class="container">

				<input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
				<label for="uname"><b>Username</b></label>
				<input type="text" placeholder="Enter Username" name="uname" id="uname" required>

				<label for="psw"><b>Password</b></label>
				<input type="password" placeholder="Enter Password" name="psw" id="psw" required>

				<button id="loginBtn" type="button">Log in</button>

			</div>

		</form>

		<form id="form2" method="post" action="http://127.0.0.1:8080/login">
			<div class="container">
				<input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
				<label for="uname2"><b>Username 2</b></label>
				<input type="text" placeholder="Enter Username" name="uname" id="uname2" required>

				<label for="psw2"><b>Password 2</b></label>
				<input type="password" placeholder="Enter Password" name="psw" id="psw2" required>

				<button id="loginBtn2" type="submit">Log in 2</button>
			</div>
		</form>
		
		
		
		<a href="http://127.0.0.1:8080/download" id="downloadBtn">Download link (current page)</a>
		<br/>
		<a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download link 2 (new page)</a>

		<br/>
		<button id="alertBtn">Show alert</button>
		
		<br/>
		<button id="promptBtn">Show prompt</button>
		
		<br/>
		<button id="confirmBtn">Show confirm</button>
		
	</body>

	<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
	<script>
		$(function() {
			var i = 0;
			$("#alertBtn").click(function() {
				alert("Alert triggered by click: #" + ++i)
			})
			
			var j = 0;
			$("#promptBtn").click(function() {
				prompt("Prompt triggered by click: #" + ++j, "default value 1111")
			})
			
			
			var k = 0;
			$("#confirmBtn").click(function() {
				confirm("Confirm triggered by click: #" + ++k)
			})
			
			
			// Log in
			function loginOperation() {
				$.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
					$("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
					$("form").hide();
				}, "json")

				return false;
			}

			$("#loginBtn").click(loginOperation);
			

		})
	</script>

</html>

Triggering the dialogs from HtmlUnit

@Test
    @SneakyThrows
    public void test15() {

        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        List<String> alertInfos = new ArrayList<>();
        webClient.setAlertHandler(new CollectingAlertHandler(alertInfos));
        
        // Prompt dialog handling
        final List<String> promptInfos = new ArrayList<>();
        webClient.setPromptHandler(new PromptHandler() {
            @Override
            public String handlePrompt(Page page, String message, String defaultValue) {
                Console.log("Prompt message: {}, default value: {}", message, defaultValue);
                promptInfos.add(message);
                return StrUtil.blankToDefault(message,defaultValue);
            }
        });

        // Confirm dialog handling
        final List<String> confirmInfos = new ArrayList<>();
        webClient.setConfirmHandler(new ConfirmHandler() {
            @Override
            public boolean handleConfirm(Page page, String message) {
                confirmInfos.add(message);
                // true = confirm (OK), false = cancel
                return true;
            }
        });


        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        DomElement alertBtn = page.getElementById("alertBtn");
        page = alertBtn.click();
        

        DomElement promptBtn = page.getElementById("promptBtn");
        page = promptBtn.click();
        page = promptBtn.click();

        DomElement confirmBtn = page.getElementById("confirmBtn");
        page = confirmBtn.click();
        page = confirmBtn.click();
        page = confirmBtn.click();

        Console.log("Alert messages: {}", alertInfos);
        Console.log("Prompt messages: {}", promptInfos);
        Console.log("Confirm messages: {}", confirmInfos);

    }
