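This article collects code examples for the Java method org.jsoup.parser.Parser.htmlParser() and shows how Parser.htmlParser() is used in practice. The examples are extracted from selected projects found on platforms such as GitHub, Stack Overflow, and Maven, so they reflect real-world usage. Details of Parser.htmlParser() are as follows: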
Fully qualified class: org.jsoup.parser.Parser
Class name: Parser
Method name: htmlParser
Description: Create a new HTML parser. This parser treats input as HTML5, and enforces the creation of a normalised document, based on a knowledge of the semantics of the incoming tags.
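To make the "normalised document" part of the description concrete, here is a minimal sketch, assuming jsoup is on the classpath. The class name HtmlParserNormalizeDemo and its helper methods are hypothetical and exist only for illustration. It feeds an incomplete fragment to the parser and checks that the html, head, and body elements are supplied and that the unclosed paragraphs become separate elements.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup.
public class HtmlParserNormalizeDemo {

    // Parses markup with the HTML parser; the empty string means "no base URI".
    static Document parse(String html) {
        return Parser.htmlParser().parseInput(html, "");
    }

    // Counts the <p> elements in the parsed document.
    static int paragraphCount(String html) {
        return parse(html).select("p").size();
    }

    // Reports whether the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String html) {
        Document doc = parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        // Neither <html>, <head>, <body> nor the closing </p> tags are present.
        String fragment = "<p>Hello<p>World";
        System.out.println(hasHeadAndBody(fragment));
        System.out.println(paragraphCount(fragment));
        // Print the normalised tree to see what the parser added.
        System.out.println(parse(fragment).outerHtml());
    }
}
```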
Code example source: org.jsoup/jsoup
/**
 * Loads a file to a Document.
 * @param in file to load
 * @param charsetName character set of input
 * @param baseUri base URI of document, to resolve relative links against
 * @return Document
 * @throws IOException on IO error
 */
public static Document load(File in, String charsetName, String baseUri) throws IOException {
    return parseInputStream(new FileInputStream(in), charsetName, baseUri, Parser.htmlParser());
}
Code example source: org.jsoup/jsoup
/**
 * Parses a Document from an input stream.
 * @param in input stream to parse. You will need to close it.
 * @param charsetName character set of input
 * @param baseUri base URI of document, to resolve relative links against
 * @return Document
 * @throws IOException on IO error
 */
public static Document load(InputStream in, String charsetName, String baseUri) throws IOException {
    return parseInputStream(in, charsetName, baseUri, Parser.htmlParser());
}
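The two load methods above are jsoup internals. From application code, a similar effect can be had through the public Jsoup.parse(InputStream, String, String, Parser) overload. The following is a minimal sketch under that assumption; the class StreamParseDemo and the method titleOf are hypothetical names. Because the caller is responsible for closing the stream, the sketch uses try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup.
public class StreamParseDemo {

    // Parses UTF-8 encoded HTML bytes from a stream and returns the document title.
    static String titleOf(byte[] utf8Html) {
        // The stream is closed by try-with-resources; jsoup does not close it for us.
        try (InputStream in = new ByteArrayInputStream(utf8Html)) {
            Document doc = Jsoup.parse(in, "UTF-8", "", Parser.htmlParser());
            return doc.title();
        } catch (IOException e) {
            // Wrapped so that callers of this sketch need no checked exception handling.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<title>Hi</title><p>body text".getBytes(StandardCharsets.UTF_8);
        System.out.println(titleOf(bytes));
    }
}
```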
Code example source: org.jsoup/jsoup
Request() {
    timeoutMilliseconds = 30000; // 30 seconds
    maxBodySizeBytes = 1024 * 1024; // 1MB
    followRedirects = true;
    data = new ArrayList<>();
    method = Method.GET;
    addHeader("Accept-Encoding", "gzip");
    addHeader(USER_AGENT, DEFAULT_UA);
    parser = Parser.htmlParser(); // HTML parsing is the default for new requests
}
Code example source: com.vaadin/vaadin-server
/**
 * Parses the given input stream into a jsoup document
 *
 * @param html
 *            the stream containing the design
 * @return the parsed jsoup document
 * @throws DesignException
 *             if the stream cannot be read or parsed
 */
private static Document parse(InputStream html) {
    try {
        Document doc = Jsoup.parse(html, UTF_8.name(), "",
                Parser.htmlParser());
        return doc;
    } catch (IOException e) {
        throw new DesignException("The html document cannot be parsed.");
    }
}
Code example source: rakam-io/rakam
Document parse = Jsoup.parse(content, "", Parser.htmlParser());
Code example source: fivesmallq/web-data-extractor
/**
 * Switches the parser to the HTML parser.
 *
 * @return this extractor, to allow method chaining
 */
public SelectorExtractor htmlParser() {
    this.parser = Parser.htmlParser();
    return this;
}
Code example source: com.norconex.collectors/norconex-importer
/**
 * Gets the JSoup parser associated with the string representation.
 * The string "xml" (case insensitive) will return the XML parser.
 * Anything else will return the HTML parser.
 * @param parser "html" or "xml"
 * @return JSoup parser
 * @since 2.8.0
 */
public static Parser toJSoupParser(String parser) {
    if ("xml".equalsIgnoreCase(parser)) {
        return Parser.xmlParser();
    }
    return Parser.htmlParser();
}
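The choice between the two parsers matters because only the HTML parser builds the html/head/body skeleton around the input. A minimal sketch of the difference follows, assuming jsoup is on the classpath; ParserChoiceDemo, pick, and bodyCount are hypothetical names that mirror the selection logic above rather than any published API.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup or norconex-importer.
public class ParserChoiceDemo {

    // Same rule as the method above: "xml" selects the XML parser, anything else HTML.
    static Parser pick(String name) {
        return "xml".equalsIgnoreCase(name) ? Parser.xmlParser() : Parser.htmlParser();
    }

    // Counts <body> elements after parsing the markup with the chosen parser.
    static int bodyCount(String parserName, String markup) {
        Document doc = pick(parserName).parseInput(markup, "");
        return doc.select("body").size();
    }

    public static void main(String[] args) {
        String markup = "<item>text</item>";
        // The HTML parser wraps the input in html/head/body; the XML parser keeps it as is.
        System.out.println("html parser body count: " + bodyCount("html", markup));
        System.out.println("xml parser body count: " + bodyCount("XML", markup));
    }
}
```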
Code example source: abola/CrawlerPack
/**
 * Converts HTML into a Jsoup Document object.
 *
 * The HTML content is parsed with Jsoup's native HTML parser.
 *
 * @param html Html document
 * @return org.jsoup.nodes.Document
 */
public org.jsoup.nodes.Document htmlToJsoupDoc(String html){
    // Convert html (html/html5) into a jsoup Document object.
    // Note: for String input, the second argument of Jsoup.parse is the base URI, not a charset.
    Document jsoupDoc = Jsoup.parse(html, "UTF-8", Parser.htmlParser() );
    jsoupDoc.charset(StandardCharsets.UTF_8);
    return jsoupDoc;
}
Code example source: addthis/hydra
Parser parser = Parser.htmlParser().setTrackErrors(0);
@Nonnull Document doc = parser.parseInput(html, "");
@Nonnull Elements tags = doc.select(tagName);
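The snippet above passes 0 to setTrackErrors, which turns parse error tracking off. For comparison, here is a minimal sketch of what a positive limit does, assuming jsoup is on the classpath; ErrorTrackingDemo and errorCount are hypothetical names, and the sample markup is made up for illustration.

```java
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup or hydra.
public class ErrorTrackingDemo {

    // Parses the markup and returns how many parse errors were recorded,
    // where maxErrors is the tracking limit (0 disables tracking).
    static int errorCount(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        return parser.getErrors().size();
    }

    public static void main(String[] args) {
        // A stray </div> with no matching open tag is a parse error.
        String malformed = "<p>One</div>";
        System.out.println("tracking off: " + errorCount(malformed, 0));
        System.out.println("tracking on:  " + errorCount(malformed, 10));
    }
}
```

Each Parser instance keeps its own error list, so creating a fresh parser per document, as the sketch does, avoids mixing errors from different inputs.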
Code example source: org.apache.any23/apache-any23-core
return Jsoup.parse(input, encoding, documentIRI, Parser.htmlParser());
Code example source: DigitalPebble/storm-crawler
/**
 * Attempt to find a META tag in the HTML that hints at the character set
 * used to write the document.
 */
private static String getCharsetFromMeta(byte[] buffer, int maxlength) {
    // convert to UTF-8 String -- which hopefully will not mess up the
    // characters we're interested in...
    int len = buffer.length;
    if (maxlength > 0 && maxlength < len) {
        len = maxlength;
    }
    String html = new String(buffer, 0, len, DEFAULT_CHARSET);
    Document doc = Parser.htmlParser().parseInput(html, "dummy");
    // look for <meta http-equiv="Content-Type"
    // content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
    Elements metaElements = doc
            .select("meta[http-equiv=content-type], meta[charset]");
    String foundCharset = null;
    for (Element meta : metaElements) {
        if (meta.hasAttr("http-equiv"))
            foundCharset = getCharsetFromContentType(meta.attr("content"));
        if (foundCharset == null && meta.hasAttr("charset"))
            foundCharset = meta.attr("charset");
        if (foundCharset != null)
            return foundCharset;
    }
    return foundCharset;
}
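The method above relies on a helper, getCharsetFromContentType, that is not shown. A self-contained sketch of the same idea follows, assuming jsoup is on the classpath. MetaCharsetDemo, sniff, and the CHARSET regular expression are hypothetical stand-ins, not the storm-crawler implementation, and the bytes are decoded as ISO-8859-1 here purely so that every byte maps to some character.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup or storm-crawler.
public class MetaCharsetDemo {

    // Assumed pattern for pulling "charset=..." out of a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a META tag, or null when none is declared.
    static String sniff(byte[] bytes) {
        String html = new String(bytes, StandardCharsets.ISO_8859_1);
        Document doc = Parser.htmlParser().parseInput(html, "");
        for (Element meta : doc.select("meta[http-equiv=content-type], meta[charset]")) {
            // HTML5 form: <meta charset="...">
            if (meta.hasAttr("charset")) {
                return meta.attr("charset");
            }
            // Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
            Matcher m = CHARSET.matcher(meta.attr("content"));
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        System.out.println(sniff(page.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```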
Code example source: DigitalPebble/storm-crawler
.decode(ByteBuffer.wrap(content)).toString();
jsoupDoc = Parser.htmlParser().parseInput(html, url);
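Passing the page URL as the second argument of parseInput, as the fragment above does, sets the document's base URI, which jsoup uses to resolve relative links. A minimal sketch of that effect follows, assuming jsoup is on the classpath; BaseUriDemo, firstLink, and the example.com URL are hypothetical and used only for illustration.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup or storm-crawler.
public class BaseUriDemo {

    // Returns the absolute URL of the first link, or "" when there is no link
    // or the link cannot be made absolute.
    static String firstLink(String html, String baseUri) {
        Document doc = Parser.htmlParser().parseInput(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a>";
        // With a base URI the relative href can be resolved.
        System.out.println(firstLink(html, "http://example.com/dir/page.html"));
        // Without one, absUrl has nothing to resolve against and returns an empty string.
        System.out.println(firstLink(html, "").isEmpty());
    }
}
```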
Code example source: DigitalPebble/storm-crawler
@Test
public void testExclusionCase() throws IOException {
    Config conf = new Config();
    conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "style");
    TextExtractor extractor = new TextExtractor(conf);
    String content = "<html>the<STYLE>main</STYLE>content of the page</html>";
    Document jsoupDoc = Parser.htmlParser().parseInput(content,
            "http://stormcrawler.net");
    String text = extractor.text(jsoupDoc.body());
    assertEquals("the content of the page", text);
}
Code example source: DigitalPebble/storm-crawler
@Test
public void testMainContent() throws IOException {
    Config conf = new Config();
    conf.put(TextExtractor.INCLUDE_PARAM_NAME, "DIV[id=\"maincontent\"]");
    TextExtractor extractor = new TextExtractor(conf);
    String content = "<html>the<div id='maincontent'>main<div>content</div></div>of the page</html>";
    Document jsoupDoc = Parser.htmlParser().parseInput(content,
            "http://stormcrawler.net");
    String text = extractor.text(jsoupDoc.body());
    assertEquals("main content", text);
}
Code example source: DigitalPebble/storm-crawler
@Test
public void testExclusion() throws IOException {
    Config conf = new Config();
    conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "STYLE");
    TextExtractor extractor = new TextExtractor(conf);
    String content = "<html>the<style>main</style>content of the page</html>";
    Document jsoupDoc = Parser.htmlParser().parseInput(content,
            "http://stormcrawler.net");
    String text = extractor.text(jsoupDoc.body());
    assertEquals("the content of the page", text);
}
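The tests above mix upper- and lower-case tag names between the configuration and the markup. One likely contributing factor is that the HTML parser lowercases tag names by default, and tag selectors are matched without regard to case; how TextExtractor itself handles case is not shown here. A minimal sketch of the parser side follows, assuming jsoup is on the classpath; CaseNormalizationDemo and its methods are hypothetical names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class, not part of jsoup or storm-crawler.
public class CaseNormalizationDemo {

    // Parses the markup with the default HTML parser settings.
    static Document parse(String html) {
        return Parser.htmlParser().parseInput(html, "http://stormcrawler.net");
    }

    // Returns the tag name jsoup stores for the first <div>, whatever case the input used.
    static String storedDivName(String html) {
        return parse(html).select("div").first().tagName();
    }

    // Counts how many elements the selector matches in the parsed markup.
    static int matches(String html, String selector) {
        return parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(storedDivName("<DIV>x</DIV>"));
        String content = "<html>the<STYLE>main</STYLE>content of the page</html>";
        System.out.println(matches(content, "style"));
        System.out.println(matches(content, "STYLE"));
    }
}
```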