This article collects code examples for the Java class org.htmlparser.Parser and shows how the Parser class is used in practice. The examples are taken from selected projects found on platforms such as GitHub, Stack Overflow and Maven, so they have real reference value and should be useful when working with the class. Details of the Parser class:
Package: org.htmlparser
Class name: Parser
The main parser class. This is the primary class of the HTML Parser library. It provides constructors that take a String (#Parser(String)), a URLConnection (#Parser(URLConnection)), or a Lexer (#Parser(Lexer)). In the case of a String, an attempt is made to open it as a URL, and if that fails it is assumed to be a local disk file. If you want to actually parse a String, use #setInputHTML after using the #Parser() constructor, or use #createParser.
The Parser provides access to the contents of the page via #elements(), #parse(NodeFilter) or #visitAllNodesWith(NodeVisitor).
Typical usage of the parser is: ``
Parser parser = new Parser ("http://whatever");
NodeList list = parser.parse ();
// do something with your list of nodes.
``
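As noted above, a String passed to the constructor is treated as a URL or a local file name; to parse markup that is already in memory you go through setInputHTML or the static createParser factory instead. A minimal, self-contained sketch of both routes (the markup string is made up for illustration):

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class InMemoryHtmlExample {
    public static void main(String[] args) throws ParserException {
        String html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>";

        // Route 1: static factory that treats its argument as markup, not as a URL.
        Parser byFactory = Parser.createParser(html, "UTF-8");
        NodeList fromFactory = byFactory.parse(null);

        // Route 2: no-argument constructor followed by setInputHTML.
        Parser byInput = new Parser();
        byInput.setInputHTML(html);
        NodeList fromInput = byInput.parse(null);

        System.out.println(fromFactory.size() + " / " + fromInput.size() + " top-level nodes");
    }
}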
What types of nodes and what can be done with them depends on the setup, but in general a node can be converted back to HTML, and its children (enclosed nodes) and parent can be obtained, because nodes are nested. See the Node interface.
For example, if the URL contains a simple page along these lines: ``
<html>
<head>
<title>Title</title>
</head>
<body>
Some text.
</body>
</html>
`` and the example code above is used, the list contains only one element, the <html> node. This node is a tag from the org.htmlparser.tags package, an object of class org.htmlparser.tags.Html if the default NodeFactory (a PrototypicalNodeFactory) is used.
To get at further content, the children of the top level nodes must be examined. When digging through a node list one must be conscious of the possibility of whitespace between nodes, e.g. in the example above: ``
Node node = list.elementAt (0);
NodeList sublist = node.getChildren ();
System.out.println (sublist.size ());
`` would print out 5, not 2, because there are newlines after <html>, </head> and </body> that are children of the <html> node, in addition to the <head> and <body> nodes.
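In practice, code that walks children therefore skips text nodes containing nothing but whitespace. A small sketch of that idea (class and method names are made up for illustration):

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ChildCounter {
    // Counts children of a node, ignoring whitespace-only text nodes.
    static int countNonWhitespaceChildren(Node node) {
        NodeList children = node.getChildren();
        int count = 0;
        for (int i = 0; children != null && i < children.size(); i++) {
            Node child = children.elementAt(i);
            boolean blank = child instanceof TextNode
                    && child.toPlainTextString().trim().length() == 0;
            if (!blank)
                count++;
        }
        return count;
    }

    public static void main(String[] args) throws ParserException {
        Parser parser = Parser.createParser(
                "<html>\n<head></head>\n<body></body>\n</html>", "UTF-8");
        Node html = parser.parse(null).elementAt(0);
        // Prints 2 (head and body), while getChildren().size() reports 5.
        System.out.println(countNonWhitespaceChildren(html));
    }
}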
Because processing nodes is so common, two interfaces are provided to ease this task: org.htmlparser.filters and org.htmlparser.visitors.
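A minimal sketch of both approaches, using a TagNameFilter from org.htmlparser.filters and a NodeVisitor from org.htmlparser.visitors (the markup is made up for illustration):

import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

public class FilterAndVisitorExample {
    public static void main(String[] args) throws ParserException {
        String html = "<html><body><a href=\"/a\">A</a><p>text</p><a href=\"/b\">B</a></body></html>";

        // Filter style: collect every <a> tag, however deeply it is nested.
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList links = parser.extractAllNodesThatMatch(new TagNameFilter("a"));
        System.out.println(links.size() + " links found");

        // Visitor style: receive a callback for every tag encountered.
        parser = Parser.createParser(html, "UTF-8");
        parser.visitAllNodesWith(new NodeVisitor() {
            public void visitTag(Tag tag) {
                System.out.println("visited tag: " + tag.getTagName());
            }
        });
    }
}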
Code example source: org.fitnesse/fitnesse
private NodeList parseHtml(String possibleTable) {
try {
Parser parser = new Parser(possibleTable);
return parser.parse(null);
} catch (ParserException | StringIndexOutOfBoundsException e) {
return null;
}
}
Code example source: org.opencms/org.opencms.workplace.tools.content
// Fragment: parse an in-memory HTML string with an explicit encoding by wiring up a Lexer
// and Page by hand, with a custom node factory, then walk the nodes using this object as the visitor.
Parser parser = new Parser();
parser.setNodeFactory(m_nodeFactory);
Lexer lexer = new Lexer();
Page page = new Page(html, encoding);
lexer.setPage(page);
parser.setLexer(lexer);
parser.visitAllNodesWith(this);
Code example source: oaqa/knn4qa
public PostCleaner(String html, int minCodeChars, boolean excludeCode) {
try {
Parser htmlParser = Parser.createParser(html, "utf8");
PostCleanerVisitor res = new PostCleanerVisitor(minCodeChars, excludeCode);
htmlParser.visitAllNodesWith(res);
mText = res.getText();
} catch (ParserException e) {
System.err.println(" Parser exception: " + e + " trying simple conversion");
// Plan B!!!
mText = PostCleanerVisitor.simpleProc(html);
}
}
Code example source: riotfamily/riot
public void parse() throws ParserException {
Parser parser = new Parser();
parser.setInputHTML(html);
nodes = parser.parse(null);
}
Code example source: org.htmlparser/htmlparser
/**
* Construct a parser using the provided lexer and feedback object.
* This would be used to create a parser for special cases where the
* normal creation of a lexer on a URLConnection needs to be customized.
* @param lexer The lexer to draw characters from.
* @param fb The object to use when information,
* warning and error messages are produced. If <em>null</em> no feedback
* is provided.
*/
public Parser (Lexer lexer, ParserFeedback fb)
{
setFeedback (fb);
setLexer (lexer);
setNodeFactory (new PrototypicalNodeFactory ());
}
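For completeness, a hedged sketch of how this constructor might be exercised with a hand-built Lexer and the library's Parser.STDOUT feedback object (the markup is made up for illustration):

import org.htmlparser.Parser;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.ParserException;

public class LexerConstructorExample {
    public static void main(String[] args) throws ParserException {
        // Build the lexer by hand, then hand it to the constructor shown above.
        Lexer lexer = new Lexer(new Page("<p>custom lexer input</p>"));
        Parser parser = new Parser(lexer, Parser.STDOUT);
        System.out.println(parser.parse(null).toHtml());
    }
}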
Code example source: com.bbossgroups.pdp/pdp-cms
/**
* Extract the text from a HTML page.<p>
*
* @param in the html content input stream
* @param encoding the encoding of the content
*
* @return the extracted text from the page
* @throws ParserException if the parsing of the HTML failed
* @throws UnsupportedEncodingException if the given encoding is not supported
*/
public static String extractText(InputStream in, String encoding)
throws ParserException, UnsupportedEncodingException {
Parser parser = new Parser();
Lexer lexer = new Lexer();
Page page = new Page(in, encoding);
lexer.setPage(page);
parser.setLexer(lexer);
StringBean stringBean = new StringBean();
parser.visitAllNodesWith(stringBean);
return stringBean.getStrings();
}
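As a side note, the StringBean used above can also be pointed directly at a URL and will fetch and parse the page itself; a minimal sketch (the URL is only a placeholder):

import org.htmlparser.beans.StringBean;

public class StringBeanExample {
    public static void main(String[] args) {
        StringBean bean = new StringBean();
        bean.setLinks(false);               // do not include link URLs in the output
        bean.setCollapse(true);             // collapse runs of whitespace
        bean.setURL("http://example.com");  // placeholder URL
        System.out.println(bean.getStrings());
    }
}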
Code example source: org.wso2.carbon.automationutils/org.wso2.carbon.integration.common.tests
public static List<String> getLinks(String url) throws ParserException {
Parser htmlParser = new Parser(url);
List<String> links = new LinkedList<String>();
NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
for (int m = 0; m < tagNodeList.size(); m++) {
LinkTag loopLinks = (LinkTag) tagNodeList.elementAt(m);
String linkName = loopLinks.getLink();
links.add(linkName);
}
return links;
}
Code example source: org.htmlparser/htmlparser
/**
* Create a FilterBean object.
*/
public FilterBean ()
{
mPropertySupport = new PropertyChangeSupport (this);
mParser = new Parser ();
mFilters = null;
mNodes = null;
mRecursive = true;
}
Code example source: org.htmlparser/htmlparser
// Fragment of a bean's setURL(...) (the guarding if is not shown): create the parser on first use
// or re-point the existing one at the new URL, then fire property-change notifications and refresh the extracted strings.
mParser = new Parser (url);
else
mParser.setURL (url);
mPropertySupport.firePropertyChange (
PROP_URL_PROPERTY, old, getURL ());
mPropertySupport.firePropertyChange (
PROP_CONNECTION_PROPERTY, conn, mParser.getConnection ());
setStrings ();
Code example source: org.htmlparser/htmlparser
// Companion fragment for setConnection(...): the same pattern as above, driven by a URLConnection instead of a URL string.
mParser = new Parser (connection);
else
mParser.setConnection (connection);
mPropertySupport.firePropertyChange (
PROP_URL_PROPERTY, url, getURL ());
mPropertySupport.firePropertyChange (
PROP_CONNECTION_PROPERTY, conn, mParser.getConnection ());
setStrings ();
Code example source: omegat-org/omegat
@Override
public void processFile(BufferedReader infile, BufferedWriter outfile, FilterContext fc) throws IOException,
TranslationException {
StringBuilder all = null;
try {
all = new StringBuilder();
char[] cbuf = new char[1000];
int len = -1;
while ((len = infile.read(cbuf)) > 0) {
all.append(cbuf, 0, len);
}
} catch (OutOfMemoryError e) {
// out of memory?
all = null;
System.gc();
throw new IOException(OStrings.getString("HHC__FILE_TOO_BIG"));
}
Parser parser = new Parser();
try {
parser.setInputHTML(all.toString());
parser.visitAllNodesWith(new HHCFilterVisitor(this, outfile));
} catch (ParserException pe) {
System.out.println(pe);
}
}
Code example source: com.bbossgroups/bboss-htmlparser
// Fragment of a command-line entry point: parse the URL given as the first argument,
// optionally restricted to the tag name given as the second, with progress reported to Parser.STDOUT.
try
parser = new Parser ();
if (1 < args.length)
filter = new TagNameFilter (args[1]);
parser.setFeedback (Parser.STDOUT);
getConnectionManager ().setMonitor (parser);
parser.setURL (args[0]);
System.out.println (parser.parse (filter));
Code example source: fhopf/akka-crawler-example
@Override
public PageContent fetchPageContent(String url) {
logger.debug("Fetching {}", url);
try {
Parser parser = new Parser(url);
PageContentVisitor visitor = new PageContentVisitor(baseUrl, url);
parser.visitAllNodesWith(visitor);
return visitor.getContent();
} catch (ParserException ex) {
throw new IllegalStateException(ex);
}
}
Code example source: eu.fbk.utils/utils-lsa
// Fragment: open a URLConnection yourself, hand it to the Parser, and collect every <P> tag before resetting the parser.
URLConnection con = url.openConnection();
parser = new Parser(con);
NodeList list = parser.extractAllNodesThatMatch(new TagNameFilter("P"));
parser.reset();
Code example source: org.htmlparser/htmlparser
// Fragment of the library's own command-line main(): print the version and usage text, then parse
// the resource named by the first argument, optionally filtered by a tag name, with redirection and cookie processing enabled.
System.out.println ("HTML Parser v" + getVersion () + "\n");
System.out.println ();
System.out.println ("Syntax : java -jar htmlparser.jar"
try
parser = new Parser ();
if (1 < args.length)
filter = new TagNameFilter (args[1]);
parser.setFeedback (Parser.STDOUT);
getConnectionManager ().setMonitor (parser);
getConnectionManager ().setRedirectionProcessingEnabled (true);
getConnectionManager ().setCookieProcessingEnabled (true);
parser.setResource (args[0]);
System.out.println (parser.parse (filter));
Code example source: ScienJus/pixiv-crawler
// Fragment: find <li class="image-item"> elements, falling back to a class value with a trailing space
// and finally to a bare li tag-name filter when nothing matches.
try {
List<String> items = new ArrayList<>();
Parser parser = new Parser(pageHtml);
NodeFilter filter = new AndFilter(new TagNameFilter("li"),new HasAttributeFilter("class","image-item"));
NodeList list = parser.parse(filter);
if (list.size() == 0) {
parser.reset();
filter = new AndFilter(new TagNameFilter("li"),new HasAttributeFilter("class","image-item "));
list = parser.parse(filter);
parser.reset();
filter = new TagNameFilter("li");
list = parser.parse(filter);
Code example source: org.htmlparser/htmlparser
/**
* Create a Parser Object having a String Object as input (instead of a url or a string representing the url location).
* <BR>The string will be parsed as it would be a file.
* @param input The string in input.
* @return The Parser Object with the string as input stream.
*/
public static Parser createParserParsingAnInputString (String input)
throws ParserException, UnsupportedEncodingException
{
Parser parser = new Parser();
Lexer lexer = new Lexer();
Page page = new Page(input);
lexer.setPage(page);
parser.setLexer(lexer);
return parser;
}
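A hedged usage sketch for the factory above, feeding it literal markup (the string is made up for illustration):

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

public class InputStringUsage {
    public static void main(String[] args) throws Exception {
        Parser parser = Parser.createParserParsingAnInputString("<p>Hello <b>world</b></p>");
        NodeList nodes = parser.parse(null);
        System.out.println(nodes.toHtml());
    }
}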
Code example source: edu.umd/cloud9
// Fragment: parse the fetched document, set the page's base URL on a BaseHrefTag,
// re-feed the re-serialized nodes to the parser, and extract those matching the filter.
parser.setInputHTML(doc.getContent()); // initializing the parser with the document content
NodeList nl = parser.parse(null);
BaseHrefTag baseTag = new BaseHrefTag();
baseTag.setBaseUrl(base);
parser.setInputHTML(nl.toHtml());
list = parser.extractAllNodesThatMatch(filter);
Code example source: deas/alfresco
// Two similar call sites from the same class: create a parser over the result string with UTF-8
// and an explicit PrototypicalNodeFactory, then hand the element iterator to processNodes().
Parser parser = Parser.createParser(result, "UTF-8");
PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
parser.setNodeFactory(factory);
NodeIterator itr = parser.elements();
processNodes(buf, itr, false, overrideDocumentType);
Parser parser = Parser.createParser(result, "UTF-8");
PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
parser.setNodeFactory(factory);
NodeIterator itr = parser.elements();
processNodes(buf, itr, true);
Code example source: org.alfresco/alfresco-repository
// Fragment: point a bean's internal parser at a new URL with the given encoding,
// fire the connection property-change notification, and refresh the extracted strings.
mParser = new Parser(newURL);
mParser.setURL(newURL);
mParser.setEncoding(encoding);
mPropertySupport.firePropertyChange(PROP_CONNECTION_PROPERTY, conn, mParser.getConnection());
setStrings();