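This article collects code examples for the Java method org.htmlparser.Parser.parse() and shows how Parser.parse() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven, and were extracted from selected open-source projects, so they are useful as reference material. Details of Parser.parse() are as follows: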
Package path: org.htmlparser.Parser
Class name: Parser
Method name: parse
Parse the given resource, using the filter provided. This can be used to extract information from specific nodes. When used with a null filter it returns an entire page, which can then be modified and converted back to HTML. (Note: the synthesis use case is not handled very well; the parser is more often used to extract information from a web page.)
For example, to replace the entire contents of the HEAD with a single TITLE tag you could do this:
NodeList nl = parser.parse (null); // here is your two node list
// search recursively: HEAD is normally nested inside HTML
NodeList heads = nl.extractAllNodesThatMatch (new TagNameFilter ("HEAD"), true);
if (heads.size () > 0) // there may not be a HEAD tag
{
    Tag head = (Tag) heads.elementAt (0); // there should be only one
    Tag title = new TitleTag ();
    title.setTagName ("title");
    title.setChildren (new NodeList (new TextNode ("The New Title")));
    Tag title_end = new TitleTag ();
    title_end.setTagName ("/title");
    title.setEndTag (title_end);
    head.setChildren (new NodeList (title)); // replace the old contents with the new TITLE
}
System.out.println (nl.toHtml ()); // output the modified HTML
Code example source: com.rogiel.httpchannel/httpchannel-util
private HTMLPage(Parser parser) throws ParserException {
    // A null filter returns every top-level node of the page.
    this.nodes = parser.parse(null);
}
Code example source: org.fitnesse/fitnesse
private NodeList parseHtml(String possibleTable) {
    try {
        Parser parser = new Parser(possibleTable);
        return parser.parse(null);
    } catch (ParserException | StringIndexOutOfBoundsException e) {
        // Input that cannot be parsed is reported to the caller as null.
        return null;
    }
}
Code example source: com.github.tcnh/fitnesse
private NodeList parseHtml(String possibleTable) {
    try {
        Parser parser = new Parser(possibleTable);
        return parser.parse(null);
    } catch (ParserException e) {
        // Input that cannot be parsed is reported to the caller as null.
        return null;
    }
}
Code example source: riotfamily/riot
public void parse() throws ParserException {
    Parser parser = new Parser();
    parser.setInputHTML(html);
    nodes = parser.parse(null);
}
Code example source: org.apache.uima/ruta-ep-ide-ui
private void fillMap(String documentationFile) throws IOException {
    InputStream resourceAsStream = getClass().getResourceAsStream(documentationFile);
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(resourceAsStream));
        StringBuilder sb = new StringBuilder();
        // Read the whole documentation file into memory, one line at a time.
        while (true) {
            String line;
            line = reader.readLine();
            if (line == null) {
                break;
            }
            sb.append(line + "\n");
        }
        String document = sb.toString();
        Parser parser = new Parser(document);
        NodeList list = parser.parse(null);
        HtmlDocumentationVisitor visitor = new HtmlDocumentationVisitor(document);
        list.visitAllNodesWith(visitor);
        map.putAll(visitor.getMap());
    } catch (Exception e) {
        RutaIdeUIPlugin.error(e);
    }
}
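The read loop in the snippet above never closes the stream and relies on the platform default charset. A minimal sketch of the same read step using try-with-resources and an explicit charset might look like the following. It assumes the documentation file is UTF-8 encoded, which the original does not state, and the class and method names (ResourceText, readAll) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ResourceText {
    /** Reads the whole stream as text and ends every line with "\n", like the loop above. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Closing the reader also closes the underlying stream.
        // UTF-8 is an assumption; the original uses the platform default charset.
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for getResourceAsStream(documentationFile).
        InputStream in = new ByteArrayInputStream(
                "<p>a</p>\r\n<p>b</p>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

The resulting string could then be passed to new Parser(document) exactly as in the snippet. Note that getResourceAsStream returns null when the resource is missing, so a caller would probably want to check for that before reading.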
Code example source: xuyisheng/TextViewForFullHtml
public static String parseFontHTML(String content) {
    hasData = false;
    Parser parser = Parser.createParser(content, "UTF-8");
    StringBuilder sb = null;
    try {
        NodeList list = (NodeList) parser.parse(null);
        if (hasFont(list)) {
            sb = getNewHtml(list);
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    // Fall back to the original content when no FONT tag was rewritten.
    if (sb == null) {
        return content;
    }
    // Collapse doubled or tripled closing FONT tags produced by the rewrite.
    return sb.toString().replace("</FONT></FONT></FONT>", "</FONT>").replace("</FONT></FONT>", "</FONT>");
}
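The two chained replace() calls at the end of the snippet above handle runs of up to four adjacent closing FONT tags, but a run of five still leaves a pair behind. As a hedged alternative, a single regular expression can collapse a run of any length. The sketch below uses only java.lang.String; the class and method names are hypothetical, and whether collapsing every run is the right behaviour depends on what getNewHtml produces, which the source does not show.

```java
final class FontTagCollapser {
    /** Collapses any run of two or more adjacent closing FONT tags into a single one. */
    static String collapseClosingFontTags(String html) {
        return html.replaceAll("(?:</FONT>){2,}", "</FONT>");
    }

    public static void main(String[] args) {
        String five = "<FONT>x</FONT></FONT></FONT></FONT></FONT>";
        // The chained replace() calls from the snippet, applied to a run of five.
        String chained = five.replace("</FONT></FONT></FONT>", "</FONT>")
                             .replace("</FONT></FONT>", "</FONT>");
        System.out.println(chained);
        // The regex version, applied to the same input.
        System.out.println(collapseClosingFontTags(five));
    }
}
```

Like the original, the match is case-sensitive, so lower-case closing tags are left untouched.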
Code example source: org.htmlparser/htmlparser
/**
 * Apply each of the filters.
 * The first filter is applied to the output of the parser.
 * Subsequent filters are applied to the output of the prior filter.
 * @return A list of nodes passed through all filters.
 * If there are no filters, returns the entire page.
 * @throws ParserException If an encoding change occurs
 * or there is some other problem.
 */
protected NodeList applyFilters ()
    throws
        ParserException
{
    NodeFilter[] filters;
    NodeList ret;

    ret = mParser.parse (null);
    filters = getFilters ();
    if (null != filters)
        for (int i = 0; i < filters.length; i++)
            ret = ret.extractAllNodesThatMatch (filters[i], mRecursive);

    return (ret);
}
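The chaining described in the Javadoc above can be illustrated without the library. The sketch below is only an analogy: java.util.function.Predicate stands in for NodeFilter and a list of strings stands in for NodeList (both are assumptions made for illustration), and it does not model the recursive descent into child nodes controlled by mRecursive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class FilterChainSketch {
    /** Applies each filter to the output of the previous one; with no filters the input is returned. */
    static List<String> applyFilters(List<String> parsed, List<Predicate<String>> filters) {
        List<String> ret = parsed;
        for (Predicate<String> filter : filters) {
            List<String> next = new ArrayList<>();
            for (String node : ret) {
                if (filter.test(node)) {
                    next.add(node);
                }
            }
            ret = next;
        }
        return ret;
    }

    public static void main(String[] args) {
        // Strings of the form "tag.class" stand in for parsed nodes.
        List<String> nodes = Arrays.asList("div.item", "div.other", "img.item");
        List<Predicate<String>> filters = Arrays.asList(
                s -> s.startsWith("div"),
                s -> s.endsWith(".item"));
        System.out.println(applyFilters(nodes, filters));
        System.out.println(applyFilters(nodes, Collections.emptyList()));
    }
}
```

Each filter narrows the previous result, so the order of the filters does not change which nodes survive, only how much work the later filters have to do.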
Code example source: CloudSlang/cs-actions
private void processHTMLBodyWithBASE64Images(MimeMultipart multipart) throws ParserException,
        MessagingException, NoSuchAlgorithmException, SMIMEException, java.security.NoSuchProviderException {
    if (null != body && body.contains("base64")) {
        Parser parser = new Parser(body);
        NodeList nodeList = parser.parse(null);
        HtmlImageNodeVisitor htmlImageNodeVisitor = new HtmlImageNodeVisitor();
        nodeList.visitAllNodesWith(htmlImageNodeVisitor);
        // Serialize the visited tree back to HTML.
        body = nodeList.toHtml();
        addAllBase64ImagesToMimeMultipart(multipart, htmlImageNodeVisitor.getBase64Images());
    }
}
Code example source: com.github.tcnh/fitnesse
public HtmlTableScanner(String page) {
    if (page == null || page.equals(""))
        page = "<i>This page intentionally left blank.</i>";
    NodeList htmlTree;
    try {
        Parser parser = new Parser(new Lexer(new Page(page)));
        htmlTree = parser.parse(null);
    } catch (ParserException e) {
        throw new SlimError(e);
    }
    scanForTables(htmlTree);
}
Code example source: org.fitnesse/fitnesse
public HtmlTableScanner(String page) {
    if (page == null || page.equals(""))
        page = "<i>This page intentionally left blank.</i>";
    NodeList htmlTree;
    try {
        Parser parser = new Parser(new Lexer(new Page(page)));
        htmlTree = parser.parse(null);
    } catch (ParserException e) {
        throw new SlimError(e);
    }
    scanForTables(htmlTree);
}
Code example source: ScienJus/pixiv-crawler
/**
 * Extract multiple images.
 * @param pageHtml
 * @return
 */
public List<String> parseManga(String pageHtml) {
    try {
        List<String> result = new ArrayList<String>();
        Parser parser = new Parser(pageHtml);
        // Match <div class="item-container"> elements.
        NodeFilter filter = new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "item-container"));
        NodeList list = parser.parse(filter);
        for (int i = 0; i < list.size(); i++) {
            Node item = list.elementAt(i);
            result.add(((ImageTag) item.getChildren().elementAt(2)).getAttribute("data-src"));
        }
        return result;
    } catch (ParserException e) {
        logger.error(e.getMessage());
    }
    return null;
}
Code example source: org.apache.uima/ruta-core
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
    String documentText = jcas.getDocumentText();
    List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
    List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
    try {
        Parser parser = new Parser(documentText);
        NodeList list = parser.parse(null);
        HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
        list.visitAllNodesWith(visitor);
        annotations = visitor.getAnnotations();
        annotationStack = visitor.getAnnotationStack();
    } catch (ParserException e) {
        throw new AnalysisEngineProcessException(e);
    }
    // Index only annotations that cover at least one character.
    for (AnnotationFS each : annotations) {
        if (each.getBegin() < each.getEnd()) {
            jcas.addFsToIndexes(each);
        }
    }
    for (AnnotationFS each : annotationStack) {
        if (each.getBegin() < each.getEnd()) {
            jcas.addFsToIndexes(each);
        }
    }
}
Code example source: org.fitnesse/fitnesse
private NodeList getMatchingTags(NodeFilter filter) throws Exception {
    String html = examiner.html();
    Parser parser = new Parser(new Lexer(new Page(html)));
    NodeList list = parser.parse(null);
    // The second argument (true) makes the search recurse into child nodes.
    NodeList matches = list.extractAllNodesThatMatch(filter, true);
    return matches;
}
Code example source: org.fitnesse/fitnesse
private NodeList makeNodeList(TestPage pageToTest) {
    String html = pageToTest.getHtml();
    Parser parser = new Parser(new Lexer(new Page(html)));
    try {
        return parser.parse(null);
    } catch (ParserException e) {
        throw new SlimError(e);
    }
}
Code example source: com.bbossgroups/bboss-htmlparser
/**
 * Apply each of the filters.
 * The first filter is applied to the parser.
 * Subsequent filters are applied to the output of the prior filter.
 * @return A list of nodes passed through all filters.
 * @throws ParserException If an encoding change occurs
 * or there is some other problem.
 */
protected NodeList applyFilters ()
    throws
        ParserException
{
    NodeList ret;

    ret = new NodeList ();

    if (null != getFilters ())
        for (int i = 0; i < getFilters ().length; i++)
            if (0 == i)
                ret = mParser.parse (getFilters ()[0]);
            else
                ret = ret.extractAllNodesThatMatch (getFilters ()[i]);

    return (ret);
}
Code example source: org.apache.uima/textmarker-core
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
    String documentText = jcas.getDocumentText();
    List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
    List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
    try {
        Parser parser = new Parser(documentText);
        NodeList list = parser.parse(null);
        HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
        list.visitAllNodesWith(visitor);
        annotations = visitor.getAnnotations();
        annotationStack = visitor.getAnnotationStack();
    } catch (ParserException e) {
        throw new AnalysisEngineProcessException(e);
    }
    // Index only annotations that cover at least one character.
    for (AnnotationFS each : annotations) {
        if (each.getBegin() < each.getEnd()) {
            jcas.addFsToIndexes(each);
        }
    }
    for (AnnotationFS each : annotationStack) {
        if (each.getBegin() < each.getEnd()) {
            jcas.addFsToIndexes(each);
        }
    }
}
Code example source: com.github.tcnh/fitnesse
private NodeList makeNodeList(TestPage pageToTest) {
    String html = pageToTest.getHtml();
    Parser parser = new Parser(new Lexer(new Page(html)));
    try {
        return parser.parse(null);
    } catch (ParserException e) {
        throw new SlimError(e);
    }
}
Code example source: com.github.tcnh/fitnesse
private NodeList getMatchingTags(NodeFilter filter) throws Exception {
    String html = examiner.html();
    Parser parser = new Parser(new Lexer(new Page(html)));
    NodeList list = parser.parse(null);
    // The second argument (true) makes the search recurse into child nodes.
    NodeList matches = list.extractAllNodesThatMatch(filter, true);
    return matches;
}
Code example source: ScienJus/pixiv-crawler
/**
 * Extract a single image.
 * @param pageHtml
 * @return
 */
public String parseMedium(String pageHtml) {
    try {
        Parser parser = new Parser(pageHtml);
        // Match <img class="original-image"> elements.
        NodeFilter filter = new AndFilter(new TagNameFilter("img"), new HasAttributeFilter("class", "original-image"));
        NodeList list = parser.parse(filter);
        if (list.size() > 0) {
            return ((ImageTag) list.elementAt(0)).getAttribute("data-src");
        }
    } catch (ParserException e) {
        logger.error(e.getMessage());
    }
    return null;
}
Code example source: ScienJus/pixiv-crawler
/**
 * Find the URL of the next page in the search results list.
 * @param pageHtml
 * @return
 */
public String parseNextPage(String pageHtml) {
    try {
        Parser parser = new Parser(pageHtml);
        // Match <a rel="next"> elements.
        NodeFilter filter = new AndFilter(new TagNameFilter("a"), new HasAttributeFilter("rel", "next"));
        NodeList list = parser.parse(filter);
        if (list.size() > 0) {
            return ((LinkTag) list.elementAt(0)).getLink();
        }
    } catch (ParserException e) {
        logger.error(e.getMessage());
    }
    return null;
}