Usage and code examples of the org.htmlparser.Parser.parse() method

x33g5p2x, reposted on 2022-01-26 in Other

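This article collects code examples for the Java method org.htmlparser.Parser.parse() that show how Parser.parse() is used in practice. The examples come mainly from platforms such as GitHub, Stack Overflow, and Maven, and were extracted from selected projects, so they should serve as useful references. Details of the Parser.parse() method:
Package path: org.htmlparser.Parser
Class name: Parser
Method name: parse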

Introduction to Parser.parse

Parse the given resource, using the filter provided. This can be used to extract information from specific nodes. When used with a null filter it returns an entire page which can then be modified and converted back to HTML (Note: the synthesis use-case is not handled very well; the parser is more often used to extract information from a web page).

For example, to replace the entire contents of the HEAD with a single TITLE tag you could do this:

NodeList nl = parser.parse (null); // the top-level nodes of the page
// search recursively, since HEAD is normally nested inside HTML
NodeList heads = nl.extractAllNodesThatMatch (new TagNameFilter ("HEAD"), true);
if (heads.size () > 0) // there may not be a HEAD tag
{
    Node head = heads.elementAt (0); // there should be only one
    Tag title = new TitleTag ();
    title.setTagName ("title");
    title.setChildren (new NodeList (new TextNode ("The New Title")));
    Tag title_end = new TitleTag ();
    title_end.setTagName ("/title");
    title.setEndTag (title_end);
    head.setChildren (new NodeList (title)); // replace the old contents with the new TITLE
}
System.out.println (nl.toHtml ()); // output the modified HTML
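
The difference between a null filter and a real filter can be illustrated without the library. The following is only a minimal sketch on a simplified node tree: `SimpleNode`, `parse`, `collectInto`, and `samplePage` are hypothetical names, not htmlparser classes, and the sketch assumes that a non-null filter is also tested against nested nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Simplified stand-in for a parsed page; not an htmlparser class. */
final class ParseFilterSketch {

    /** Hypothetical node type: a tag name plus child nodes. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children;

        SimpleNode(String name, SimpleNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Models the contract described above: a null filter returns the
     * top-level nodes unchanged; otherwise every matching node is
     * collected in document order (assumption: at any depth).
     */
    static List<SimpleNode> parse(List<SimpleNode> topLevel, Predicate<SimpleNode> filter) {
        if (filter == null) {
            return new ArrayList<>(topLevel);
        }
        List<SimpleNode> matches = new ArrayList<>();
        for (SimpleNode node : topLevel) {
            collectInto(node, filter, matches);
        }
        return matches;
    }

    /** Tests the node itself, then descends into its children. */
    private static void collectInto(SimpleNode node, Predicate<SimpleNode> filter, List<SimpleNode> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (SimpleNode child : node.children) {
            collectInto(child, filter, out);
        }
    }

    /** A small page with two top-level nodes: a DOCTYPE-like node and an HTML tree. */
    static List<SimpleNode> samplePage() {
        SimpleNode head = new SimpleNode("HEAD", new SimpleNode("TITLE"));
        SimpleNode body = new SimpleNode("BODY",
                new SimpleNode("A"),
                new SimpleNode("P", new SimpleNode("A")));
        return Arrays.asList(new SimpleNode("!DOCTYPE"), new SimpleNode("HTML", head, body));
    }

    public static void main(String[] args) {
        List<SimpleNode> page = samplePage();
        // Null filter: only the top-level nodes are returned.
        System.out.println(parse(page, null).size());
        // Name filter: both A nodes are found, including the nested one.
        System.out.println(parse(page, n -> n.name.equals("A")).size());
    }
}
```

With the null filter the caller receives the whole tree and is responsible for walking it, which is why many of the examples in this article follow `parse(null)` with a visitor or a recursive `extractAllNodesThatMatch` call.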

Code examples

Code example source: com.rogiel.httpchannel/httpchannel-util

private HTMLPage(Parser parser) throws ParserException {
  this.nodes = parser.parse(null);
}

Code example source: org.fitnesse/fitnesse

private NodeList parseHtml(String possibleTable) {
 try {
  Parser parser = new Parser(possibleTable);
  return parser.parse(null);
 } catch (ParserException | StringIndexOutOfBoundsException e) {
  return null;
 }
}

Code example source: com.github.tcnh/fitnesse

private NodeList parseHtml(String possibleTable) {
 try {
  Parser parser = new Parser(possibleTable);
  return parser.parse(null);
 } catch (ParserException e) {
  return null;
 }
}

Code example source: riotfamily/riot

public void parse() throws ParserException {
  Parser parser = new Parser();
  parser.setInputHTML(html);
  nodes = parser.parse(null);
}

Code example source: org.apache.uima/ruta-ep-ide-ui

private void fillMap(String documentationFile) throws IOException {
  InputStream resourceAsStream = getClass().getResourceAsStream(documentationFile);
  // try-with-resources closes the reader (and the underlying stream) when done
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(resourceAsStream))) {
   StringBuilder sb = new StringBuilder();
   String line;
   while ((line = reader.readLine()) != null) {
    sb.append(line).append('\n');
   }

   String document = sb.toString();

   // A null filter returns the whole parsed document
   Parser parser = new Parser(document);
   NodeList list = parser.parse(null);
   HtmlDocumentationVisitor visitor = new HtmlDocumentationVisitor(document);
   list.visitAllNodesWith(visitor);
   map.putAll(visitor.getMap());
  } catch (Exception e) {
   RutaIdeUIPlugin.error(e);
  }
}

Code example source: xuyisheng/TextViewForFullHtml

public static String parseFontHTML(String content) {
  hasData = false;
  Parser parser = Parser.createParser(content, "UTF-8");
  StringBuilder sb = null;
  try {
    NodeList list = (NodeList) parser.parse(null);
    if (hasFont(list)) {
      sb = getNewHtml(list);
    }
  } catch (ParserException e) {
    e.printStackTrace();
  }
  if (sb == null) {
    return content;
  }
  return sb.toString().replace("</FONT></FONT></FONT>", "</FONT>").replace("</FONT></FONT>", "</FONT>");
}

Code example source: org.htmlparser/htmlparser

/**
 * Apply each of the filters.
 * The first filter is applied to the output of the parser.
 * Subsequent filters are applied to the output of the prior filter.
 * @return A list of nodes passed through all filters.
 * If there are no filters, returns the entire page.
 * @throws ParserException If an encoding change occurs
 * or there is some other problem.
 */
protected NodeList applyFilters ()
  throws
    ParserException
{
  NodeFilter[] filters;
  NodeList ret;
  ret = mParser.parse (null);
  filters = getFilters ();
  if (null != filters)
    for (int i = 0; i < filters.length; i++)
      ret = ret.extractAllNodesThatMatch (filters[i], mRecursive);
  return (ret);
}
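
The chaining described in the Javadoc above, where each filter runs on the output of the previous one, can be modeled in a few lines. This is a library-independent sketch, not the htmlparser implementation: `FilterChainSketch`, `extract`, and the sample strings are hypothetical, plain `Predicate` objects stand in for `NodeFilter`, and the input is a flat list, so the recursive flag used above has no counterpart here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Library-independent sketch of chained filtering; all names are hypothetical. */
final class FilterChainSketch {

    /** One extraction pass: keeps only the items accepted by the filter. */
    static <T> List<T> extract(List<T> items, Predicate<T> filter) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            if (filter.test(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    /**
     * Applies the filters in order, feeding each pass the output of the
     * previous one. A null or empty filter list returns the input as-is,
     * mirroring "if there are no filters, returns the entire page".
     */
    static <T> List<T> applyFilters(List<T> parsed, List<Predicate<T>> filters) {
        List<T> result = parsed;
        if (filters != null) {
            for (Predicate<T> filter : filters) {
                result = extract(result, filter);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Strings of the form "tag.class" stand in for parsed nodes.
        List<String> tags = Arrays.asList("div.item", "img.thumb", "div.ad", "a.next");
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(tag -> tag.startsWith("div"));   // first pass: div tags only
        filters.add(tag -> tag.endsWith(".item"));   // second pass: class "item" only
        System.out.println(applyFilters(tags, filters)); // prints [div.item]
    }
}
```

On a flat list, chaining filters this way gives the same result as combining them with a logical AND, which is what the `AndFilter` examples later in this article do in a single pass.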

Code example source: CloudSlang/cs-actions

private void processHTMLBodyWithBASE64Images(MimeMultipart multipart) throws ParserException,
    MessagingException, NoSuchAlgorithmException, SMIMEException, java.security.NoSuchProviderException {
  if (null != body && body.contains("base64")) {
    Parser parser = new Parser(body);
    NodeList nodeList = parser.parse(null);
    HtmlImageNodeVisitor htmlImageNodeVisitor = new HtmlImageNodeVisitor();
    nodeList.visitAllNodesWith(htmlImageNodeVisitor);
    body = nodeList.toHtml();
    addAllBase64ImagesToMimeMultipart(multipart, htmlImageNodeVisitor.getBase64Images());
  }
}

Code example source: com.github.tcnh/fitnesse

public HtmlTableScanner(String page) {
 if (page == null || page.equals(""))
  page = "<i>This page intentionally left blank.</i>";
 NodeList htmlTree;
 try {
  Parser parser = new Parser(new Lexer(new Page(page)));
  htmlTree = parser.parse(null);
 } catch (ParserException e) {
  throw new SlimError(e);
 }
 scanForTables(htmlTree);
}

Code example source: org.fitnesse/fitnesse

public HtmlTableScanner(String page) {
 if (page == null || page.equals(""))
  page = "<i>This page intentionally left blank.</i>";
 NodeList htmlTree;
 try {
  Parser parser = new Parser(new Lexer(new Page(page)));
  htmlTree = parser.parse(null);
 } catch (ParserException e) {
  throw new SlimError(e);
 }
 scanForTables(htmlTree);
}

Code example source: ScienJus/pixiv-crawler

/**
 * Extract multiple images
 * @param pageHtml
 * @return
 */
public List<String> parseManga(String pageHtml) {
  try {
    List<String> result = new ArrayList<String>();
    Parser parser = new Parser(pageHtml);
    NodeFilter filter = new AndFilter(new TagNameFilter("div"),new HasAttributeFilter("class","item-container"));
    NodeList list = parser.parse(filter);
    for (int i = 0; i < list.size(); i++) {
      Node item = list.elementAt(i);
      result.add(((ImageTag) item.getChildren().elementAt(2)).getAttribute("data-src"));
    }
    return result;
  } catch (ParserException e) {
    logger.error(e.getMessage());
  }
  return null;
}

Code example source: org.apache.uima/ruta-core

@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
 String documentText = jcas.getDocumentText();
 List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
 List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
 try {
  Parser parser = new Parser(documentText);
  NodeList list = parser.parse(null);
  HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
  list.visitAllNodesWith(visitor);
  annotations = visitor.getAnnotations();
  annotationStack = visitor.getAnnotationStack();
 } catch (ParserException e) {
  throw new AnalysisEngineProcessException(e);
 }
 for (AnnotationFS each : annotations) {
  if (each.getBegin() < each.getEnd()) {
   jcas.addFsToIndexes(each);
  }
 }
 for (AnnotationFS each : annotationStack) {
  if (each.getBegin() < each.getEnd()) {
   jcas.addFsToIndexes(each);
  }
 }
}

Code example source: org.fitnesse/fitnesse

private NodeList getMatchingTags(NodeFilter filter) throws Exception {
 String html = examiner.html();
 Parser parser = new Parser(new Lexer(new Page(html)));
 NodeList list = parser.parse(null);
 NodeList matches = list.extractAllNodesThatMatch(filter, true);
 return matches;
}

Code example source: org.fitnesse/fitnesse

private NodeList makeNodeList(TestPage pageToTest) {
 String html = pageToTest.getHtml();
 Parser parser = new Parser(new Lexer(new Page(html)));
 try {
  return parser.parse(null);
 } catch (ParserException e) {
  throw new SlimError(e);
 }
}

Code example source: com.bbossgroups/bboss-htmlparser

/**
 * Apply each of the filters.
 * The first filter is applied to the parser.
 * Subsequent filters are applied to the output of the prior filter.
 * @return A list of nodes passed through all filters.
 * @throws ParserException If an encoding change occurs
 * or there is some other problem.
 */
protected NodeList applyFilters ()
  throws
    ParserException
{
  NodeList ret;
  ret = new NodeList ();
  if (null != getFilters ())
    for (int i = 0; i < getFilters ().length; i++)
      if (0 == i)
        ret = mParser.parse (getFilters ()[0]);
      else
        ret = ret.extractAllNodesThatMatch (getFilters ()[i]);
  return (ret);
}

Code example source: org.apache.uima/textmarker-core

@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
 String documentText = jcas.getDocumentText();
 List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
 List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
 try {
  Parser parser = new Parser(documentText);
  NodeList list = parser.parse(null);
  HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
  list.visitAllNodesWith(visitor);
  annotations = visitor.getAnnotations();
  annotationStack = visitor.getAnnotationStack();
 } catch (ParserException e) {
  throw new AnalysisEngineProcessException(e);
 }
 for (AnnotationFS each : annotations) {
  if (each.getBegin() < each.getEnd()) {
   jcas.addFsToIndexes(each);
  }
 }
 for (AnnotationFS each : annotationStack) {
  if (each.getBegin() < each.getEnd()) {
   jcas.addFsToIndexes(each);
  }
 }
}

Code example source: com.github.tcnh/fitnesse

private NodeList makeNodeList(TestPage pageToTest) {
 String html = pageToTest.getHtml();
 Parser parser = new Parser(new Lexer(new Page(html)));
 try {
  return parser.parse(null);
 } catch (ParserException e) {
  throw new SlimError(e);
 }
}

Code example source: com.github.tcnh/fitnesse

private NodeList getMatchingTags(NodeFilter filter) throws Exception {
 String html = examiner.html();
 Parser parser = new Parser(new Lexer(new Page(html)));
 NodeList list = parser.parse(null);
 NodeList matches = list.extractAllNodesThatMatch(filter, true);
 return matches;
}

Code example source: ScienJus/pixiv-crawler

/**
 * Extract a single image
 * @param pageHtml
 * @return
 */
public String parseMedium(String pageHtml) {
  try {
    Parser parser = new Parser(pageHtml);
    NodeFilter filter = new AndFilter(new TagNameFilter("img"),new HasAttributeFilter("class","original-image"));
    NodeList list = parser.parse(filter);
    if (list.size() > 0) {
      return ((ImageTag)list.elementAt(0)).getAttribute("data-src");
    }
  } catch (ParserException e) {
    logger.error(e.getMessage());
  }
  return null;
}

Code example source: ScienJus/pixiv-crawler

/**
 * Find the URL of the next page in the search results
 * @param pageHtml
 * @return
 */
public String parseNextPage(String pageHtml) {
  try {
    Parser parser = new Parser(pageHtml);
    NodeFilter filter = new AndFilter(new TagNameFilter("a"),new HasAttributeFilter("rel","next"));
    NodeList list =  parser.parse(filter);
    if(list.size() > 0) {
      return ((LinkTag)list.elementAt(0)).getLink();
    }
  } catch (ParserException e) {
    logger.error(e.getMessage());
  }
  return null;
}
