How do I handle missing values when parsing HTML with jsoup?

cbeh67ev  posted on 2023-05-12 in Other

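I have a program that is supposed to scrape a URL and collect the values of certain elements. An element looks like this:
VALUE1 - VALUE3
or like this:
VALUE1/VALUE2 - VALUE3
The HTML looks like this: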

`<ul>
 <li></li>
 <li></li>
 <li></li>
 <li></li>
 <li></li>
 <li>
      <a><span>VALUE1</span></a>
       -
      <span>VALUE3</span>
            </li>

 </ul>`

Or it can look like this:

`<ul>
      <li></li>
      <li></li>
      <li></li>
      <li></li>
      <li></li>
      <li>
      <a><span>VALUE1</span></a>
       /
      <a><span>VALUE2</span></a>
       -
      <span>VALUE3</span>
            </li>

 </ul>`

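I need the first and third values, so I first scrape every VALUE1 into an ArrayList and then do the same for VALUE3. The problem is that VALUE3 is not always there! Sometimes the VALUE3 span tag simply does not exist on the page.
So when I merge VALUE1 and VALUE3 from the two lists into one, the first list may be longer than the second, which can cause an IndexOutOfBoundsException. I was thinking I could add some placeholder to the second list whenever VALUE3 is missing. How would I do that? I use the following code to scrape: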

for (Element row : parse.select("ul>li>a>span")) { // scrape the first value
    String ing = row.getAllElements().text();
    ingDebug1.add(ing);
    debug = i;
}

for (Element row : parse.select("ul>li>span")) { // scrape the third value
    String ing = row.getAllElements().text();
    ingDebug2.add(ing); // shorter than ingDebug1 whenever a VALUE3 span is missing
    debug = i;
}

for (int k = 0; k < ingDebug1.size(); k++) { // put them together
    ingShowUp.add(ingDebug1.get(k));
    ingShowUp.add(ingDebug2.get(k)); // throws IndexOutOfBoundsException once ingDebug2 runs out
}

hrysbysz1#

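Here is a solution that offers two ways of filtering the li nodes so that you only capture those that have a non-empty span as one of their direct child nodes. (Change the line .filter(TopicAndDescriptionParser::isValidNode2) to .filter(TopicAndDescriptionParser::isValidNode1) to verify that both produce the same result.)
You did not say how you want to collect the relevant data from each li, so I created the record TopicAndDescription to encapsulate it.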

import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class TopicAndDescriptionParser {

    record TopicAndDescription(String topic, String description) {
    }

    public static void main(String[] args) {
        String snippet = """
            <ul>
                 <li></li>
                 <li></li>
                 <li></li>
                 <li></li>
                 <li>
                      <a><span>VALUE0</span></a>
                       -
                 </li>
                 <li>
                      <a><span>VALUE1</span></a>
                       -
                      <span></span>
                  </li>
                  <li>
                      <a><span>VALUE2</span></a>
                       -
                      <span>VALUE3</span>
                  </li>
                  <li>
                      <a><span>VALUE4</span></a>
                       /
                      <a><span>VALUE5</span></a>
                       -
                      <span>VALUE6</span>
                    </li>
             </ul>""";

        final Elements elements = Jsoup.parse(snippet).select("ul>li");
        final List<TopicAndDescription> list = elements.stream()
            .filter(TopicAndDescriptionParser::isValidNode2)
            .map(TopicAndDescriptionParser::getTopicAndDescription)
            .collect(Collectors.toList());
        System.out.println(list);
    }

    private static boolean isValidNode1(Element elem) {
        /*
         * Node names from your snippet will be: 
         *      #text, a, #text, span, #text
         * or
         *      #text, a, #text, a, #text, span, #text
         *
         * So, check that the *next to last* node name is "span"
         * 
         * Potential drawback: it assumes the index of the "span" node. Any
         * unexpected variation in the HTML may cause it to yield wrong results.
         */
        List<Node> childNodes = elem.childNodes();
        int size = childNodes.size();
        // Guard the cast: the next-to-last node may be a text node, not an Element.
        return size > 2
            && childNodes.get(size - 2) instanceof Element span
            && "span".equals(span.tagName())
            && !span.text().isEmpty();
    }

    private static boolean isValidNode2(Element element) {
        /*
         * Is there a span node in the child nodes?
         * 
         * Pro: no indexing; con: using a stream might be considered heavyweight.
         */
        Optional<Node> spanNode = element.childNodes().stream()
            .filter(node -> "span".equals(node.nodeName()))
            .findFirst();
        
        return spanNode.isPresent() && !((Element)spanNode.get()).text().isEmpty();
    }

    private static TopicAndDescription getTopicAndDescription(Element element) {
        String topic = element.select("a:first-child").text();
        Elements spans = element.select("span");
        String description = spans.get(spans.size() - 1).text();
        return new TopicAndDescription(topic, description);
    }

}

The output is

[TopicAndDescription[topic=VALUE2, description=VALUE3], TopicAndDescription[topic=VALUE4, description=VALUE6]]
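
The filter above drops every li that has no usable third value. The question also asks about adding something to the second list when VALUE3 is missing, so here is a minimal sketch of that variant. It walks the li elements one at a time and substitutes a placeholder, which keeps the first and third values paired. The class name MissingValueDemo, the method pairValues, the sample HTML and the "N/A" placeholder are all assumptions made for illustration, and the sketch needs jsoup on the classpath (selectFirst requires a reasonably recent jsoup release).

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class MissingValueDemo {

    // Hypothetical placeholder; use whatever marker suits your data.
    static final String PLACEHOLDER = "N/A";

    // Sample markup modelled on the snippets in this thread (an assumption).
    static final String SAMPLE_HTML = """
        <ul>
          <li></li>
          <li><a><span>VALUE0</span></a> - </li>
          <li><a><span>VALUE1</span></a> - <span></span></li>
          <li><a><span>VALUE2</span></a> - <span>VALUE3</span></li>
          <li><a><span>VALUE4</span></a> / <a><span>VALUE5</span></a> - <span>VALUE6</span></li>
        </ul>""";

    /** Returns first and third values alternating, one pair per li that has a first value. */
    static List<String> pairValues(String html) {
        List<String> result = new ArrayList<>();
        for (Element li : Jsoup.parse(html).select("ul > li")) {
            // First value: the first span nested inside an anchor.
            Element first = li.selectFirst("a > span");
            if (first == null) {
                continue; // empty li, nothing to pair
            }
            // Third value: a span that is a direct child of the li and has text.
            String third = li.children().stream()
                .filter(child -> "span".equals(child.tagName()))
                .map(Element::text)
                .filter(text -> !text.isEmpty())
                .findFirst()
                .orElse(PLACEHOLDER);
            result.add(first.text());
            result.add(third);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairValues(SAMPLE_HTML));
    }
}
```

Because both values are read from the same li in one pass, the result always holds an even number of entries, so the index mismatch between two separately built lists cannot occur. If the rows without a third value should be skipped instead, the filtering approach above is the better fit.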
