Node.js: getting the text of the preceding headings in HTML

Posted by py49o6xq on 2023-01-04 in Node.js

I have some HTML that looks like this:

<h1>Title</h1>
<p>Some additional content, can be multiple, various tags</p>
<h2><a id="123"></a>Foo</h2>
<p>Some additional content, can be multiple, various tags</p>
<h3><a id="456"></a>Bar</h3>

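Now, for every anchor that has an id, I want to find its heading hierarchy. For example, for the anchor with id="123" I want to get something like [{level: 1, title: "Title"}, {level: 2, title: "Foo"}], and similarly, for the anchor with id="456" I want [{level: 1, title: "Title"}, {level: 2, title: "Foo"}, {level: 3, title: "Bar"}].
My code looks like this: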

const linkModel: IDictionary<ILinkModelEntry> = {};
const $ = cheerio.load(html);
$("a").each((_i, elt) => {
    const anchor = $(elt);
    const id = anchor.attr().id;
    if (id) {
        const parent = anchor.parent();
        const parentTag = parent.prop("tagName");
        let headerHierarchy: any[] = [];
        if (["H1", "H2", "H3", "H4", "H5", "H6"].includes(parentTag)) {
            let level = parseInt(parentTag[1]);
            headerHierarchy = [{level, text: parent.text()}];
            level--;
            while (level > 0) {
                const prevHeader = parent.prev("h" + level);
                const text = prevHeader.text();
                headerHierarchy.unshift({level, text});
                level--;
            }
        }
        linkModel["#" + id] = {originalId: id, count: count++, headerHierarchy};
    }
});

What am I doing wrong, given that

const prevHeader = parent.prev("h" + level);
const text = prevHeader.text();

always returns an empty string (i.e. "")?

Answer #1, by d5vmydt9

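If I understand correctly, you want to capture the hierarchy. If your example had another <h1> followed by more <h2> and <h3> elements, you would want to pop the stack of parents back down to the new <h1> level, so that later <h2> and <h3> children link to it, rather than keep an array of every element all the way back to the first <h1>Title</h1>.
Here is one approach: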

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const html = `
<h1>Title</h1>
<p>Some additional content, can be multiple, various tags</p>
<h2><a id="123"></a>Foo</h2>
<p>Some additional content, can be multiple, various tags</p>
<h3><a id="456"></a>Bar</h3>
<h1>Another Title</h1>
<h2><a id="xxx"></a>Foo 2</h2>
<h3><a id="yyy"></a>Bar 2</h3>`;

const $ = cheerio.load(html);
const result = {};
const stack = [];

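// Visit every heading in document order.
[...$("h1,h2,h3,h4,h5,h6")].forEach(e => {
  // "H2" -> 2
  const level = +$(e).prop("tagName")[1];

  // Discard headings at the same or a deeper level: they are not ancestors.
  while (stack.length && level <= stack.at(-1).level) {
    stack.pop();
  }

  // Always true after the loop above; it only has an effect if that loop is removed.
  if (!stack.length || level >= stack.at(-1).level) {
    stack.push({level, title: $(e).text()});
  }

  // Snapshot the current chain for headings that contain an anchor with an id.
  if ($(e).has("a[id]").length) {
    const id = $(e).find("a[id]").attr("id");
    result[`#${id}`] = [...stack];
  }
});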

console.log(result);

Output:

{
  '#123': [ { level: 1, title: 'Title' }, { level: 2, title: 'Foo' } ],
  '#456': [
    { level: 1, title: 'Title' },
    { level: 2, title: 'Foo' },
    { level: 3, title: 'Bar' }
  ],
  '#xxx': [
    { level: 1, title: 'Another Title' },
    { level: 2, title: 'Foo 2' }
  ],
  '#yyy': [
    { level: 1, title: 'Another Title' },
    { level: 2, title: 'Foo 2' },
    { level: 3, title: 'Bar 2' }
  ]
}
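
The stack logic can also be tried without cheerio. The following is only a minimal sketch under the assumption that the headings have already been extracted into a plain array in document order; buildHierarchies and the {level, title, id} record shape are hypothetical names chosen for illustration, not part of cheerio or of the code above.

```javascript
// Hypothetical helper: `headings` is a plain array in document order,
// e.g. produced beforehand by any HTML parser.
function buildHierarchies(headings) {
  const result = {};
  const stack = [];

  for (const { level, title, id } of headings) {
    // Drop headings at the same or a deeper level: they are not ancestors.
    while (stack.length && level <= stack[stack.length - 1].level) {
      stack.pop();
    }
    stack.push({ level, title });

    // Record a snapshot of the current chain for headings that carry an id.
    if (id) {
      result[`#${id}`] = [...stack];
    }
  }
  return result;
}

const hierarchies = buildHierarchies([
  { level: 1, title: "Title" },
  { level: 2, title: "Foo", id: "123" },
  { level: 3, title: "Bar", id: "456" },
  { level: 1, title: "Another Title" },
  { level: 2, title: "Foo 2", id: "xxx" },
]);

console.log(JSON.stringify(hierarchies["#xxx"]));
// [{"level":1,"title":"Another Title"},{"level":2,"title":"Foo 2"}]
```

Keeping the traversal separate from the stack handling like this makes the pop-and-push rule easy to check on small hand-written inputs.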

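If you really do want the whole chain of ancestors linearly back to the first heading, remove the while loop (though that is unlikely to be your intent).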
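
As for the empty string in the question: in cheerio, as in jQuery, .prev(selector) looks only at the immediately preceding sibling and returns it only if it matches the selector. In the sample HTML the sibling directly before the <h2> is a <p>, so parent.prev("h1") should be an empty selection and .text() on it yields "". Something like parent.prevAll("h1").first() would instead walk back to the nearest earlier <h1> sibling. The sketch below only models that difference on a plain array of tag names; prevOnly and nearestPrev are hypothetical helpers, not cheerio APIs.

```javascript
// Sibling tag names in document order, mirroring the sample HTML.
const siblings = ["h1", "p", "h2", "p", "h3"];

// Models .prev(selector): inspect only the sibling directly before `index`.
function prevOnly(index, tag) {
  const candidate = siblings[index - 1];
  return candidate === tag ? index - 1 : -1;
}

// Models .prevAll(selector).first(): walk backwards to the nearest match.
function nearestPrev(index, tag) {
  for (let i = index - 1; i >= 0; i--) {
    if (siblings[i] === tag) return i;
  }
  return -1;
}

const h2Index = siblings.indexOf("h2");
console.log(prevOnly(h2Index, "h1"));    // -1: the direct neighbour is a <p>
console.log(nearestPrev(h2Index, "h1")); // 0: the <h1> two siblings back
```

Note that a nearest-match lookup can still pick up a heading from an earlier section (for example an <h3> that follows a new <h1> with no <h2> in between), which is the case the stack approach avoids.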
