python-3.x: Export text from XML with self-closing tags

Posted by tvokkenx on 2023-02-26 in Python


I have a set of TEI XML files containing transcribed documents. I want to parse these files and extract only the text.
My XML looks like this:

<?xml version='1.0' encoding='UTF8'?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <ab>
        <pb n="page1"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
            <lb xml:id="DD3" n="3"/>my sentence 3
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 4
            <lb xml:id="DD2" n="2"/>my sentence 5
        <pb n="page2"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 3
            <lb xml:id="DD1" n="2"/>my sentence 4
      </ab>
    </body>
  </text>
</TEI>

I have tried to access this information with lxml, as follows:

with open(file,'r') as my_file:
    
    root = ET.parse(my_file, parser = ET.XMLParser(encoding = 'utf-8'))
    list_pages = root.findall('.//{http://www.tei-c.org/ns/1.0}pb')
    for page in list_pages:
        liste_text = page.findall('.//{http://www.tei-c.org/ns/1.0}lb')
    
    final_text = []
    
    for content in liste_text:
        final_text.append(content.text)

In the end, I would like to get something like this:

page1
my sentence 1
my sentence 2
my sentence 3
my sentence 4
my sentence 5
page2
my sentence 1
my sentence 2
my sentence 3
my sentence 4

I can access the lb elements successfully, but no text is attached to them. Could you help me extract this information? Thanks.


Source: https://stackoverflow.com/questions/75496091/export-text-from-xml-with-self-closing-tag

1 Answer


4c8rllxm1#

Note that your XML may have a problem, because several xml:id attributes share the same value.
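One quick way to check for repeated xml:id values is sketched below. It uses the standard library's ElementTree rather than lxml, and the helper name `duplicate_xml_ids` and the shortened `sample` document are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree stores the xml:id attribute under this namespaced key
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_xml_ids(xml_text):
    """Return the xml:id values that occur more than once, sorted."""
    root = ET.fromstring(xml_text)
    ids = (el.get(XML_ID) for el in root.iter())
    counts = Counter(value for value in ids if value is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Shortened, hypothetical version of the document in the question
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><ab>
<pb n="page1"/><cb n="1"/>
<lb xml:id="DD1" n="1"/>my sentence 1
<lb xml:id="DD2" n="2"/>my sentence 2
<cb n="2"/>
<lb xml:id="DD1" n="1"/>my sentence 4
</ab></body></text></TEI>"""

print(duplicate_xml_ids(sample))
```

An empty list means every xml:id in the file is unique.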
Assuming that is fixed, this is easier with lxml than with ElementTree, because lxml has better XPath support:

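from lxml import etree

# my_file is the path to (or an open file object for) the TEI document
root = etree.parse(my_file)
for p in root.xpath('//*[name()="pb"]'):
    # the page label is the n attribute of the <pb/> element
    print(p.xpath('./@n')[0].strip())
    # walk the siblings that follow this <pb/>, skipping column breaks
    for lb in p.xpath('./following-sibling::*[not(name()="cb")]'):
        if lb.xpath('name()') == "pb":
            break  # the next page starts here
        # the sentence is the tail of the self-closing <lb/>, not its text
        print((lb.tail or "").strip())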

The output should match what you expect.
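The reason `content.text` is empty in the question is that a self-closing `<lb/>` has no content of its own; the sentence that follows it is stored in the element's `tail`. As a minimal sketch of the same idea without lxml, assuming the standard library's ElementTree is acceptable, the document can be walked in order and the tails collected. The function name `pages_to_lines` and the shortened `sample` are hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace, written in ElementTree's {uri}tag notation
TEI = "{http://www.tei-c.org/ns/1.0}"


def pages_to_lines(xml_text):
    """Return page labels and sentences in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for el in root.iter():  # iter() visits elements in document order
        if el.tag == TEI + "pb":
            lines.append(el.get("n"))  # page label from the n attribute
        elif el.tag == TEI + "lb":
            # el.text is None for a self-closing tag; the sentence is in el.tail
            tail = (el.tail or "").strip()
            if tail:
                lines.append(tail)
    return lines


# Shortened, hypothetical version of the document in the question
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><ab>
<pb n="page1"/><cb n="1"/>
<lb n="1"/>my sentence 1
<lb n="2"/>my sentence 2
<cb n="2"/>
<lb n="1"/>my sentence 4
<pb n="page2"/><cb n="1"/>
<lb n="1"/>my sentence 1
</ab></body></text></TEI>"""

print("\n".join(pages_to_lines(sample)))
```

Because this version never looks up sibling axes, it needs no XPath support beyond what ElementTree provides, at the cost of assuming that every sentence directly follows an `<lb/>`.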

Answered on 2023-02-26
