How can I extract text from very large XML files in Python without it being cut off at nested tags, while parsing incrementally?

Posted by xkrw2x1b on 2023-05-27 in Python

Edit: After trying the first answer, I still get text that is cut off at nested tags (just like before). Here is the code:

for event, element in ET.iterparse(path):
    if element.tag == "idsText":
        # move sentences from stack
        if sents:
            cat_texts.extend(sents)
            visited[current_doc] = visited.get(current_doc, 0) + 1
            # another function call
            annotate(cat_texts, filename, current_doc, visited[current_doc])
            # reset cat_texts
            cat_texts = []
        # set new current document's name
        current_doc = element.get("n")
        sents = []
    # new sentence starts
    elif element.tag == "s" and type(element.text) == str:
        if element.text.strip():
            sentence = ' '.join(element.itertext())
            new_sent.extend(nltk.word_tokenize(sentence, language='german'))
            sents.append(new_sent)
            new_sent = []

    element.clear()

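I have some very large XML files (> 6 GB) and want to extract text from them. The problem is that the text is interrupted by other tags. Here is a dummy file to give you an idea: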

<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>

I would like to get this:

Here you can see a reference in the text.

I am using the following code (although I have omitted the details here):

import xml.etree.ElementTree as ET
for event, element in ET.iterparse(path):
    if element.tag == "doc":
        pass  # do something here

    elif element.tag == "s" and type(element.text) == str:
        if element.text.strip():
            pass  # again do something here

    element.clear()

Applied to the dummy file, the code produces the following result:

Here you can see a

I know that itertext() would produce the output I want:

import xml.etree.ElementTree as ET
myxml = '<doc><s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s></doc>'
tree = ET.fromstring(myxml)
print(''.join(tree.itertext()))

Output:

Here you can see a reference in the text.

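But I can't figure out a way to combine it with iterparse() (or any other incremental parsing method), because the file is too large to parse into a tree in one go. Doing it incrementally means that itertext() will not work, because when the element containing the nested tag is parsed, the following tag (in this case <ref>) has not been parsed yet.
Is there a way to get all the text inside an element and strip the tags while parsing incrementally?
Many thanks!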


Answer 1, by pftdvrlh

You can use the itertext method to iterate recursively over all the text content contained in an element. If we rewrite your code like this:

import xml.etree.ElementTree as ET
for event, element in ET.iterparse('data.xml'):
    if element.tag == 's':
        print(' '.join(element.itertext()))

Then, given your example input, we get the following output:

Here you can see a  reference  in the text.
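
One likely reason the loop in the question's edit still loses text: it calls element.clear() on every element, and clear() resets text and tail as well as children, so <ref> is emptied at its own "end" event before the enclosing <s> is finished. Below is a minimal sketch under that assumption. The names iter_sentences and depth are made up for this illustration, and io.BytesIO stands in for the real file path. It clears an element only once it is no longer needed by an open <s>, and joins with '' plus whitespace normalization to avoid the doubled spaces seen above.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentences(source, tag="s"):
    """Yield the whitespace-normalized text of every <tag> element in source."""
    depth = 0  # number of <tag> elements that are currently open
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if element.tag == tag:
                depth += 1
            continue
        if element.tag == tag:
            depth -= 1
            # At the "end" event the subtree is complete, so nested text is there.
            text = "".join(element.itertext())
            element.clear()  # free the sentence only after reading it
            yield " ".join(text.split())
        elif depth == 0:
            # Not inside a sentence: safe to free this element right away.
            element.clear()
        # Elements nested inside an open <tag> (such as <ref>) are left alone.


sample = (
    b'<doc><s> Here you can see a '
    b'<ref target="SOME_URL" targOrder="u">reference</ref>'
    b' in the text. </s></doc>'
)
sentences = list(iter_sentences(io.BytesIO(sample)))
print(sentences)  # ['Here you can see a reference in the text.']
```

The same generator should accept a file path in place of the BytesIO object, since iterparse takes either a filename or a file object.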

Answer 2, by q8l4jmvw

If you also want to see the tags:

import xml.etree.ElementTree as ET
import io
from html.parser import HTMLParser

xml="""<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""

infile = io.StringIO(xml)

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if len(attrs) != 0:
            print("Encountered a start tag:", tag, attrs)
        else:
            print("Encountered a start tag:", tag)
            
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

for event, element in ET.iterparse(infile, events=("end",)):
    if event == "end" and element.tag == 's':
        print(ET.tostring(element).decode("utf-8"))
        print(parser.feed(ET.tostring(element).decode("utf-8")))

Output:

<s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>

Encountered a start tag: s
Encountered some data  :  Here you can see a 
Encountered a start tag: ref [('target', 'SOME_URL'), ('targorder', 'u')]
Encountered some data  : reference
Encountered an end tag : ref
Encountered some data  :  in the text. 
Encountered an end tag : s
Encountered some data  : 

None
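
Note that HTMLParser lowercases attribute names, which is why targOrder shows up as 'targorder' above. If the original spelling matters, the same start/data/end sequence can be read straight from the parsed element through its text and tail attributes, without serializing and re-parsing. A rough sketch of that idea follows; walk is a hypothetical helper written for this illustration, not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def walk(element):
    """Yield ("start", tag, attrs), ("data", text) and ("end", tag) tuples in document order."""
    yield ("start", element.tag, dict(element.attrib))
    if element.text:
        yield ("data", element.text)
    for child in element:
        yield from walk(child)
        if child.tail:
            # Text that follows a child's closing tag is stored on the child as .tail.
            yield ("data", child.tail)
    yield ("end", element.tag)


xml = b"""<doc>
    <s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>"""

events = []
for _, element in ET.iterparse(io.BytesIO(xml), events=("end",)):
    if element.tag == "s":
        events.extend(walk(element))
        element.clear()  # release the sentence once it has been walked

for item in events:
    print(item)
```

Joining only the "data" items gives back the plain sentence text, so the same walk can serve both purposes.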
