Edit: After trying the first answer, I still get text that is cut off at the interrupting tag (just as before). Here is the code:
for event, element in ET.iterparse(path):
    if element.tag == "idsText":
        # move sentences from stack
        if sents:
            cat_texts.extend(sents)
            visited[current_doc] = visited.get(current_doc, 0) + 1
            # another function call
            annotate(cat_texts, filename, current_doc, visited[current_doc])
            # reset cat_texts
            cat_texts = []
        # set new current document's name
        current_doc = element.get("n")
        sents = []
    # new sentence starts
    elif element.tag == "s" and isinstance(element.text, str):
        if element.text.strip():
            sentence = ' '.join(element.itertext())
            new_sent.extend(nltk.word_tokenize(sentence, language='german'))
            sents.append(new_sent)
            new_sent = []
    element.clear()
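I have some very large XML files (> 6 GB) that I want to extract text from. The problem is that the text is interrupted by other tags. Here is a dummy file to give you an idea: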
<doc>
<s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s>
</doc>
I want to get this:
Here you can see a reference in the text.
I use the following code (although I have left out the details here):
import xml.etree.ElementTree as ET

for event, element in ET.iterparse(path):
    if element.tag == "doc":
        # do something here
        pass
    elif element.tag == "s" and isinstance(element.text, str):
        if element.text.strip():
            # again do something here
            pass
    element.clear()
Applied to the dummy file, this code produces the following result:
Here you can see a
I know that itertext() produces the output I want:
import xml.etree.ElementTree as ET
myxml = '<doc><s> Here you can see a <ref target="SOME_URL" targOrder="u">reference</ref> in the text. </s></doc>'
tree = ET.fromstring(myxml)
print(''.join(tree.itertext()))
Output:
Here you can see a reference in the text.
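But I can't figure out a way to combine it with iterparse() (or any other incremental parsing method), and because of the file size I can't parse the XML into a tree all at once. As far as I can tell, doing it incrementally means itertext() won't work, because when the <s> element is parsed, the tag that follows inside it (in this case <ref>) has not been parsed yet.
Is there a way to get all the text in an element, with the tags stripped, while parsing incrementally?
Thanks a lot!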
2 answers
pftdvrlh #1
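You can use the itertext method to recursively iterate over all of the text content contained in an element. If we rewrite the code like this:
then, given the sample input, we get the following output: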
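A minimal sketch of what such a rewrite could look like, assuming the default "end" events of iterparse(): by the time the "end" event for <s> fires, its children (such as <ref>) have already been parsed, so itertext() can see their text and tails. The helper name iter_sentences is hypothetical and not from the original answer. The sketch clears only <s> elements; calling clear() on every element would reset the text and tail of <ref> before <s> is complete, which may explain truncated output like that described in the question's edit.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentences(source):
    """Yield the tag-stripped text of each <s> element, parsed incrementally.

    `iter_sentences` is a hypothetical helper name used for illustration.
    """
    # iterparse() reports only "end" events by default, so each <s>
    # arrives with all of its child elements already parsed.
    for event, element in ET.iterparse(source):
        if element.tag == "s":
            text = "".join(element.itertext()).strip()
            if text:
                yield text
            # Free memory only after the whole sentence has been read.
            # Clearing child elements earlier would drop their text and tail.
            element.clear()


sample = (
    b'<doc><s> Here you can see a '
    b'<ref target="SOME_URL" targOrder="u">reference</ref>'
    b' in the text. </s></doc>'
)

for sentence in iter_sentences(io.BytesIO(sample)):
    print(sentence)  # prints: Here you can see a reference in the text.
```

Note that the emptied <s> elements still stay attached to their parent, so for very large files some further cleanup of the parent may be needed; that is not handled in this sketch.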
q8l4jmvw #2
If you also want to see the tags:
Output:
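The code from this answer is not preserved here. One possible way to keep the tags, assuming "seeing the tags" means retaining the inline markup of each sentence, is to serialize the completed <s> element with ET.tostring() on its "end" event. The helper name iter_sentence_markup is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentence_markup(source):
    """Yield each <s> element serialized as a string, inline tags included.

    `iter_sentence_markup` is a hypothetical helper name used for illustration.
    """
    for event, element in ET.iterparse(source):
        if element.tag == "s":
            # encoding="unicode" makes tostring() return str instead of bytes.
            yield ET.tostring(element, encoding="unicode").strip()
            # Clear only after the complete sentence has been serialized.
            element.clear()


sample = (
    b'<doc><s> Here you can see a '
    b'<ref target="SOME_URL" targOrder="u">reference</ref>'
    b' in the text. </s></doc>'
)

for markup in iter_sentence_markup(io.BytesIO(sample)):
    print(markup)
```

The printed string contains the <s> and <ref> tags around the text; the order in which attributes are written can differ between Python versions.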