Splitting a large XML file into multiple files based on a tag in Python

njthzxwz  posted on 2022-12-25  in Python

I have a very large XML file that I need to split into several files based on a specific tag. The XML file looks like this:

<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>

I want to extract the content of each file and save it under a name based on its talkid.
Here is the code I have tried:

import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')

But I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

dgenwo3n  #1

If you are already using .iterparse(), relying only on the events is a more general approach:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        # Remember the pieces as their closing tags are reached.
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        # At </file>, write the output only if both pieces were found.
        elif element.tag == 'file' and title and content:
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(content)
    # A 'start' event for <file> resets the state for the next record.
    elif element.tag == 'file':
        content = title = None
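
Since the question describes the input as very large, it may also be worth releasing each <file> element once it has been written, so the parsed tree does not keep growing. The following is only a sketch under assumptions: the function name `split_talks` and the `out_dir` parameter are hypothetical, and it reads the values with `findtext()` at the end of each <file> instead of tracking state across events.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(xml_path, out_dir):
    """Write each <file>'s <content> to <talkid>.txt; return the paths written.

    Hypothetical helper for illustration, not part of the original answer.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag != 'file':
            continue
        # <talkid> sits under <head>, so spell out the path from <file>.
        title = elem.findtext('head/talkid')
        content = elem.findtext('content')
        if title and content:
            target = out_dir / (title.strip() + '.txt')
            target.write_text(content, encoding='utf-8')
            written.append(target)
        # Drop the children of the processed <file> to free memory.
        elem.clear()
    return written
```

A <file> element without a talkid or content is skipped rather than raising, which mirrors the `title and content` check above.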

**Update.** In a similar question, @Leila asked how to write the text from all <seekvideo> tags to the file instead of the text from <content>, so here is a solution:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

for event, element in context:
    if event == 'end':
        if element.tag == 'file' and title and parts:
            with open(all_talks.with_name(title + '.txt'), 'w') as f:
                f.write('\n'.join(parts))
        elif element.text:
            if element.tag == 'talkid':
                title = element.text
            elif element.tag == 'seekvideo':
                parts.append(element.text)
    elif element.tag == 'file':
        title = None
        parts = []
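
As a possible alternative to collecting the parts event by event, the <seekvideo> text can also be gathered from a complete <file> element with `Element.iter()`. This is a sketch only: `seekvideo_lines` is a hypothetical helper, and the second and third <seekvideo> entries in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET


def seekvideo_lines(file_elem):
    """Return the non-empty text of every <seekvideo> under one <file> element."""
    return [sv.text for sv in file_elem.iter('seekvideo') if sv.text]


# Made-up sample; only the first <seekvideo> comes from the question.
sample = ET.fromstring(
    '<file><head><talkid>2458</talkid><transcription>'
    '<seekvideo id="645">So in college,</seekvideo>'
    '<seekvideo id="646"></seekvideo>'
    '<seekvideo id="647">second line</seekvideo>'
    '</transcription></head></file>'
)

# Empty <seekvideo> elements have text None and are filtered out.
print('\n'.join(seekvideo_lines(sample)))
```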

dbf7pr2w  #2

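Try it like this.
The problem is that talkid is a child of the head tag, not of the file tag. Element.find() with a bare tag name only searches direct children, so elem.find('talkid') returns None and accessing .text on it raises the AttributeError.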

import xml.etree.ElementTree as ET

all_talks = 'file.xml'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())    # in which case you would remove the .encode()
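
To make the direct-child behaviour concrete, here is a small sketch comparing three lookups on a <file> element. The sample string is a trimmed-down copy of the structure in the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample mirroring the structure in the question.
sample = """<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
  </head>
  <content>some content</content>
</file>
</xml>"""

root = ET.fromstring(sample)
file_elem = root.find('file')

direct = file_elem.find('talkid')            # direct children only
nested = file_elem.find('head/talkid')       # explicit path through <head>
anywhere = file_elem.findtext('.//talkid')   # any descendant, returns the text

print(direct)        # None
print(nested.text)   # 2458
print(anywhere)      # 2458
```

Either the explicit path or the `.//` descendant search avoids the intermediate `head` lookup; `findtext()` also returns None instead of raising when the tag is missing.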

cbjzeqam  #3

You can use Beautiful Soup to parse the XML.
It should look something like this (I added a second talk id to the XML to demonstrate finding multiple tags):

xml_file = '''<xml>
<file id="13">
  <head>
    <talkid>2458</talkid>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
     <talkid>second talk id</talkid>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_file, "xml")

first_talk_id = soup.find('talkid').get_text()
talk_ids = soup.find_all('talkid')  # find_all is the current name; findAll is the legacy alias

print(first_talk_id)
# prints 2458

for talk in talk_ids:
    print(talk.get_text())

# prints 
# 2458
# second talk id

Note: bs4 needs a parser installed to handle XML: pip install lxml
