Python lxml does not parse the text after a comment

Posted by jtw3ybtb on 2023-06-28 in Python

I want to wrap tag.text in CDATA:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

But when the tag contains comments, tag.text only holds the text before the first comment:

from lxml import etree

parser = etree.XMLParser()
#parser = etree.XMLParser(remove_comments=True)
tree = etree.parse("./data.xml", parser)
root = tree.getroot()

for tag in root.findall("tag"):
    tag.text = etree.CDATA(tag.text)

tree.write("./result.xml",
           encoding = "utf-8",
           xml_declaration = True)

I get this (tag.text = some data):

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <tag><![CDATA[
    some data
    ]]><!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

How can I fix this?

whlutmcx #1

Consider using saxonche and XSLT 3.0:

from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()

    # Apply the stylesheet to the input file and write the serialized result
    xslt30_processor.transform_to_file(
        source_file='sample1.xml',
        stylesheet_file='serialize-wrap-in-cdata1.xsl',
        output_file='result-sample1.xml',
    )

The XSLT 3.0 stylesheet is, for example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">

  <xsl:param name="cdata-tag-names" as="xs:string*" static="yes" select="'tag'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" _cdata-section-elements="{$cdata-tag-names}"/>

  <xsl:template _match="{$cdata-tag-names => string-join(' | ')}">
    <xsl:copy>{serialize(node())}</xsl:copy>
  </xsl:template>

</xsl:stylesheet>

sample1.xml is your input:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

Public Gist with the files: https://gist.github.com/martin-honnen/61b91233fd73369d55f392ad4a0cee0b

vsdwdz23 #2

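If you want to concatenate all the text inside the <tag> element, you can use str.join with the element's itertext() method. This joins all the text, including whitespace, before it is passed to CDATA.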

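for tag in root.findall("tag"):
    text = ''.join(tag.itertext())  # tag.text plus the tail text after each comment
    del tag[:]  # drop the comment children so their tails are not written twice
    tag.text = etree.CDATA(text)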

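In your example, the comments are treated as children of the <tag> element. The text that follows each comment is stored as that comment's tail, and itertext() iterates over the tail text as well.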
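
To make the text/tail split concrete, here is a minimal sketch on a shortened in-memory document (the XML string and variable names are made up for illustration). It shows where lxml stores each piece of text and what itertext() yields:

```python
from lxml import etree

# Shortened stand-in for the document in the question (illustrative only)
XML = b"<root><tag>first<!-- c1 -->middle<!-- c2 -->last</tag></root>"

tag = etree.fromstring(XML).find("tag")

head = tag.text                          # text before the first child
tails = [child.tail for child in tag]    # text that follows each comment
is_comment = [child.tag is etree.Comment for child in tag]
joined = "".join(tag.itertext())         # head plus all tails

print(repr(head), tails, is_comment, repr(joined))
```

The comment bodies are not part of the joined string, so this approach fits only when the comments can be dropped from the CDATA content.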

arknldoa #3

I found a hacky way to capture and modify the text, comments and tails together:

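tmp = etree.tostring(tag, with_tail=False).decode()
# strip the outer <tag> ... </tag> so only the inner markup remains
tmp = tmp[tmp.index(">") + 1:tmp.rindex("</")]
tail = tag.tail  # clear() also resets the tail, so save it first
tag.clear()
tag.tail = tail
tag.text = etree.CDATA(tmp)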

If anyone knows a more correct or elegant way to do this (for example, something like tag.all), please post it.

l2osamch #4

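Iterate over the tag element and build a string from its text, the serialized form of each comment element (without its tail), and any tail text (including indentation). Then remove the child elements and fill the tag element with the CDATA-wrapped text.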

from lxml import etree

parser = etree.XMLParser()
tree = etree.parse("tmp.xml", parser)
root = tree.getroot()

for s in root.findall("tag"):
    t = s.text or ""
    for ele in list(s):
        # serialized comment without its tail, then the tail text
        t += etree.tostring(ele, with_tail=False).decode("utf8")
        t += ele.tail or ""
        # remove the child (lxml removes its tail along with it)
        s.remove(ele)
    s.text = etree.CDATA(t)

print(etree.tostring(tree, with_tail=True).decode("utf8"))

Result:

<root>
  <tag><![CDATA[
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  ]]></tag>
</root>
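
The same idea can be sketched as a reusable helper that runs on an in-memory document. The function name wrap_children_in_cdata and the sample XML are hypothetical, chosen only for illustration:

```python
from lxml import etree

def wrap_children_in_cdata(elem):
    """Replace elem's content with one CDATA section holding its serialized content."""
    parts = [elem.text or ""]
    for child in list(elem):
        # Serialize the child (here: a comment) without its tail, then add the tail
        parts.append(etree.tostring(child, with_tail=False, encoding="unicode"))
        parts.append(child.tail or "")
        elem.remove(child)  # lxml removes the child's tail together with it
    elem.text = etree.CDATA("".join(parts))

# Shortened stand-in for the document in the question (illustrative only)
root = etree.fromstring(b"<root><tag>a<!-- c1 -->b<!-- c2 -->c</tag></root>")
for tag in root.findall("tag"):
    wrap_children_in_cdata(tag)

result = etree.tostring(root, encoding="unicode")
print(result)
```

One caveat: a single CDATA section cannot contain the sequence "]]>", so content with that sequence would need separate handling.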

gwo2fgha #5

xml.etree.ElementTree has ET.iterparse(), which reports parse events, including comments:

import xml.etree.ElementTree as ET
from io import StringIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = StringIO(xml_file)

for event, elem in ET.iterparse(f, events=('start','comment')):
    if event == 'start' and elem.tag == 'tag':
        print('Text start', elem.text)
    # check the event name rather than the repr of elem.tag
    if event == 'comment':
        print("Comment", elem.text)

Output:

Text start 
    some data 1
    
    
    some data 4
  
Comment  some data2 
Comment  some data3
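
The merged text in the output above appears because ElementTree's default TreeBuilder discards comments instead of inserting them into the tree. A minimal sketch, assuming Python 3.8 or later (where TreeBuilder accepts insert_comments), compares both modes; the helper name parse and the sample XML are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the document above (illustrative only)
XML = "<root><tag>a<!-- c1 -->b<!-- c2 -->c</tag></root>"

def parse(xml_text, keep_comments):
    """Parse xml_text, optionally keeping comments as child nodes."""
    builder = ET.TreeBuilder(insert_comments=keep_comments)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(xml_text, parser=parser)

dropped = parse(XML, keep_comments=False).find("tag")
kept = parse(XML, keep_comments=True).find("tag")

print(repr(dropped.text), len(dropped))        # all text merged, no children
print(repr(kept.text), [c.tail for c in kept]) # text split across text and tails
```

With comments kept, ElementTree splits the text into text and tails in the same way lxml does by default.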

Here is the lxml adaptation:

from lxml import etree
from io import BytesIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = BytesIO(xml_file.encode('utf-8'))

for event, elem in etree.iterparse(f, events=('start','comment')):
    if event == 'start' and elem.tag == 'tag':
        print('Text start', elem.text)
    # check the event name rather than the repr of elem.tag
    if event == 'comment':
        print("Comment", elem.text, elem.tail)

Output:

Text start 
    some data 1
    
Comment  some data2  
    
Comment  some data3  
    some data 4
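
If the comments themselves do not need to survive, the remove_comments=True parser option that is commented out in the question may be enough. The sketch below, on a shortened illustrative document, assumes that dropping the comments is acceptable; tag.text then holds all of the element's text and can be wrapped directly:

```python
from lxml import etree

# Shortened stand-in for the document in the question (illustrative only)
XML = b"<root><tag>a<!-- c1 -->b<!-- c2 -->c</tag></root>"

# Comments are discarded at parse time, so no comment nodes split the text
parser = etree.XMLParser(remove_comments=True)
root = etree.fromstring(XML, parser)

for tag in root.findall("tag"):
    merged = tag.text                 # text from before, between and after the comments
    tag.text = etree.CDATA(merged)

result = etree.tostring(root, encoding="unicode")
print(result)
```

This trades the comments for simplicity; the approaches in the answers above are the ones to use when the comments must appear inside the CDATA section.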
