Python: How do I extract text from XML?

Posted by lymgl2op on 2023-05-05 in Python

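My goal is to extract every box, its shape, its text, and the swimlane the box sits in.
So far I have managed to extract every box and its shape. For some reason, the code finds the text element inside the box but does not display the text (it shows the object's memory location instead).
Any idea why?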

import xml.etree.ElementTree as ET

# Load the .VDX file
tree = ET.parse('Test VDX.vdx')
root = tree.getroot()

# Define the namespace used in the .VDX file
ns = {'visio': 'http://schemas.microsoft.com/visio/2003/core'}

# Find all shape elements in the .VDX file
pages = root.findall('.//visio:Page', ns)
shapes = root.findall('.//visio:Shape', ns)

# Iterate over the shapes and extract information
for page in pages:
    page_id = page.get('ID')
    page_name = page.get('NameU')
    print(f"Page ID: {page_id}, Name: {page_name}")

for shape in shapes:
    shape_id = shape.get('ID')
    shape_name = shape.get('Name')
    shape_type = shape.get('Type')
    shape_text_element = shape.find('.//visio:Text', ns)
    if shape_text_element is not None:
        shape_text = shape_text_element.text
    else:
        shape_text = 'TExt'
    print(f"Shape ID: {shape_id}, Name: {shape_name}, Type: {shape_type}, Text: {shape_text}")

The file I am working with is a .vdx file from Microsoft Visio, which is XML.

<Shape ID="21" Type="Shape" Name="Rectangle Fill:Marble.21">
      <XForm>
        <Angle>-0</Angle>
        <PinX>5.413385833333333</PinX>
        <PinY>6.181102430555556</PinY>
        <Width>1.377952777777778</Width>
        <Height>0.59055125</Height>
        <LocPinX>0.6889763888888889</LocPinX>
        <LocPinY>0.295275625</LocPinY>
      </XForm>
      <TextXForm>
        <TxtPinX F="Width*0.500000">1.322397232055664</TxtPinX>
        <TxtLocPinX F="Width*0.500000">1.322397232055664</TxtLocPinX>
        <TxtPinY F="Height*0.500000">0.4719645182291667</TxtPinY>
        <TxtLocPinY F="Height*0.500000">0.4719645182291667</TxtLocPinY>
        <TxtWidth F="Width*1">1.322397232055664</TxtWidth>
        <TxtHeight F="Height*1">0.4719645182291667</TxtHeight>
        <TxtAngle>-0</TxtAngle>
      </TextXForm>
      <Prop ID="0" NameU="Row_0">
        <Type>0</Type>
        <Value Unit="STR">Purchaser</Value>
        <Label/>
      </Prop>
      <Prop ID="1" NameU="Row_1">
        <Type>0</Type>
        <Value Unit="STR">0</Value>
        <Label>Cost</Label>
      </Prop>
      <Prop ID="2" NameU="Row_2">
        <Type>0</Type>
        <Value Unit="STR">0</Value>
        <Label>Duration</Label>
      </Prop>
      <Prop ID="3" NameU="Row_3">
        <Type>0</Type>
        <Value Unit="STR">0</Value>
        <Label>Resources</Label>
      </Prop>
      <Misc>
        <ObjType>1</ObjType>
      </Misc>
      <Line>
        <LinePattern>1</LinePattern>
        <LineWeight>0.00333333</LineWeight>
        <LineColor>0</LineColor>
        <LineColorTrans>0</LineColorTrans>
        <Rounding>0</Rounding>
        <LineCap>0</LineCap>
      </Line>
      <Fill>
        <FillPattern>1</FillPattern>
        <FillForegnd>#e8eef7</FillForegnd>
        <FillForegndTrans>0</FillForegndTrans>
        <ShdwPattern>0</ShdwPattern>
        <ShdwForegnd>#ffffff</ShdwForegnd>
        <ShdwForegndTrans>0</ShdwForegndTrans>
        <ShapeShdwType>1</ShapeShdwType>
        <ShapeShdwOffsetX>0.11811</ShapeShdwOffsetX>
        <ShapeShdwOffsetY>-0.11811</ShapeShdwOffsetY>
      </Fill>
      <Geom IX="0">
        <NoFill>0</NoFill>
        <NoLine>0</NoLine>
        <MoveTo IX="1">
          <X F="Width*0.000000">0</X>
          <Y F="Height*1.000000">0.5905512499999995</Y>
        </MoveTo>
        <LineTo IX="2">
          <X F="Width*1.000000">1.377952777777777</X>
          <Y F="Height*1.000000">0.5905512499999995</Y>
        </LineTo>
        <LineTo IX="3">
          <X F="Width*1.000000">1.377952777777777</X>
          <Y F="Height*0.000000">0</Y>
        </LineTo>
        <LineTo IX="4">
          <X F="Width*0.000000">0</X>
          <Y F="Height*0.000000">0</Y>
        </LineTo>
        <LineTo IX="5">
          <X F="Width*0.000000">0</X>
          <Y F="Height*1.000000">0.5905512499999995</Y>
        </LineTo>
      </Geom>
      <LayerMem>
        <LayerMember>0</LayerMember>
      </LayerMem>
      <Connection ID="0">
        <X F="Width*0.000000">0</X>
        <Y F="Width*0.214286">0.2952755555555563</Y>
        <Type>0</Type>
      </Connection>
      <Connection ID="1">
        <X F="Width*1.000000">1.377952777777777</X>
        <Y F="Width*0.214286">0.2952755555555563</Y>
        <Type>0</Type>
      </Connection>
      <Connection ID="2">
        <X F="Width*0.500000">0.6889763888888886</X>
        <Y F="Width*0.000000">0</Y>
        <Type>0</Type>
      </Connection>
      <Connection ID="3">
        <X F="Width*0.500000">0.6889763888888886</X>
        <Y F="Width*0.428571">0.5905512499999995</Y>
        <Type>0</Type>
      </Connection>
      <TextBlock>
        <LeftMargin>0.0277778</LeftMargin>
        <RightMargin>0.0277778</RightMargin>
        <TopMargin>0.0277778</TopMargin>
        <BottomMargin>0.0277778</BottomMargin>
        <VerticalAlign>1</VerticalAlign>
        <DefaultTabStop>0</DefaultTabStop>
      </TextBlock>
      <Char IX="0">
        <Font>0</Font>
        <Color>0</Color>
        <Style>0</Style>
        <Size>0.138889</Size>
        <ColorTrans>0</ColorTrans>
      </Char>
      <Para IX="0">
        <IndFirst>0</IndFirst>
        <IndLeft>0</IndLeft>
        <IndRight>0</IndRight>
        <SpLine>-1.2</SpLine>
        <SpBefore>0</SpBefore>
        <HorzAlign>1</HorzAlign>
      </Para>
      <Text><cp IX="0"/><pp IX="0"/>Attach Pos to invoice and complete coding form</Text>
    </Shape>

I have tried converting .text, and casting it to a string, but nothing worked.

Answer #1, by umuewwlo

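You need to concatenate the text attribute of the <Text> tag with the tail attribute of each of its child tags. In ElementTree, text that follows a child element is stored in that child's tail, not in the parent's text, which is why .text on <Text> comes back as None here. Since both text and tail can be None, the code should look something like this: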

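import xml.etree.ElementTree as ET

# Parse just the <Text> element from the question
textelem = ET.fromstring('''<Text><cp IX="0"/><pp IX="0"/>Attach Pos to invoice and complete coding form</Text>''')
# Text before the first child is in .text (None here, since <cp/> comes first)
text = "" if textelem.text is None else textelem.text
# Text after each child is in that child's .tail (may also be None)
for child in textelem:
    text += "" if child.tail is None else child.tail
print(text)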
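
As a rough sketch of how this could plug into the loop from the question: the standard-library method Element.itertext() yields an element's text and the text/tail of its descendants in document order, so joining its output gives the same result as the manual concatenation above. The SAMPLE document and the shape_text helper below are made up for illustration; only the namespace and the <Text> line come from the question.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the question's code.
NS = {'visio': 'http://schemas.microsoft.com/visio/2003/core'}

# Hypothetical, trimmed-down stand-in for the real .vdx file.
SAMPLE = '''<VisioDocument xmlns="http://schemas.microsoft.com/visio/2003/core">
  <Pages><Page ID="0" NameU="Page-1"><Shapes>
    <Shape ID="21" Type="Shape" Name="Rectangle Fill:Marble.21">
      <Text><cp IX="0"/><pp IX="0"/>Attach Pos to invoice and complete coding form</Text>
    </Shape>
    <Shape ID="22" Type="Shape" Name="No text"/>
  </Shapes></Page></Pages>
</VisioDocument>'''


def shape_text(shape):
    """Return the shape's own text, or '' if it has no <Text> child."""
    # 'visio:Text' (no './/') matches direct children only, so a group
    # shape does not pick up the text of a shape nested inside it.
    text_element = shape.find('visio:Text', NS)
    if text_element is None:
        return ''
    # itertext() walks .text and every descendant's .text/.tail, skipping None.
    return ''.join(text_element.itertext()).strip()


root = ET.fromstring(SAMPLE)
texts = {s.get('ID'): shape_text(s) for s in root.findall('.//visio:Shape', NS)}
for shape_id, text in texts.items():
    print(f"Shape ID: {shape_id}, Text: {text!r}")
```

With the real file, ET.parse('Test VDX.vdx').getroot() would take the place of ET.fromstring(SAMPLE). Whether direct-child lookup is the right choice depends on how the drawing nests its shapes, so treat that part as an assumption.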
