How do I open an XML file with many branches?

4xrmg8kj  posted on 2022-12-21  in Other

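A single branch can contain many identical tags. How can I save all of them into a DataFrame?
I tried the code below, but repeated tags such as RowData are overwritten by the data that comes later. My goal is to keep the complete data.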

import pandas as pd
from xml.etree import ElementTree

path=str('data.xml')

with open(path, mode="r", encoding="utf-8") as f:
    xml_file = f.read()

items_delete=['<ObjectRelation>','</ObjectRelation>','<List>','</List>','<RowData>','</RowData>','<Kind>','</Kind>']

for item in items_delete:
    xml_file=xml_file.replace(item, '')

df = pd.read_xml(xml_file)

Sample of the initial data:

<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
                </Kind>
                <Nr>1</Nr>
                <EstablishDate>1997-02-24</EstablishDate>
                <Area>0.0127</Area>
                <Measure>ha</Measure>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
                </Kind>
                <Nr>3</Nr>
                <EstablishDate>1996-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>

    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>

                <Kind>
                    <KindId>7312060100</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
                </Kind>
                <Nr>5</Nr>
                <EstablishDate>1997-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>
</ItemList>
Answer #1 (hc2pp10m)

You can try parsing the document with BeautifulSoup:

import pandas as pd
from bs4 import BeautifulSoup

xml_doc = """\
<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
                </Kind>
                <Nr>1</Nr>
                <EstablishDate>1997-02-24</EstablishDate>
                <Area>0.0127</Area>
                <Measure>ha</Measure>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
                </Kind>
                <Nr>3</Nr>
                <EstablishDate>1996-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>

    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>

                <Kind>
                    <KindId>7312060100</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
                </Kind>
                <Nr>5</Nr>
                <EstablishDate>1997-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>
</ItemList>"""

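# The "xml" parser requires the lxml package to be installed.
soup = BeautifulSoup(xml_doc, "xml")

all_data = []
for data in soup.select("RowData"):
    d = {}
    # The owning <ObjectRelation> precedes each <RowData>, so search backwards for its fields.
    d["ObjectCadastreNr"] = data.find_previous("ObjectCadastreNr").text.strip()
    d["ObjectType"] = data.find_previous("ObjectType").text.strip()

    # Collect every non-empty text node, keyed by the tag that contains it.
    # (string=True replaces the deprecated text=True argument.)
    for t in data.find_all(string=True):
        if t.strip() == "":
            continue
        d[t.parent.name] = t.strip()

    all_data.append(d)

# One row per <RowData>; fields missing from a row become NaN.
df = pd.DataFrame(all_data)
print(df)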

Prints:

  ObjectCadastreNr ObjectType      KindId                                                                                       KindName Nr EstablishDate    Area Measure
0      01000180062     PARCEL  7312050201                     ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju  1    1997-02-24  0.0127      ha
1      01000180062     PARCEL  7312040200          ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju  3    1996-01-13     NaN     NaN
2      01000180062     PARCEL  7312060100  ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi  5    1997-01-13     NaN     NaN
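The same idea can probably be expressed with only the standard library. The following is a minimal sketch, not part of the answer above: it assumes the `ItemData` / `ObjectRelation` / `RowData` layout from the question, uses a shortened sample with placeholder `KindName` values, and introduces a hypothetical helper `leaf_values` for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same shape as the question's data.
xml_doc = """\
<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>kind A</KindName>
                </Kind>
                <Nr>1</Nr>
                <Area>0.0127</Area>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>kind B</KindName>
                </Kind>
                <Nr>3</Nr>
            </RowData>
        </List>
    </ItemData>
</ItemList>"""


def leaf_values(element):
    """Hypothetical helper: map each leaf tag under `element` to its stripped text."""
    return {
        node.tag: (node.text or "").strip()
        for node in element.iter()
        if len(node) == 0  # a leaf has no child elements
    }


root = ET.fromstring(xml_doc)

rows = []
for item in root.iter("ItemData"):
    # Fields shared by every RowData inside this ItemData.
    relation = item.find("ObjectRelation")
    shared = leaf_values(relation) if relation is not None else {}
    for row in item.iter("RowData"):
        # One dict per RowData, so repeated tags never overwrite each other.
        rows.append({**shared, **leaf_values(row)})

for r in rows:
    print(r)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per `RowData`, with fields that a row lacks (such as `Area` in the second row here) filled in as NaN. All values stay strings, which keeps the leading zero of `ObjectCadastreNr`.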
Answer #2 (q0qdq0h2)

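You can find the element and then remove it. Because this is XML, you need to find the parent element before you can remove a child. Below is an idea of how this works, with comments added in the code. Hope this helps!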

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')

# Removing an element also removes everything nested inside it.
items_delete = ['ObjectRelation', 'List', 'RowData', 'Kind']
for item in items_delete:
    # './/{item}/..' selects every parent that has a direct <item> child
    for parent in tree.findall(f'.//{item}/..'):
        # findall returns a list, so removing while looping is safe;
        # this drops all matching children, not just the first one
        for child in parent.findall(item):
            parent.remove(child)

tree.write('output.xml')
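The need to find the parent first comes from the fact that ElementTree elements keep no reference to their parent. A common alternative to the `/..` path is to build a child-to-parent lookup once. The sketch below illustrates that idiom on a small made-up sample; the name `parent_of` is an assumption for illustration, not something from the answer.

```python
import xml.etree.ElementTree as ET

# Small, hypothetical sample shaped like the question's data.
xml_doc = """\
<ItemList>
    <ItemData>
        <List>
            <RowData><Nr>1</Nr></RowData>
            <RowData><Nr>3</Nr></RowData>
        </List>
    </ItemData>
</ItemList>"""

root = ET.fromstring(xml_doc)

# Elements do not store their parent, so build the child -> parent lookup up front.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Collect the targets first so the tree is not modified while it is being iterated.
for row in list(root.iter("RowData")):
    parent_of[row].remove(row)

remaining = [el.tag for el in root.iter()]
print(remaining)  # ['ItemList', 'ItemData', 'List']
```

As in the answer, each removed `RowData` takes its nested `Nr` element with it, so this approach suits stripping unwanted branches rather than flattening them.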
