Parsing an XML document into a pandas DataFrame

Posted by 1wnzp6jl on 2023-01-24 in Other


I have an XML file that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>

What I am trying to do is extract the ID, Text and CreationDate columns into a pandas DataFrame. This is what I have tried:

import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)

root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)

print(df_xml)

But the output is:

None None None

How can I fix this?

pandas

Source: https://stackoverflow.com/questions/50774222/parsing-xml-document-into-a-pandas-dataframe

4 Answers


pes8fvy91#

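As advised in this solution by @unutbu, a gold-badge Python/pandas/numpy guru:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list, then pass that list to the DataFrame constructor in a single call outside of any loop. In fact, you can pass a nested list built with a list comprehension directly to the constructor: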

import xml.etree.ElementTree as et
import pandas as pd

path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']

root = et.parse(path)
rows = root.findall('.//row')

# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')] 
            for row in rows]

df_xml = pd.DataFrame(xml_data, columns=dfcols)

print(df_xml)

#   ID   Text             CreationDate
# 0  1  (...)  2011-08-30T21:15:28.063
# 1  2  (...)  2011-08-30T21:24:56.573
# 2  3  (...)                     None
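The "collect first, construct once" advice can also be sketched with a list of dicts, which keeps each value tied to its column name. This is only an illustration on in-memory data: the names `xml_text`, `wanted` and `records` are made up here and are not part of the answer. Note also that `DataFrame.append` was removed in pandas 2.0, so the loop from the question no longer runs on current pandas at all.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Sample data copied from the question (hypothetical variable name).
xml_text = """<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""

root = ET.fromstring(xml_text)
wanted = ['Id', 'Text', 'CreationDate']  # attribute names are case-sensitive

# Build every record in plain Python first; row.get() yields None
# when an attribute (here CreationDate on the third row) is missing.
records = [{name: row.get(name) for name in wanted} for row in root.iter('row')]

# A single DataFrame construction, outside of any loop.
df = pd.DataFrame(records, columns=wanted)
print(df)
```

All values arrive as strings because XML attributes are untyped; any conversion to numbers or dates would be a separate step.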


okxuctiv2#

Just a small change in your code:

ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')
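The reason this change works: in ElementTree, `find()` searches for a child element with the given tag, while `get()` reads an attribute of the element itself. The `<row>` elements in the question have attributes but no children, so `find()` returns None every time. A small sketch on a single row (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

row = ET.fromstring(
    '<row Id="1" Text="(...)" CreationDate="2011-08-30T21:15:28.063" />'
)

# find() looks for a child element named 'Id'; <row> has no children.
child = row.find('Id')

# get() reads the attribute named 'Id' from the element itself.
attr = row.get('Id')

# get() returns None for an attribute that is not present.
missing = row.get('UserId')

print(child, attr, missing)  # None 1 None
```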


0lvr5msh3#

Based on @Parfait's solution, I wrote my own version, which takes the columns as a parameter and returns a pandas DataFrame.
test.xml:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>

xml_to_pandas.py:

'''Xml to Pandas DataFrame Convertor.'''

import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd

def xml_to_pandas(root, columns, row_name):
  '''get xml.etree root, the columns and return Pandas DataFrame'''
  df = None
  try:

    rows = root.findall('.//{}'.format(row_name))

    xml_data = [[row.get(c) for c in columns] for row in rows]  # NESTED LIST

    df = pd.DataFrame(xml_data, columns=columns)
  except Exception as e:
    print('[xml_to_pandas] Exception: {}.'.format(e))

  return df

path = 'test.xml'
row_name = 'row'
columns = ['Id', 'Text', 'CreationDate']  # must match the XML attribute names exactly (case-sensitive)

root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)



vwoqyblh4#

Since pandas 1.3.0, there is a built-in function, pd.read_xml, that reads an XML document into a pandas DataFrame.

import pandas as pd

path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""

# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)

Simply calling read_xml turns the XML document from the OP into a DataFrame with one column per attribute.
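A minimal, self-contained sketch of that call on in-memory data, assuming pandas 1.3.0 or later. Two choices here are not from the answer: the string is wrapped in `io.StringIO`, because newer pandas releases deprecate passing literal XML text directly, and `parser="etree"` is used so the example runs without lxml (the default parser). The variable names are illustrative.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question (hypothetical variable name).
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""

# Each <row> under the root becomes one DataFrame row and each
# attribute becomes a column; rows missing an attribute get NaN.
df = pd.read_xml(StringIO(xml_text), parser="etree")

# Keep only the three columns the question asks for.
subset = df[["Id", "Text", "CreationDate"]]
print(subset)
```

Unlike the ElementTree-based answers, read_xml infers column types, so a column such as Id comes back numeric rather than as strings.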

