Working with XML data: extracting part of a string in Python

Posted by gj3fmq9x on 2023-06-25 in Python

I am working with text data stored as a string. I would like to know how to extract parts of the string, as follows:

data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. 
Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email>'

I want to extract the contents of every <d:VersionType>, <d:Name>, <d:Stage>, and <d:Modified m:type="Edm.DateTime"> element.
Expected output:

d:VersionType    d:Name        d:Stage         d:Modified m:type="Edm.DateTime"
3.0              XYZ Company   3. Contract     2020-07-30T12:15:04Z
2.0              XYZ Company   2. Contract     2020-02-15T18:15:60Z
1.0              XYZ Company   1. Exploratory  2019-07-15T10:20:04Z
Answer 1 (wqnecbli)

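Try Beautiful Soup, which can parse XML, HTML, and similar documents. Such files already have a defined structure, so you don't have to build regular expressions from scratch, which makes the job much easier.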

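from bs4 import BeautifulSoup

# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(data, 'xml')

# Collect the text of every <d:VersionType> element.
version_type = [item.text for item in soup.find_all('d:VersionType')]  # gives ['3.0', '2.0', '1.0']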

Replace d:VersionType with the other elements (d:Name, d:Stage, ...) whose contents you also want to extract.
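
If you want all four fields side by side, one row per record, here is a minimal sketch using only the standard library's xml.etree.ElementTree instead of Beautiful Soup. It assumes well-formed XML, which the string in the question is not (for example, <d:VersionType> is closed by </d:VersionLabel>), so SAMPLE below is a hypothetical, cleaned-up fragment modeled on that data. The <entry> wrapper element and the extract_rows helper are also assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modeled on the question's data.
# The <entry> wrapper is an assumption; it does not appear in the original string.
SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" '
    'xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata">'
    '<entry>'
    '<d:VersionType>3.0</d:VersionType><d:Name>XYZ Company</d:Name>'
    '<d:Stage>3. Contract Signed</d:Stage>'
    '<d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified>'
    '</entry>'
    '<entry>'
    '<d:VersionType>2.0</d:VersionType><d:Name>XYZ Company</d:Name>'
    '<d:Stage>2. Active Discussion</d:Stage>'
    '<d:Modified m:type="Edm.DateTime">2020-02-15T18:15:04Z</d:Modified>'
    '</entry>'
    '</feed>'
)

# Prefix-to-URI map used to resolve the "atom:" and "d:" prefixes in paths.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2008/09/dataservices",
}

FIELDS = ["VersionType", "Name", "Stage", "Modified"]


def extract_rows(xml_text):
    """Return one dict per <entry>, keyed by the field names in FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        # findtext returns the default ("") when a field is missing.
        row = {
            field: entry.findtext(f"d:{field}", default="", namespaces=NS)
            for field in FIELDS
        }
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print("\t".join(row[field] for field in FIELDS))
```

ElementTree raises a ParseError on malformed input, so the real string would need its mismatched tags repaired first, or a more lenient parser such as the Beautiful Soup approach above.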
