I am working with text data stored as a string. I would like to know how to extract parts of that string, as follows:
data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. 
Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email>'
I want to extract the contents of every <d:VersionType>, <d:Name>, <d:Stage> and <d:Modified m:type="Edm.DateTime"> element.
Expected output:
d:VersionType  d:Name       d:Stage         d:Modified m:type="Edm.DateTime"
3.0            XYZ Company  3. Contract     2020-07-30T12:15:04Z
2.0            XYZ Company  2. Contract     2020-02-15T18:15:60Z
1.0            XYZ Company  1. Exploratory  2019-07-15T10:20:04Z
1 Answer
wqnecbli #1
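Try Beautiful Soup, which can parse XML, HTML and other markup documents. Because such files already have a defined structure, you do not need to build a regular expression from scratch, which makes the job much easier.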
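A minimal sketch of that idea, assuming the `beautifulsoup4` package is installed. The `sample` string is a hypothetical, well-formed stand-in for the feed in the question. The real string is not well-formed (for example `<title<d:VersionType>`), so results on it may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the question's data (assumption).
sample = (
    "<feed>"
    "<d:VersionType>3.0</d:VersionType><d:Name>XYZ Company</d:Name>"
    "<d:VersionType>2.0</d:VersionType><d:Name>XYZ Company</d:Name>"
    "</feed>"
)

# "html.parser" ships with Python and tolerates imperfect markup.
# It lowercases tag names, so look up "d:versiontype", not "d:VersionType".
soup = BeautifulSoup(sample, "html.parser")

# Collect the text of every matching element, in document order.
versions = [tag.get_text(strip=True) for tag in soup.find_all("d:versiontype")]
print(versions)
```

The lookup name is lowercased only because of the parser chosen here; a different parser may keep the original case.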
Replace d:VersionType with the other elements (d:Name, d:Stage, ...) whose contents you also want to extract.
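To get one row per version, as in the expected output, the per-element lists can be zipped together. This is only a sketch under two assumptions: every record contains each of the four elements exactly once, and the input is well-formed. The `sample` string and the `column` helper are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the question's data (assumption).
sample = (
    "<feed>"
    "<d:VersionType>3.0</d:VersionType><d:Name>XYZ Company</d:Name>"
    "<d:Stage>3. Contract Signed</d:Stage>"
    '<d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified>'
    "<d:VersionType>2.0</d:VersionType><d:Name>XYZ Company</d:Name>"
    "<d:Stage>2. Active Discussion</d:Stage>"
    '<d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified>'
    "</feed>"
)

FIELDS = ["d:VersionType", "d:Name", "d:Stage", "d:Modified"]


def column(soup, name):
    """Return the text of every element called `name`, in document order."""
    # html.parser lowercases tag names, so lowercase the lookup too.
    return [tag.get_text(strip=True) for tag in soup.find_all(name.lower())]


soup = BeautifulSoup(sample, "html.parser")

# One list per field, then zip them into one tuple per record.
# zip stops at the shortest list, so a missing element would misalign rows.
rows = list(zip(*(column(soup, field) for field in FIELDS)))

print("\t".join(FIELDS))
for row in rows:
    print("\t".join(row))
```

If some records can lack an element, grouping by a parent element per record would be safer than zipping independent lists.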