Python: parsing SEC EDGAR XML form data with child nodes using BeautifulSoup

I am trying to use Beautiful Soup with its XML parser to extract an individual fund's holdings from the SEC's N-PORT-P/A form. A typical filing ([linked here][1]) looks like this:

<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
    <formData>
        <genInfo>
        ...
        </genInfo>
        <fundInfo>
        ...
        </fundInfo>
        <invstOrSecs>
            <invstOrSec>
                <name>ARROW BIDCO LLC</name>
                <lei>549300YHZN08M0H3O128</lei>
                <title>Arrow Bidco LLC</title>
                <cusip>042728AA3</cusip>
                <identifiers>
                    <isin value="US042728AA35"/>
                </identifiers>
                <balance>115000.000000000000</balance>
                <units>PA</units>
                <curCd>USD</curCd>
                <valUSD>114754.170000000000</valUSD>
                <pctVal>0.3967552449</pctVal>
                <payoffProfile>Long</payoffProfile>
                <assetCat>DBT</assetCat>
                <issuerCat>CORP</issuerCat>
                <invCountry>US</invCountry>
                <isRestrictedSec>N</isRestrictedSec>
                <fairValLevel>2</fairValLevel>
                <debtSec>
                    <maturityDt>2024-03-15</maturityDt>
                    <couponKind>Fixed</couponKind>
                    <annualizedRt>9.500000000000</annualizedRt>
                    <isDefault>N</isDefault>
                    <areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
                    <isPaidKind>N</isPaidKind>
                </debtSec>
                <securityLending>
                    <isCashCollateral>N</isCashCollateral>
                    <isNonCashCollateral>N</isNonCashCollateral>
                    <isLoanByFund>N</isLoanByFund>
                </securityLending>
            </invstOrSec>

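Arrow Bidco LLC is a bond held in the portfolio, and some of its characteristics are included in the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate over each security (invstOrSec) and collect each security's characteristics in a DataFrame. The code I am currently using is: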

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')

names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')

I then iterate over each list to build a DataFrame from the values in each row.

fixed_income_data = []
for i in range(len(names)):
    # one row per security; this assumes every list has an entry at index i
    row = [names[i].get_text(),lei[i].get_text(),
        title[i].get_text(),cusip[i].get_text(),
        balance[i].get_text(),units[i].get_text(),
        pctVal[i].get_text(),payoffProfile[i].get_text(),
        assetCat[i].get_text(),issuerCat[i].get_text(),
        invCountry[i].get_text(),maturityDt[i].get_text(),
        couponKind[i].get_text(),annualizedRt[i].get_text()
        ]
    fixed_income_data.append(row)

# the values are text (names, dates, codes), so no dtype is forced here
fixed_income_df = pd.DataFrame(fixed_income_data,columns = ['name',
                         'lei',
                         'title',
                         'cusip',
                         'balance',
                         'units',
                         'pctVal',
                         'payoffProfile',
                         'assetCat',
                         'issuerCat',
                         'invCountry',
                         'maturityDt',
                         'couponKind',
                         'annualizedRt'
                         ])

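This works well when all the information is present, but often one variable is unaccounted for. Part of the form may be blank, or the issuer category may not have been filled in correctly, which leads to an IndexError. This portfolio has 127 securities that I was able to parse, but the annualized rate might be missing for a single security, and that breaks the ability to build the DataFrame cleanly.
In addition, for portfolios that hold both fixed income and equity securities, the equity securities do not return any information for the debtSec child node. Is there a way to clean this data up in the simplest way possible while iterating over it? Even adding "NaN" for the debtSec children that the equity securities do not have would be a valid answer. Any help would be much appreciated!

[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml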

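In my opinion, this is the best way to handle the problem. In general, EDGAR filings are notoriously hard to parse, so the approach below may or may not work on other filings, even ones from the same filer.
Since this is an XML file, you should make things easier for yourself by using an XML parser and XPath. And since you want to end up with a DataFrame, the most appropriate tool is the pandas read_xml() method.
Because the XML is nested, you need to create two separate DataFrames and concatenate them (maybe someone else has a better idea of how to handle that). Finally, although read_xml() can read directly from a URL, EDGAR requires a user agent in this case, which means you also need the requests library.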
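One detail worth knowing before the full code: the filing declares a default XML namespace (the xmlns attribute on edgarSubmission), so XPath lookups have to be namespace-qualified. Here is a minimal sketch of that effect using only the standard library's xml.etree.ElementTree. The SAMPLE string is a made-up fragment for illustration, and the "nport" prefix is an arbitrary name chosen here, not something the filing defines.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the filing's default namespace (illustration only).
SAMPLE = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec><name>ARROW BIDCO LLC</name></invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

root = ET.fromstring(SAMPLE)

# Map an arbitrary prefix to the namespace URI declared in the document.
ns = {"nport": "http://www.sec.gov/edgar/nport"}

# Without the prefix, the path does not match the namespaced elements.
unqualified = root.findall(".//invstOrSec")

# With the prefix and the mapping, the same element is found.
qualified = root.findall(".//nport:invstOrSec", ns)

print(len(unqualified), len(qualified))
```

The same idea is what the namespaces argument to read_xml() expresses: the prefix in the XPath expression is resolved through the mapping you pass in.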
So, all together:

#import required libraries
from io import BytesIO

import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
#set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)

#define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']

#the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}

#create the first df, for the securities which are debt instruments
#(the response bytes are wrapped in BytesIO because recent pandas versions deprecate passing a literal XML string)
invest = pd.read_xml(BytesIO(req.content),xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)

#create the 2nd df, for the debt details:
debt = pd.read_xml(BytesIO(req.content),xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]

#finally, concatenate the two into one df:
df = pd.concat([invest, debt], axis=1)
print(df)

This outputs 126 bonds (pardon the formatting):

   name                   lei                   title                  cusip      balance    units  pctVal    payoffProfile  assetCat  issuerCat  invCountry  maturityDt  couponKind  annualizedRt
0  ARROW BIDCO LLC        549300YHZN08M0H3O128  Arrow Bidco LLC        042728AA3  115000.00  PA     0.396755  Long           DBT       CORP       US          2024-03-15  Fixed       9.50000
1  CD&R SMOKEY BUYER INC  NaN                   CD&R Smokey Buyer Inc  12510CAA9  165000.00  PA     0.505585  Long           DBT       CORP       US          2025-07-15  Fixed       6.75000

You can then work with the final df, add or drop columns, and so on.
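For portfolios that mix debt and equity, or where a single field is absent, a per-security loop may be easier to reason about than two concatenated DataFrames. The following is only a sketch of that idea, using the standard library's xml.etree.ElementTree rather than read_xml(). The helper name holdings_to_rows, the field lists, and the SAMPLE fragment (including the "EXAMPLE EQUITY CO" holding) are all made up for illustration; the element names come from the filing shown in the question.

```python
import xml.etree.ElementTree as ET

NS = {"nport": "http://www.sec.gov/edgar/nport"}

# Fields that sit directly under <invstOrSec>, and fields nested under <debtSec>.
TOP_FIELDS = ["name", "lei", "title", "cusip", "balance", "units", "pctVal",
              "payoffProfile", "assetCat", "issuerCat", "invCountry"]
DEBT_FIELDS = ["maturityDt", "couponKind", "annualizedRt"]


def holdings_to_rows(xml_bytes):
    """Return one dict per <invstOrSec>; elements that are absent become None."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for sec in root.iterfind(".//nport:invstOrSec", NS):
        # Look each field up inside this security only, so rows cannot drift out of step.
        row = {f: sec.findtext(f"nport:{f}", default=None, namespaces=NS)
               for f in TOP_FIELDS}
        for f in DEBT_FIELDS:
            # Equity holdings have no <debtSec>, so these simply come back as None.
            row[f] = sec.findtext(f"nport:debtSec/nport:{f}", default=None,
                                  namespaces=NS)
        rows.append(row)
    return rows


# Made-up sample: one debt holding without annualizedRt, one holding without debtSec.
SAMPLE = b"""<edgarSubmission xmlns="http://www.sec.gov/edgar/nport"><formData><invstOrSecs>
<invstOrSec><name>ARROW BIDCO LLC</name><cusip>042728AA3</cusip>
  <debtSec><maturityDt>2024-03-15</maturityDt><couponKind>Fixed</couponKind></debtSec>
</invstOrSec>
<invstOrSec><name>EXAMPLE EQUITY CO</name></invstOrSec>
</invstOrSecs></formData></edgarSubmission>"""

rows = holdings_to_rows(SAMPLE)
print(rows[0]["maturityDt"], rows[0]["annualizedRt"], rows[1]["maturityDt"])
```

With the real filing, the input would be req.content from the request above, and passing the result to pd.DataFrame(rows) should give one row per holding, with missing values wherever an element was absent. Because every lookup is scoped to a single invstOrSec, a missing annualizedRt on one bond no longer shifts the values of the bonds after it.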
