Parse SEC EDGAR XML Form Data with child nodes using BeautifulSoup

Question

I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using BeautifulSoup and XML. A typical submission, outlined below and linked here [1], looks like:

<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
    <formData>
        <genInfo>
        ...
        </genInfo>
        <fundInfo>
        ...
        </fundInfo>
        <invstOrSecs>
            <invstOrSec>
                <name>ARROW BIDCO LLC</name>
                <lei>549300YHZN08M0H3O128</lei>
                <title>Arrow Bidco LLC</title>
                <cusip>042728AA3</cusip>
                <identifiers>
                    <isin value="US042728AA35"/>
                </identifiers>
                <balance>115000.000000000000</balance>
                <units>PA</units>
                <curCd>USD</curCd>
                <valUSD>114754.170000000000</valUSD>
                <pctVal>0.3967552449</pctVal>
                <payoffProfile>Long</payoffProfile>
                <assetCat>DBT</assetCat>
                <issuerCat>CORP</issuerCat>
                <invCountry>US</invCountry>
                <isRestrictedSec>N</isRestrictedSec>
                <fairValLevel>2</fairValLevel>
                <debtSec>
                    <maturityDt>2024-03-15</maturityDt>
                    <couponKind>Fixed</couponKind>
                    <annualizedRt>9.500000000000</annualizedRt>
                    <isDefault>N</isDefault>
                    <areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
                    <isPaidKind>N</isPaidKind>
                </debtSec>
                <securityLending>
                    <isCashCollateral>N</isCashCollateral>
                    <isNonCashCollateral>N</isNonCashCollateral>
                    <isLoanByFund>N</isLoanByFund>
                </securityLending>
            </invstOrSec>
        </invstOrSecs>
    </formData>
</edgarSubmission>

Arrow Bidco LLC is a bond within the portfolio, with some of its characteristics included in the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (invstOrSec) and collect the characteristics of each security into a dataframe.
The code I am currently using is:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')

names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')

I then iterate through each list to create a dataframe based on the values in each row:

fixed_income_data = []
for i in range(0, len(names)):
    rows = [names[i].get_text(), lei[i].get_text(),
        title[i].get_text(), cusip[i].get_text(),
        balance[i].get_text(), units[i].get_text(),
        pctVal[i].get_text(), payoffProfile[i].get_text(),
        assetCat[i].get_text(), issuerCat[i].get_text(),
        invCountry[i].get_text(), maturityDt[i].get_text(),
        couponKind[i].get_text(), annualizedRt[i].get_text()
        ]
    fixed_income_data.append(rows)

fixed_income_df = pd.DataFrame(fixed_income_data, columns = ['name',
                         'lei',
                         'title',
                         'cusip',
                         'balance',
                         'units',
                         'pctVal',
                         'payoffProfile',
                         'assetCat',
                         'issuerCat',
                         'invCountry',
                         'maturityDt',
                         'couponKind',
                         'annualizedRt'
                         ], dtype = float)

This works fine when all pieces of information are included, but often there is one variable that is not accounted for. A piece of the form might be blank, or an issuer category might not have been filled out, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but a single security might be missing an annualized return, throwing off the ability to neatly create a dataframe.

Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSec children. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't reference would be a valid response. Any help would be much appreciated!
[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml

Answer 1

Score: 1

Here is the best way, in my opinion, to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even from the same filer.

To make it easier on yourself, since this is an XML file, you should use an XML parser and XPath. Given that you're looking to create a dataframe, the most appropriate tool would be the pandas read_xml() method.

Because the XML is nested, you will need to create two different dataframes and concatenate them (maybe others will have a better idea on how to approach it). And finally, although read_xml() can read directly from a URL, EDGAR requires a user-agent header, which means you also need to use the requests library.

So, all together:

#import required libraries
import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
#set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}    
req = requests.get(url, headers=headers)

#define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']

#the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}

#create the first df, for the securities which are debt instruments
invest = pd.read_xml(req.text,xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)

#create the 2nd df, for the debt details:
debt = pd.read_xml(req.text,xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]

#finally, concatenate the two into one df:
pd.concat([invest, debt], axis=1)

This should output your 126 debt securities (pardon the formatting):

	name	lei	title	cusip	balance	units	pctVal	payoffProfile	assetCat	issuerCat	invCountry	maturityDt	couponKind	annualizedRt
0	ARROW BIDCO LLC	549300YHZN08M0H3O128	Arrow Bidco LLC	042728AA3	115000.00	PA	0.396755	Long	DBT	CORP	US	2024-03-15	Fixed	9.50000
1	CD&R SMOKEY BUYER INC	NaN	CD&R Smokey Buyer Inc	12510CAA9	165000.00	PA	0.505585	Long	DBT	CORP	US	2025-07-15	Fixed	6.75000
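
Note that read_xml() does not parse dates by default, so maturityDt in this result is still plain text. If you need real dtypes downstream, a minimal sketch (assuming the concatenated frame above is assigned to a variable, here called holdings) could look like:

holdings = pd.concat([invest, debt], axis=1)

# read_xml() leaves maturityDt as a string; convert it to a proper datetime
holdings["maturityDt"] = pd.to_datetime(holdings["maturityDt"])

# balance, pctVal and annualizedRt are usually inferred as floats already,
# but pd.to_numeric() is a harmless safeguard if they come back as strings
for col in ["balance", "pctVal", "annualizedRt"]:
    holdings[col] = pd.to_numeric(holdings[col])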

You can then play with the final df, add or drop columns, etc.
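
The question also asks about portfolios that hold both fixed income and equity securities, where the debtSec children simply do not exist. One way to get the "NaN" behaviour described there, while staying with BeautifulSoup, is to read each invstOrSec node on its own and pull every field defensively. The following is only a rough sketch of that idea (the field list and variable names are mine, not part of the approach above):

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml"
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "xml")

# fields to pull from every holding; the debt-only ones are simply absent for equities
fields = ["name", "lei", "title", "cusip", "balance", "units", "pctVal",
          "payoffProfile", "assetCat", "issuerCat", "invCountry",
          "maturityDt", "couponKind", "annualizedRt"]

rows = []
for sec in soup.find_all("invstOrSec"):
    row = {}
    for field in fields:
        tag = sec.find(field)          # find() returns None when the child tag is missing
        row[field] = tag.get_text() if tag else np.nan
    rows.append(row)

all_holdings = pd.DataFrame(rows)

Every holding becomes one row, and equity positions simply carry NaN in maturityDt, couponKind and annualizedRt, which you can then compare or combine with the read_xml() result above.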
