BeautifulSoup: How to pass a variable into soup.find({variable})
Question
I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't figure out. (I code a few days a year; it is not my main role.)
File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz
I download, unzip and then create the soup from the file using lxml.
from bs4 import BeautifulSoup

with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')
This next step is where I run into trouble. I have a list of Firm CRD numbers (public numbers that identify each firm). I want to find each one in the XML file and then pull various data points from the child tags.
If I write it statically such as:
soup.find(firmcrdnb="5639055").parent
This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.
I feel like this should be simple. I appreciate any help you can provide.
Here is my current attempt:
searchstring = 'firmcrdnb="' + Firm_CRD + '"'
select_firm = soup.find(searchstring).parent
I have tried other similar setups and reviewed other Stack Overflow questions such as https://stackoverflow.com/questions/21352168/is-it-possible-to-pass-a-variable-to-beautifulsoup-soup-find but I am still not quite getting it.
Here is an example of the XML.
<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
<MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
<MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999"/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
<NoticeFiled>
P.S. If anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I installed and imported chardet per the Beautiful Soup documentation, but that hasn't seemed to help.
Answer 1
Score: 1
I'm not sure where I got turned around, but my static example did not in fact work.
The tag is "info" and the attribute is "firmcrdnb".
The version that works is:
select_firm = soup.find("info", {"firmcrdnb": Firm_CRD}).parent
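A minimal sketch of why the attempt in the question fails while the form above works: the first positional argument of find is treated as a tag name, so a string such as 'firmcrdnb="9999"' is looked up as a tag and nothing matches, which makes the later .parent access fail. The sample data below is a cut-down, made-up version of the XML in the question, the helper name find_firm is hypothetical, and Python's built-in "html.parser" is an assumption chosen only to keep the example dependency-light. Like the 'lxml' parser used in the question, it lowercases tag and attribute names.

```python
from bs4 import BeautifulSoup

# Made-up sample modelled on the XML in the question.
SAMPLE = """<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" BusNm="XXXX INC."/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
</Firm>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="8888" BusNm="YYYY LLC"/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
</Firm>
</Firms>
</IAPDFirmSECReport>"""

# Assumption: "html.parser" instead of "lxml"; both lowercase names.
soup = BeautifulSoup(SAMPLE, "html.parser")


def find_firm(soup, firm_crd):
    """Return the <firm> block whose <info> has the given CRD, or None."""
    # Tag name first, then a dict mapping attribute name to the variable.
    # str() guards against CRD numbers that were read in as integers.
    info = soup.find("info", {"firmcrdnb": str(firm_crd)})
    return info.parent if info is not None else None


# The string from the question is treated as a tag name, so nothing matches.
missing = soup.find('firmcrdnb="9999"')

firm = find_firm(soup, "8888")
business_name = firm.find("info")["busnm"]
status = firm.find("rgstn")["st"]
```

Returning None for an unknown CRD, rather than calling .parent directly, avoids an AttributeError when a number in the list is not present in the file.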
Answer 2
Score: 0
Try using:
select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent
Answer 3
Score: 0
Maybe I'm missing something. If it works statically, have you tried something such as:
list_of_crds = ["11111", "22222", "33333"]
for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...
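On the speed question in the post: each soup.find call walks the tree again, and BeautifulSoup keeps the whole file in memory. One alternative, sketched here with the standard library's xml.etree.ElementTree.iterparse rather than BeautifulSoup, is to stream the file once and collect every wanted firm in a single pass. Unlike the 'lxml' HTML parser, an XML parser keeps the original case (Firm, Info, FirmCrdNb). The function name find_firms and the sample data are hypothetical, and this has not been tried against the full SEC feed.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample modelled on the XML in the question.
SAMPLE = b"""<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" BusNm="XXXX INC."/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
</Firm>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="8888" BusNm="YYYY LLC"/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
</Firm>
</Firms>
</IAPDFirmSECReport>"""


def find_firms(source, wanted_crds):
    """Stream the XML once; return {crd: <Info> attributes} for wanted firms."""
    wanted = {str(crd) for crd in wanted_crds}
    found = {}
    # "end" events fire once an element and all its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Firm":
            continue
        info = elem.find("Info")
        if info is not None and info.get("FirmCrdNb") in wanted:
            found[info.get("FirmCrdNb")] = dict(info.attrib)
        # Drop this firm's children so memory use stays roughly flat.
        elem.clear()
    return found


# source may be a file name or a binary file object.
result = find_firms(io.BytesIO(SAMPLE), ["8888", 7777])
```

The set lookup means the cost of the pass does not grow with the length of the CRD list, at the price of copying out the needed attributes before each firm is cleared.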