BeautifulSoup: How to pass a variable into soup.find(variable)

Question

I am using Beautiful Soup to search an XML file provided by the SEC (this is public data). Beautiful Soup works very well for referencing tags, but I cannot seem to pass a variable to its find function. Static content is fine. I think there is a gap in my Python understanding that I can't figure out. (I code a few days a year; it is not my main role.)

File:
https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_02_08_2023.xml.gz

I download, unzip and then create the soup from the file using lxml.

from bs4 import BeautifulSoup

with open(Firm_Download_name, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

Next is where I am running into trouble. I have a list of Firm CRD numbers (these are public numbers identifying each firm) that I look up in the XML file, and I then pull various data points out of the child tags.

If I write it statically such as:

soup.find(firmcrdnb="5639055").parent

This works perfectly, but I want to loop through a list of CRD numbers and pull out a different block each time. I cannot figure out how to pass a variable to the soup.find function.

I feel like this should be simple. I appreciate any help you can provide.

Here is my current attempt:

searchstring = 'firmcrdnb="' + Firm_CRD + '"'
select_firm = soup.find(searchstring).parent

I have tried other similar setups and reviewed other Stack Overflow questions such as https://stackoverflow.com/questions/21352168/is-it-possible-to-pass-a-variable-to-beautifulsoup-soup-find but I'm just not quite getting it.

Here is an example of the XML.

<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info SECRgnCD="MIRO" FirmCrdNb="9999" SECNb="999-99999" BusNm="XXXX INC." LegalNm="XXX INC" UmbrRgstn="N"/>
<MainAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999" PhNb="999-999-9999" FaxNb="999-999-9999"/>
<MailingAddr Strt1="9999 XXXX" Strt2="XXXX" City="XXX" State="FL" Cntry="XXX" PostlCd="999999"/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
<NoticeFiled>

P.S. If anyone has ideas on how to improve the speed of the search on this large file, I'd appreciate that too. I get messages such as "pydevd warning: Computing repr of soup (BeautifulSoup) was slow (took 43.83s)". I did install and import chardet per the Beautiful Soup documentation, but that hasn't seemed to help.
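
On the speed question, one option is to skip building a whole soup and stream the file with the standard library's xml.etree.ElementTree.iterparse, keeping only the Firm blocks whose CRD is wanted. This is only a sketch and has not been run against the full SEC feed. The helper name find_firms and the inline sample (including the second firm) are made up for illustration. Note that ElementTree keeps the original attribute case (FirmCrdNb), whereas the lxml HTML parser used above lowercases names.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the feed, modelled on the sample in the question.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<IAPDFirmSECReport GenOn="2017-09-30">
<Firms>
<Firm>
<Info FirmCrdNb="9999" BusNm="XXXX INC."/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="9999-01-01"/>
</Firm>
<Firm>
<Info FirmCrdNb="8888" BusNm="YYYY LLC"/>
</Firm>
</Firms>
</IAPDFirmSECReport>"""


def find_firms(xml_source, wanted_crds):
    """Stream xml_source and return {crd: {child_tag: attributes}} for wanted firms."""
    wanted = {str(crd) for crd in wanted_crds}
    found = {}
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Firm":
            continue
        info = elem.find("Info")
        if info is not None and info.get("FirmCrdNb") in wanted:
            # If a firm repeats a child tag, the last one wins in this sketch.
            found[info.get("FirmCrdNb")] = {
                child.tag: dict(child.attrib) for child in elem
            }
        # Drop the firm's children so memory does not grow with the file.
        elem.clear()
    return found


demo = find_firms(io.BytesIO(SAMPLE), ["9999"])
print(demo["9999"]["Info"]["BusNm"])
```

For the real download, the same function should accept the compressed file directly, for example by passing the object returned by gzip.open(path, "rb") as xml_source, which avoids the separate unzip step.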


Answer 1

Score: 1

I'm not sure where I got turned around, but my static version did not in fact work.

The tag is "info" and the attribute is "firmcrdnb".

The answer that worked was:

select_firm = soup.find("info", {"firmcrdnb": Firm_CRD}).parent
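
To see why the string-building attempt from the question finds nothing while the attribute lookup does, here is a small self-contained sketch. It assumes bs4 is installed and swaps in Python's built-in "html.parser" so that it runs without lxml. Like the lxml HTML parser, it lowercases tag and attribute names. The sample data and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-firm sample shaped like the XML in the question.
SAMPLE = """<Firms>
<Firm><Info FirmCrdNb="9999" BusNm="XXXX INC."/><Rgstn St="APPROVED"/></Firm>
<Firm><Info FirmCrdNb="8888" BusNm="YYYY LLC"/><Rgstn St="APPROVED"/></Firm>
</Firms>"""

soup = BeautifulSoup(SAMPLE, "html.parser")
firm_crd = "8888"

# A single positional string is treated as a TAG NAME, so this looks for a
# tag literally called firmcrdnb="8888" and returns None.
as_tag_name = soup.find('firmcrdnb="' + firm_crd + '"')

# Passing the attribute as a keyword argument or in a dict lets the value
# be a variable.
by_keyword = soup.find("info", firmcrdnb=firm_crd)
by_attrs = soup.find("info", attrs={"firmcrdnb": firm_crd})

print(as_tag_name)
print(by_attrs["busnm"], by_attrs.parent.name)
```

Since find returns None when nothing matches, calling .parent on the result of the string-based attempt raises AttributeError, which is the symptom to expect from the original code.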

Answer 2

Score: 0

Try using:

select_firm = soup.find(attrs={'firmcrdnb': str(Firm_CRD)}).parent

Answer 3

Score: 0

Maybe I'm missing something. If it works statically, have you tried something such as:

list_of_crds = ["11111", "22222", "33333"]

for crd in list_of_crds:
    result = soup.find(firmcrdnb=crd).parent
    ...
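
A loop like this rescans the whole tree for every CRD and fails with AttributeError when a number is missing, because find returns None. One possible variation, sketched here with made-up sample data and the built-in "html.parser" in place of lxml, is to index every info tag once and then look each CRD up in a dictionary. The names firm_by_crd and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the XML in the question.
SAMPLE = """<Firms>
<Firm><Info FirmCrdNb="9999" BusNm="XXXX INC."/></Firm>
<Firm><Info FirmCrdNb="8888" BusNm="YYYY LLC"/></Firm>
</Firms>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# One pass over the tree: map each CRD string to its enclosing <firm> tag.
# attrs={"firmcrdnb": True} matches any <info> tag that has the attribute.
firm_by_crd = {
    info["firmcrdnb"]: info.parent
    for info in soup.find_all("info", attrs={"firmcrdnb": True})
}

list_of_crds = ["9999", "12345"]
names = {}
for crd in list_of_crds:
    firm = firm_by_crd.get(str(crd))
    if firm is None:
        # CRD not present in the file; record that instead of crashing.
        names[crd] = None
        continue
    names[crd] = firm.find("info")["busnm"]

print(names)
```

The dictionary costs one full scan up front, after which each lookup is independent of the file size, which may matter with a feed as large as the one described in the question.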

huangapple
  • Posted on February 9, 2023 at 03:23:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75390782.html