How to scrape the id attribute from HTML with bs4?

Question

I'm trying to scrape a specific piece of data from HTML.

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''

From this piece of HTML I'm trying to scrape the class and id attributes.

I've tried

from bs4 import BeautifulSoup as soup
for pr in soup.find_all("p"):
    print(pr['class'], pr['id'])

but I get a KeyError on id.

Answer 1

Score: 1

Your code calls find_all() without first creating a BeautifulSoup instance: soup is only an alias for the class, so the HTML is never parsed. Create the instance from your HTML first:

from bs4 import BeautifulSoup

html_data = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">'''

# Parse the markup, then search the resulting tree
soup = BeautifulSoup(html_data, 'html.parser')

for pr in soup.find_all("p"):
    print(pr["class"], pr["id"])
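Subscript access such as pr["id"] is only safe when every matched tag actually carries that attribute. As a rough sketch, assuming a trimmed version of the question's markup (shortened class lists and added closing tags are my own simplification), the search can be restricted to tags that have an id by passing id=True to find_all():

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's markup (an assumption for illustration)
html = '''<p class="SoccerPlayer SoccerPlayer-1" id="12-8-3"></p>
<p class="SoccerPlayer SoccerPlayer-514" data-id="12-9-229"></p>'''

soup = BeautifulSoup(html, "html.parser")

# id=True matches any <p> that has an id attribute, whatever its value,
# so pr["id"] cannot raise KeyError inside this comprehension.
with_id = [(pr["class"], pr["id"]) for pr in soup.find_all("p", id=True)]
print(with_id)
```

The second paragraph is skipped here because it only has a data-id attribute.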

Answer 2

Score: 1

The issue here is that the second element does not have an id attribute, only a data-id. Indexing with pr['id'] therefore raises a KeyError, so you have to check for data-id instead, or use .get() whenever you are not sure an attribute is defined:

pr.get('id')

Example

from bs4 import BeautifulSoup

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, 'html.parser')

for pr in soup.find_all("p"):
    # .get() returns None instead of raising KeyError when id is missing
    print(pr["class"], pr.get('id'))

Output

['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] None

An ugly alternative is to iterate over the attributes and pick the first one whose name contains id (this raises an IndexError if no attribute matches):

print(pr["class"], pr.get([a for a in pr.attrs if 'id' in a][0]))

->
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] 12-9-229
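If both attribute names are known in advance, an explicit fallback may be easier to read than the substring search. The following is a minimal sketch, not part of the original answer: element_id is a hypothetical helper name, and the markup is a trimmed version of the question's sample with closing tags added.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's markup (an assumption for illustration)
html = '''<p class="SoccerPlayer SoccerPlayer-1" id="12-8-3"></p>
<p class="SoccerPlayer SoccerPlayer-514" data-id="12-9-229"></p>'''

soup = BeautifulSoup(html, "html.parser")


def element_id(tag):
    """Hypothetical helper: return id, or data-id when id is absent, else None."""
    return tag.get("id", tag.get("data-id"))


ids = [element_id(pr) for pr in soup.find_all("p")]
print(ids)  # ['12-8-3', '12-9-229']
```

Naming the attributes explicitly also avoids accidental matches: a substring test such as 'id' in a would accept an attribute named width as well.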

huangapple
  • Posted on February 19, 2023, 21:47:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75500572.html