February 19, 2023, 21:47:31

How to scrape id attribute from HTML with bs4?

Question

I'm trying to scrape a specific piece of data from HTML.

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''

From this piece of HTML I'm trying to scrape the class and id attributes.

I've tried

from bs4 import BeautifulSoup as soup
for pr in soup.find_all("p"):
    print(pr['class'], pr['id'])

but I get a KeyError on id.

Answer 1

Score: 1

Your code calls find_all() without first creating a BeautifulSoup instance from the HTML:

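from bs4 import BeautifulSoup

# The opening tag is closed with ">" so that it is valid HTML
html_data = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">'''
# Create the BeautifulSoup instance first, then call find_all() on it
soup = BeautifulSoup(html_data, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr["id"])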

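Answer 2

Score: 1

The issue is that the second element has no id attribute, only a data-id, so you have to check for that attribute or use .get() when you are not sure an attribute is defined: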

pr.get('id')

Example

from bs4 import BeautifulSoup 
html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr.get('id'))

Output

['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] None
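
If elements without an id should be skipped instead of printed as None, find_all() can also filter on attribute presence. A minimal sketch, assuming the same HTML as in the example; passing True as the attribute value matches any element that has the attribute, and attrs is used for data-id because the hyphen makes it an invalid keyword argument name:

```python
from bs4 import BeautifulSoup

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, "html.parser")

# Only <p> elements that actually carry an id attribute
with_id = [p["id"] for p in soup.find_all("p", id=True)]

# Only <p> elements that carry a data-id attribute
with_data_id = [p["data-id"] for p in soup.find_all("p", attrs={"data-id": True})]

print(with_id)
print(with_data_id)
```

Because the filtered results are guaranteed to have the attribute, indexing with p["id"] is safe here and cannot raise a KeyError.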

An ugly alternative is to iterate over the attributes and pick any attribute whose name contains id:

print(pr["class"], pr.get([a for a in pr.attrs if 'id' in a][0]))
->
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] 12-9-229
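
Note that the comprehension above matches any attribute whose name merely contains the letters id (width would match too) and raises an IndexError when nothing matches. A more explicit fallback could look like the sketch below; the helper name element_id is hypothetical and not part of bs4, and it assumes that only id and data-id need to be considered:

```python
from bs4 import BeautifulSoup

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team  Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, "html.parser")


def element_id(tag):
    """Hypothetical helper: return id if present, otherwise data-id (or None)."""
    if tag.has_attr("id"):
        return tag["id"]
    return tag.get("data-id")


ids = [element_id(pr) for pr in soup.find_all("p")]
print(ids)
```

For elements that have neither attribute, element_id returns None instead of raising.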
