How to scrape id attribute from HTML with bs4?
Question
from bs4 import BeautifulSoup as soup
html = '''
<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">
'''
parsed_html = soup(html, 'html.parser')
for pr in parsed_html.find_all('p'):
    print(pr.get('class'), pr.get('id'))
I'm trying to scrape a specific piece of data from HTML.
html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
From this piece of HTML I'm attempting to scrape the class and id attributes.
I've tried
from bs4 import BeautifulSoup as soup
for pr in soup.find_all("p"):
    print(pr['class'], pr['id'])
but I get a KeyError on id.
Answer 1
Score: 1
Your code calls the find_all() method without first creating a BeautifulSoup instance from the HTML:
from bs4 import BeautifulSoup
html_data = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">'''
# Parse the markup first; find_all() is a method of the parsed document
soup = BeautifulSoup(html_data, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr["id"])
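Note that the snippet above only contains the first element. With both elements from the question, subscript access would presumably still fail on the second one, because a Tag looks attributes up like a dictionary. A minimal sketch of that behaviour, using shortened stand-in markup (the class values "a" and "b" are placeholders, not from the question):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup: the second <p> has
# data-id but no id attribute.
html = '<p class="a" id="12-8-3"></p><p class="b" data-id="12-9-229"></p>'
soup = BeautifulSoup(html, "html.parser")

results = []
for pr in soup.find_all("p"):
    try:
        # Subscript access raises KeyError when the attribute is absent
        results.append(pr["id"])
    except KeyError:
        results.append("missing")

print(results)
```

So initializing BeautifulSoup is necessary, but on its own it does not explain the KeyError on id.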
Answer 2
Score: 1
The issue here is that the second element does not have an id attribute, only a data-id. You either have to check for that attribute, or use .get(), which returns None instead of raising KeyError when an attribute is not defined:
pr.get('id')
Example
from bs4 import BeautifulSoup
html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3">
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229">'''
soup = BeautifulSoup(html, 'html.parser')
for pr in soup.find_all("p"):
    print(pr["class"], pr.get('id'))
Output
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] None
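If the goal is to get an identifier for every element, one option is to fall back to data-id when id is missing. This is only a sketch: element_id is a hypothetical helper name, and the closing </p> tags were added here to keep the sample markup tidy.

```python
from bs4 import BeautifulSoup

html = '''<p transform="translate(3,15)" class="SoccerPlayer SoccerPlayer-1 Soccer-Team Outcome-Positive" id="12-8-3"></p>
<p transform="translate(89,20)" class="SoccerPlayer SoccerPlayer-514 Soccer-Team Outcome-Positive" data-id="12-9-229"></p>'''
soup = BeautifulSoup(html, "html.parser")


def element_id(tag):
    """Hypothetical helper: prefer id, fall back to data-id, else None."""
    # .get() returns None for a missing attribute, so `or` moves on
    # to the next candidate instead of raising KeyError.
    return tag.get("id") or tag.get("data-id")


ids = [element_id(pr) for pr in soup.find_all("p")]
print(ids)
```

Naming the two attributes explicitly keeps the lookup predictable, at the cost of having to know in advance which attribute names can occur.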
An ugly alternative is to iterate over the attributes and pick the first one whose name contains id:
print(pr["class"], pr.get([a for a in pr.attrs if 'id' in a][0]))
->
['SoccerPlayer', 'SoccerPlayer-1', 'Soccer-Team', 'Outcome-Positive'] 12-8-3
['SoccerPlayer', 'SoccerPlayer-514', 'Soccer-Team', 'Outcome-Positive'] 12-9-229
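Two caveats with that one-liner: it raises IndexError when an element has no matching attribute at all, and the substring test 'id' in a also matches names such as width or hidden. A slightly more defensive sketch, assuming that only id itself or names ending in -id should count (first_id_like is a hypothetical helper, and the third element is made up to show the no-match case):

```python
from bs4 import BeautifulSoup

# Third element added for illustration: it has no id-like attribute.
html = '<p id="12-8-3"></p><p data-id="12-9-229"></p><p class="x"></p>'
soup = BeautifulSoup(html, "html.parser")


def first_id_like(tag):
    """Hypothetical helper: value of the first attribute named 'id' or ending in '-id'."""
    # next() with a default avoids the IndexError that [0] would raise
    # when no attribute name matches.
    name = next((a for a in tag.attrs if a == "id" or a.endswith("-id")), None)
    return tag.get(name) if name is not None else None


values = [first_id_like(pr) for pr in soup.find_all("p")]
print(values)
```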