Scrape a sub-attribute with bs4 in Python?
Question
I'm trying to scrape the IDs on a website, but I can't figure out how to specify the entry I want to work with. I've narrowed it down to a specific class, but I'm not sure how to target the number stored under "id" inside the data-preview attribute. Here's what I've narrowed the variable soup down to:
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>
Here is the relevant snippet of what I have so far:
from pathlib import Path
from bs4 import BeautifulSoup
import requests
import re
url = "www.website.com/image.png"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
elsoupo = soup.find(attrs={"class": "a fancy title for this class"})
print(elsoupo)
I just started working with Python, so hopefully I'm wording this in a way that makes sense.
I tried to narrow it down with a second attribute that could contain any number, but I just get None back:
elsoupoNum = elsoupo.find(attrs={"id":r'^[-+]?[0-9]+$'})
print(elsoupoNum)
Answer 1
Score: 0
<details>
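<summary>English:</summary>
`data-preview` is an attribute of the `li` element, and its value is an (ill-formed) JSON string: the trailing comma before the closing brace makes `json.loads` reject it. I removed that comma here for simplicity; to handle it in general, you may want to check [this][1].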
**code**
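from bs4 import BeautifulSoup
import json
# Sample markup from the question, with the trailing comma removed from the JSON.
# Named `html` rather than `str` so the built-in str type is not shadowed.
html = '''
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png" }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
# Select the first <li> that carries a data-preview attribute
li = soup.select_one('li[data-preview]')
# The attribute value is a plain string that holds JSON
data = li.attrs['data-preview']
print(data)
# Parse the JSON string into a dict and read its "id" key
j = json.loads(data)
print(j['id'])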
**output**
{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png" }
288857982
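If the attribute on the real page still contains the trailing comma, one possible workaround is to strip it before parsing. The sketch below is only an illustration: `loads_lenient` is a hypothetical helper name, and the regular expression is deliberately naive (it would also change a string value that happens to contain a comma followed by a closing brace or bracket).

```python
import json
import re

from bs4 import BeautifulSoup

# The question's original markup, trailing comma included (inner tags omitted).
html = '''<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'></li>'''


def loads_lenient(text):
    """Hypothetical helper: drop commas that directly precede } or ], then parse.

    Naive on purpose: it does not check whether the comma sits inside a string.
    """
    cleaned = re.sub(r',\s*([}\]])', r'\1', text)
    return json.loads(cleaned)


soup = BeautifulSoup(html, 'html.parser')
li = soup.select_one('li[data-preview]')
preview = loads_lenient(li['data-preview'])
print(preview['id'])
```

For anything beyond a quick script, a lenient JSON parser is probably a safer choice than a regular expression.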
[1]: https://stackoverflow.com/questions/23705304/can-json-loads-ignore-trailing-commas
</details>
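As for why the attempt in the question returned None: `find(attrs={"id": ...})` looks for an HTML attribute named `id`, which none of these tags has, and a plain string value is compared literally rather than treated as a regular expression (that needs `re.compile`). A rough sketch of collecting the id from every matching `li` follows. The second `li` and its values are made up for illustration, and the regular expression assumes the id is always a quoted run of digits.

```python
import re

from bs4 import BeautifulSoup

# Two items shaped like the one in the question; the second one is invented.
html = '''
<ul>
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'></li>
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857983", "staticUrl" : "www.website.com/image2.png", }'></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# No tag has an HTML attribute called "id", so even a compiled regex finds nothing.
missing = soup.find(attrs={"id": re.compile(r'^[-+]?[0-9]+$')})
print(missing)

# Pull the digits out of the data-preview string itself.
# This sidesteps JSON parsing, so the trailing comma does not matter.
id_pattern = re.compile(r'"id"\s*:\s*"(\d+)"')
ids = []
for li in soup.select('li[data-preview]'):
    match = id_pattern.search(li['data-preview'])
    if match:
        ids.append(match.group(1))
print(ids)
```

Reading the attribute with `li['data-preview']` works on the `li` tag itself; calling `find` on that tag would only search its descendants.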