Scrape a sub-attribute with bs4 in Python?

Question

I'm trying to scrape the ids on a website, but I can't figure out how to specify the entry I want to work with. The most I could do was narrow it down to a specific class, but I'm not sure how to target the number stored under 'id' in 'data-preview'. Here's what I've narrowed the variable soup down to:

<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>

Here is the relevant snippet of what I have so far:

from pathlib import Path
from bs4 import BeautifulSoup
import requests
import re

url = "www.website.com/image.png"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

elsoupo = soup.find(attrs={"class": "a fancy title for this class"})
print(elsoupo)

I just started working with Python, so hopefully I'm wording this in a way that makes sense.

I tried to narrow it down with a second attribute that could match any number, but I just get None back.

elsoupoNum = elsoupo.find(attrs={"id":r'^[-+]?[0-9]+$'})

print(elsoupoNum)

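Answer 1

Score: 0

`data-preview` is an attribute of the `li` element, and its value is a (malformed) JSON string. I corrected it for simplicity by removing the trailing comma; you may want to check [this][1].

code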

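from bs4 import BeautifulSoup
import json

# Sample markup from the question, with the trailing comma in data-preview removed.
# Named html_doc rather than str to avoid shadowing the built-in str type.
html_doc = '''<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png"  }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);"></span></div>
</li>'''

soup = BeautifulSoup(html_doc, 'html.parser')
# Select the first <li> that carries a data-preview attribute
li = soup.select_one('li[data-preview]')
# The attribute value is a plain string containing JSON
data = li.attrs['data-preview']
print(data)
# Parse the JSON string and read its "id" field
j = json.loads(data)
print(j['id'])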

output

{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png"  }
288857982
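
The snippet above works on a hand-corrected string, while the markup in the question still has a trailing comma that `json.loads` rejects. As a rough sketch, one way to cope with that is to strip the stray comma before parsing. The helper name `parse_preview` and the regex are assumptions made for illustration, not part of the original answer, and the regex is only a heuristic: it would also alter a string value that happens to contain a comma followed by a closing brace.

```python
import json
import re

from bs4 import BeautifulSoup

# Markup as posted in the question, including the trailing comma
# that makes the data-preview value invalid JSON.
html_doc = '''<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png", }'>
<div class="Li Inner FnImage"></div>
</li>'''


def parse_preview(raw):
    """Parse a data-preview value, tolerating trailing commas (hypothetical helper)."""
    # Assumption: a comma directly before a closing brace or bracket is stray.
    cleaned = re.sub(r',\s*([}\]])', r'\1', raw)
    return json.loads(cleaned)


soup = BeautifulSoup(html_doc, 'html.parser')
# select() returns every matching <li>, so this also covers pages with many items.
ids = [parse_preview(li['data-preview'])['id'] for li in soup.select('li[data-preview]')]
print(ids)
```

Using `select` instead of `select_one` collects the id of every `li` that has a `data-preview` attribute, which is probably closer to what a real page with many items would need.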

[1]: https://stackoverflow.com/questions/23705304/can-json-loads-ignore-trailing-commas

huangapple
  • Posted on February 18, 2023 at 10:25:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75490803.html