How to scrape <p> from HTML under a certain div class

Question

I'd like to scrape items 4, 5 and 6 (Description, Uses and Sources) from the HTML, which are tagged as 'p' under a 'div', and apply this to different elements by formatting url = f'....{element}'.

print('Current path is:', currentPath)

content_list = []
url = 'https://pubchem.ncbi.nlm.nih.gov/element/Antimony'
res = requests.post(url)
# print(res.text)

soup = bs(res.text, 'lxml')

content = soup.find_all('div', class_="section-content-item")
for p in content:
    p = soup.find('p')
    content_list.append(p)

print(content_list)

Answer 1

Score: 0

First of all, always take a look at your soup to check that all the expected ingredients are in place.


The website's content is generated dynamically and comes from an API, so you won't get it with BeautifulSoup because it is not in the response.

You have to request the API directly to reach your goal. Check the XHR tab in the Network section of your browser's dev tools to find the request.
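One way to "look at your soup" is to count how many of the expected tags the raw HTML actually contains before writing any extraction logic. The following is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the class name SectionParagraphCounter, the helper count_target_paragraphs and both sample HTML strings are made up for illustration.

```python
from html.parser import HTMLParser


class SectionParagraphCounter(HTMLParser):
    """Counts <p> tags nested inside <div class="section-content-item">."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one bool per open <div>: True if it is a target div
        self.paragraphs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("section-content-item" in classes)
        elif tag == "p" and any(self.div_stack):
            # We are somewhere inside at least one target div
            self.paragraphs += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


def count_target_paragraphs(html):
    """Return the number of <p> tags found under the target div class."""
    parser = SectionParagraphCounter()
    parser.feed(html)
    return parser.paragraphs


# Hypothetical samples: an empty shell, as a JavaScript-rendered page might
# return it, and a fully rendered fragment.
empty_shell = '<html><body><div id="root"></div></body></html>'
rendered = (
    '<div class="section-content-item"><p>a</p><div><p>b</p></div></div>'
    '<p>c</p>'
)

print(count_target_paragraphs(empty_shell))
print(count_target_paragraphs(rendered))
```

If the count for res.text is zero, the paragraphs are most likely filled in by JavaScript after the page loads, and no parser will find them in the static response.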

Example

Just to point you in a direction: simply iterate over the results and pick out the information, then convert it into a format that fits your needs.

import requests

# Atomic numbers of the elements to look up (51 = antimony)
atomic_numbers = ['51']
# Section headings to extract
sections = ['Description', 'Uses', 'Sources']

for e in atomic_numbers:
    # The PUG View endpoint returns the element record as JSON
    url = f'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/element/{e}/JSON/'
    section_data = requests.get(url).json()['Record']['Section']
    for s in section_data:
        if s['TOCHeading'] in sections:
            print(s['Information'])

Output

[{'ReferenceNumber': 6, 'Value': {'StringWithMarkup': [{'String': 'Antimony is a poor conductor of heat and electricity. Antimony and many of its compounds are toxic.'}]}}]
[{'ReferenceNumber': 5, 'Value': {'StringWithMarkup': [{'String': "Antimony is a brittle metal and is a poor conductor of heat and electricity. Very pure antimony is used to make certain types of semiconductor devices, such as diodes and infrared detectors. Antimony is alloyed with lead to increase lead's durability. Antimony alloys are also used in batteries, low friction metals, type metal and cable sheathing, among other products. Antimony compounds are used to make flame-proofing materials, paints, ceramic enamels, glass and pottery. The ancient Egyptians used antimony, in the form of stibnite, for black eye make-up."}]}}, {'ReferenceNumber': 6, 'Value': {'StringWithMarkup': [{'String': 'Antimony is finding use in semiconductor technology for making infrared detectors, diodes and Hall-effect devices. It greatly increases the hardness and mechanical strength of lead. Batteries, antifriction alloys, type metal, small arms and tracer bullets, cable sheathing, and minor products use about half the metal produced. Compounds taking up the other half are oxides, sulfides, sodium antimonate, and antimony trichloride. These are used in manufacturing flame-proofing compounds, paints ceramic enamels, glass, and pottery.'}]}}]
[{'ReferenceNumber': 6, 'Value': {'StringWithMarkup': [{'String': 'Antimony is not abundant, but is found in over 100 mineral species. It is sometimes found natively, but more frequently it is found as the sulfide stibnite.'}]}}]
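To convert that output into something easier to work with, one option is to flatten each Information list into plain strings keyed by section heading. This is a rough sketch, not part of the original answer: the helper names extract_strings and collect_sections are made up, the sample data is hand-written to mirror the shape of the output above (texts cut to their first sentence), and it assumes every entry follows the Value / StringWithMarkup / String nesting shown there.

```python
def extract_strings(information):
    """Return the plain text of every StringWithMarkup entry in an Information list."""
    texts = []
    for item in information:
        markup = item.get("Value", {}).get("StringWithMarkup", [])
        texts.extend(entry["String"] for entry in markup if "String" in entry)
    return texts


def collect_sections(section_data, wanted):
    """Map each wanted TOCHeading to its list of plain-text strings."""
    return {
        s["TOCHeading"]: extract_strings(s.get("Information", []))
        for s in section_data
        if s.get("TOCHeading") in wanted
    }


# Hand-made sample in the same shape as the JSON sections (assumption:
# real records carry more keys, which this sketch simply ignores).
sample = [
    {"TOCHeading": "Description", "Information": [
        {"ReferenceNumber": 6, "Value": {"StringWithMarkup": [
            {"String": "Antimony is a poor conductor of heat and electricity."}]}},
    ]},
    {"TOCHeading": "History", "Information": []},
    {"TOCHeading": "Uses", "Information": [
        {"ReferenceNumber": 5, "Value": {"StringWithMarkup": [
            {"String": "Antimony is a brittle metal and is a poor conductor of heat and electricity."}]}},
        {"ReferenceNumber": 6, "Value": {"StringWithMarkup": [
            {"String": "Antimony is finding use in semiconductor technology for making infrared detectors, diodes and Hall-effect devices."}]}},
    ]},
]

result = collect_sections(sample, {"Description", "Uses", "Sources"})
for heading, texts in result.items():
    print(heading, texts)
```

In the loop above, section_data from the API response could be passed to collect_sections in place of sample. Sections with several references, such as Uses, yield one string per reference.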

huangapple
  • Published on 2023-02-08 12:05:03
  • When reposting, please keep this link: https://go.coder-hub.com/75381284.html