February 18, 2023 21:26:29

How to extract 'Odor' information from PubChem using BeautifulSoup

Question

I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor

import requests
from bs4 import BeautifulSoup
url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})
print(odor_info.text.strip())

I get the following error:
AttributeError: 'NoneType' object has no attribute 'find'
It seems that BeautifulSoup does not extract the whole page.

I expect the following output:
Orange-rose odor, Floral, waxy, green

Answer 1

Score: 2

The page in question makes an AJAX request to load its data. You can see this in a web browser by looking at the Network tab of the developer tools (F12 in many browsers).

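That is to say, the data is not in the HTML returned by the initial page load, so BeautifulSoup cannot find it: soup.find('section', {'id': 'Odor'}) returns None, and calling .find on None raises the AttributeError.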

To solve the problem:

use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
simply query the API according to the request seen when loading the page in the browser. Thus:

import requests

# The PUG View endpoint returns the full compound record as JSON.
PubChem_Nonanal_CID = 31289
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))
print(compound_info.json())
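The full record is large. As a rough sketch, the request URL could be built by a small helper that optionally asks for a single heading. The helper name build_pug_view_url and the constant PUG_VIEW_BASE are made up for illustration, and the heading query parameter is an assumption based on the PUG View documentation, so check it against the current docs before relying on it.

```python
from urllib.parse import urlencode

# Base path of the PUG View compound endpoint used in the answer.
PUG_VIEW_BASE = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound'


def build_pug_view_url(cid, heading=None):
    """Return the PUG View JSON URL for a compound ID.

    `heading` is assumed to map to PUG View's `heading` query parameter,
    which should restrict the reply to the section with that title.
    """
    url = '{}/{}/JSON'.format(PUG_VIEW_BASE, cid)
    if heading is not None:
        # urlencode escapes spaces and other reserved characters.
        url += '?' + urlencode({'heading': heading})
    return url


print(build_pug_view_url(31289))
print(build_pug_view_url(31289, heading='Odor'))
```

The resulting string can be passed to requests.get in place of the formatted compound_data_url. If the heading filter behaves as assumed, the reply should keep the same Record and Section nesting but contain only the requested section.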

Parsing the JSON Reply

Parsing the reply is a bit of a challenge, because it is made up of nested lists of sections, each identified by its TOCHeading.
If the order of the sections isn't guaranteed, you could match on the headings instead of on list positions, like this:

for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == "Chemical and Physical Properties":
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == "Odor":
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
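The nested loops above could also be written as a recursive search, which does not need to know how deep the target heading sits. This is only a sketch: find_section and section_strings are hypothetical helper names, and sample_record is a made-up, trimmed-down stand-in for compound_info.json() that mimics the nesting used in the loop, not real PubChem output.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading matches."""
    for section in sections:
        if section.get('TOCHeading') == heading:
            return section
        # Subsections, when present, live under the same 'Section' key.
        found = find_section(section.get('Section', []), heading)
        if found is not None:
            return found
    return None


def section_strings(section):
    """Collect every 'String' value from a section's Information entries."""
    strings = []
    for info in section.get('Information', []):
        for item in info.get('Value', {}).get('StringWithMarkup', []):
            if 'String' in item:
                strings.append(item['String'])
    return strings


# Made-up sample with the same nesting as the reply (not real PubChem data).
sample_record = {
    'Record': {
        'Section': [
            {'TOCHeading': 'Names and Identifiers', 'Section': []},
            {'TOCHeading': 'Chemical and Physical Properties', 'Section': [
                {'TOCHeading': 'Experimental Properties', 'Section': [
                    {'TOCHeading': 'Odor', 'Information': [
                        {'Value': {'StringWithMarkup': [
                            {'String': 'Orange-rose odor'}]}},
                        {'Value': {'StringWithMarkup': [
                            {'String': 'Floral, waxy, green'}]}},
                    ]},
                ]},
            ]},
        ],
    }
}

odor_section = find_section(sample_record['Record']['Section'], 'Odor')
if odor_section is not None:
    print(', '.join(section_strings(odor_section)))
```

With a real reply, sample_record would be replaced by compound_info.json(). Returning None when the heading is missing lets the caller handle compounds that have no Odor entry without hitting a KeyError.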

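Otherwise, inspect the reply in a JSON viewer such as jsonformatter.com and index into it directly. The indices below depend on the section order in the reply, so this breaks if the order changes:

# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String
odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']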
