How to extract 'Odor' information from PubChem using BeautifulSoup
Question
I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor
import requests
from bs4 import BeautifulSoup
url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})
print(odor_info.text.strip())
I get the following error.
AttributeError: 'NoneType' object has no attribute 'find'
It seems that BeautifulSoup does not extract the whole page.
I expect the following output:
Orange-rose odor, Floral, waxy, green
Answer 1
Score: 2
The page in question makes an AJAX request to load its data. We can see this in a web browser by looking at the Network tab of the dev tools (F12 in many browsers).
That is to say, the data simply isn't there when the initial page loads, so BeautifulSoup cannot find it: soup.find('section', {'id': 'Odor'}) returns None, and calling .find on None raises the AttributeError.
To solve the problem:
- use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
- simply query the API according to the request seen when loading the page in the browser. Thus:
import requests

PubChem_Nonanal_CID = 31289
# PUG View endpoint that the compound page requests in the background
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))
print(compound_info.json())
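The request above can also be wrapped in small helpers. The following is only a sketch: the endpoint is the one from this answer, but the names build_compound_url and fetch_compound_record are hypothetical, and it uses the standard library's urllib instead of requests so that it runs without extra packages. It has not been checked against PubChem's rate limits or any user-agent requirements.

```python
import json
import urllib.request

# Endpoint taken from the answer above; the helper names below are hypothetical.
PUG_VIEW_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON/"


def build_compound_url(cid):
    """Return the PUG View JSON URL for a PubChem compound ID (CID)."""
    cid = int(cid)  # rejects strings such as "31289#section=Odor"
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    return PUG_VIEW_URL.format(cid=cid)


def fetch_compound_record(cid, timeout=30):
    """Download and decode the JSON record for one compound (needs network access)."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(build_compound_url(cid), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_compound_record(31289) should return the same dictionary that compound_info.json() produces, assuming the endpoint behaves as described here.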
Parsing the JSON Reply
Parsing the reply is a bit of a challenge, because it is made up of many nested lists.
If the order of properties isn't guaranteed, you could opt for a solution like this:
for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == "Chemical and Physical Properties":
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == "Odor":
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
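Otherwise, if the order is stable, follow the path shown by a JSON viewer such as jsonformatter.com and index directly:
# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String
odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']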
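As a middle ground between the nested loops and the hard-coded indices, a recursive search by TOCHeading avoids depending on either the nesting depth or the position of a section. This is only a sketch: find_section and section_strings are hypothetical helper names, and the sample dictionary is hand-made to imitate the nesting used in this answer, not real PubChem output.

```python
def find_section(node, heading):
    """Depth-first search for the first section whose TOCHeading equals heading."""
    for section in node.get("Section", []):
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section, heading)
        if found is not None:
            return found
    return None


def section_strings(section):
    """Collect every StringWithMarkup string in a section's Information list."""
    strings = []
    for info in section.get("Information", []):
        for item in info.get("Value", {}).get("StringWithMarkup", []):
            strings.append(item["String"])
    return strings


# Hand-made sample imitating the structure described above (an assumption, not real data).
sample = {
    "Record": {
        "Section": [
            {"TOCHeading": "Names and Identifiers"},
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {
                        "TOCHeading": "Experimental Properties",
                        "Section": [
                            {
                                "TOCHeading": "Odor",
                                "Information": [
                                    {"Value": {"StringWithMarkup": [{"String": "Orange-rose odor"}]}},
                                    {"Value": {"StringWithMarkup": [{"String": "Floral, waxy, green"}]}},
                                ],
                            }
                        ],
                    }
                ],
            },
        ]
    }
}

odor = find_section(sample["Record"], "Odor")
print(section_strings(odor) if odor else "no Odor section found")
```

With a real reply, the same call would be find_section(compound_info.json()['Record'], 'Odor'). Returning None for a missing heading lets the caller handle compounds that have no Odor section instead of failing with a KeyError or IndexError.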