How to scrape part of the text using BeautifulSoup?

Question

This is the website: https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare

I am trying to scrape the Ingredients section of this webpage.

Using F12, I can see it is under <div id="core-product-information">, but that div also contains a lot of other text, such as brandnamelabel, promotions, etc.

I would just like the Ingredient list:

Step 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)
Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.

My code so far is:

import requests
from bs4 import BeautifulSoup
testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)

soup = BeautifulSoup(response.content, "html.parser")
ingredient = soup.find('div', id='core-product-information').text

print(ingredient)

Is there a way to extract just the Ingredient list?
Very new to coding...
Thank you!

Answer 1

Score: 1

Search for a specific keyword first, then parse the matching text. The second condition in the if statement skips any div whose first space-separated word is exactly "Ingredients", such as a div that contains only the "Ingredients" heading.

import requests
from bs4 import BeautifulSoup

testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)

soup = BeautifulSoup(response.text, 'html.parser')

for div in soup.find_all('div'):
    text = div.text.strip()
    # keep divs whose text starts with "Ingredients" but is not just the heading
    if text.startswith('Ingredients') and text.split(' ')[0] != 'Ingredients':
        print(text)

The output of the above code is:

IngredientsStep 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.

You can then do further string processing on this result.
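As one possible way to do that further processing, here is a minimal sketch that splits the combined text into one list of ingredients per step. It assumes the text always has the shape shown in the output above ("Ingredients", optional whitespace, "Step N:", then a comma-separated list). The sample string raw is a shortened copy of that output and the helper name split_steps is made up for this illustration, so adjust both to the real page text.

```python
import re

# Shortened sample of the printed output above (assumption: the real text
# has the same "Ingredients Step N: a, b, c" shape).
raw = (
    "IngredientsStep 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), "
    "glycolic acid, potassium sorbate (* - exfoliant)"
    "Ingredients Step 2: water/aqua/eau, sodium bicarbonate, retinol, "
    "potassium sorbate."
)


def split_steps(text):
    """Hypothetical helper: map each "Step N" label to its ingredient list."""
    # With one capture group, re.split returns
    # [prefix, label, body, label, body, ...].
    parts = re.split(r"Ingredients\s*(Step \d+):", text)
    steps = {}
    for label, body in zip(parts[1::2], parts[2::2]):
        # Drop surrounding whitespace and a trailing full stop on the section,
        # then split the section on commas.
        body = body.strip().rstrip(".")
        steps[label] = [item.strip() for item in body.split(",") if item.strip()]
    return steps


for step, items in split_steps(raw).items():
    print(step, items)
```

Splitting on commas is enough here because none of the ingredients in the sample contain a comma. If the real list has commas inside parentheses, the split would need to be smarter.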
