How to scrape partial text using BeautifulSoup?
Question
This is the website: https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare
I am trying to scrape the Ingredients section of this webpage.
Using F12, I can see it is under <div id="core-product-information">.
However, that div also contains a lot of other text, such as brandnamelabel, promotions, etc.
I would just like the ingredient list:
Step 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)
Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.
My code so far is:
import requests
from bs4 import BeautifulSoup
testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)
soup = BeautifulSoup(response.content, "html.parser")
ingredient = soup.find('div', id='core-product-information').text
print(ingredient)
Is there a way to extract just the Ingredient list?
Very new to coding...
Thank you!
Answer 1
Score: 1
You need to search for a specific keyword first and then parse the matching text. The second condition in the if statement skips any div whose first word is exactly "Ingredients", such as the one that holds only the heading.
import requests
from bs4 import BeautifulSoup

testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)
soup = BeautifulSoup(response.text, 'html.parser')

for div in soup.find_all('div'):
    text = div.text.strip()
    # Keep divs whose text starts with "Ingredients", but skip those whose
    # first space-separated word is exactly "Ingredients" (e.g. the bare heading)
    if text.startswith('Ingredients') and text.split(' ')[0] != 'Ingredients':
        print(text)
The output of the above code is:
IngredientsStep 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.
You can now process this string further, for example by splitting it into the individual steps and their ingredients.
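As a rough sketch of that further processing, the string could be split into one list per step with a regular expression. The helper name split_ingredients and the shortened sample string raw are made up for illustration; the sketch assumes the text keeps the "Step N:" labels seen in the output above and that ingredients are separated by commas.

```python
import re


def split_ingredients(raw):
    """Split the scraped text into a dict such as {'Step 1': [...], 'Step 2': [...]}."""
    # Each section starts with an optional "Ingredients" label followed by "Step N:".
    pattern = re.compile(r"(?:Ingredients\s*)?(Step \d+):")
    # With one capture group, split() returns
    # [text before the first label, label, body, label, body, ...]
    parts = pattern.split(raw)
    result = {}
    for label, body in zip(parts[1::2], parts[2::2]):
        # Drop surrounding whitespace and a trailing full stop, then split on commas
        body = body.strip().rstrip(".")
        result[label] = [item.strip() for item in body.split(",")]
    return result


# Abbreviated stand-in for the scraped output (hypothetical sample, not the full list)
raw = (
    "IngredientsStep 1: water/aqua/eau, glycolic acid, salicylic acid*"
    "Ingredients Step 2: water/aqua/eau, sodium bicarbonate, retinol."
)

for step, items in split_ingredients(raw).items():
    print(step, items)
```

In the scraping loop, you would pass text to this function instead of printing it. If the site changes its labels or uses commas inside an ingredient name, the pattern and the split would need adjusting.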