How to web-scrape partial text using BeautifulSoup?

Question


This is the website: https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare

I am trying to scrape the Ingredients section of this webpage.

Using F12, I can see it is under <div id="core-product-information">, but that div also contains a lot of other text, such as the brand name label, promotions, etc.

I would like just the ingredient list:

Step 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)
Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.

My code so far is:

import requests
from bs4 import BeautifulSoup
testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)

soup = BeautifulSoup(response.content, "html.parser")
ingredient = soup.find('div', id='core-product-information').text

print(ingredient)

Is there a way to extract just the Ingredient list?
Very new to coding...
Thank you!

Answer 1

Score: 1


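Search the div elements for a specific keyword, then parse the matching text. The second condition in the if statement skips any div whose first space-separated word is exactly "Ingredients", which prevents printing a div that contains only the "Ingredients" heading.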

import requests
from bs4 import BeautifulSoup

testlink = 'https://www.mecca.com.au/dr-dennis-gross/alpha-beta-universal-daily-peel/V-052382.html?cgpath=skincare'
response = requests.get(testlink)

soup = BeautifulSoup(response.text, 'html.parser')

for div in soup.find_all('div'):
    # Strip the surrounding whitespace once and reuse the result
    text = div.text.strip()
    # Keep divs whose text starts with "Ingredients" but whose first
    # space-separated word is not exactly "Ingredients" (skips a bare heading)
    if text.startswith('Ingredients') and text.split(' ')[0] != 'Ingredients':
        print(text)

The output of the above code is:

IngredientsStep 1: water/aqua/eau, alcohol denat. (SD Alcohol 40-B), glycolic acid, potassium hydroxide, hamamelis virginiana (witch hazel) water, salicylic acid*, polysorbate 20, citric acid, lactic acid, malic acid, camellia sinensis leaf extract, achillea millefolium extract, chamomilla recutita (matricaria) flower extract, soy isoflavones, glycerin, copper PCA, zinc PCA, lecithin, alcohol, polysorbate 80, disodium EDTA, fragrance (parfum), phenoxyethanol, benzoic acid, sodium benzoate, potassium sorbate (* - exfoliant)Ingredients Step 2: water/aqua/eau, sodium bicarbonate, ascorbic acid, retinol, tocopheryl acetate, resveratrol, ubiquinone, adenosine, achillea millefolium extract, camellia sinensis leaf extract, soy isoflavones, phospholipids, leuconostoc/radish root ferment filtrate, copper PCA, sodium PCA, zinc PCA, glycerin, polysorbate 20, octoxynol-9, tetrasodium EDTA, phenoxyethanol, sodium benzoate, potassium sorbate.

You can then apply further string processing to this result.
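One possible way to do that further processing is sketched here, assuming the text keeps the "Ingredients Step N:" markers seen in the output above. The helper name split_steps is hypothetical, and raw is an abbreviated stand-in for the real output, not the full ingredient list.

```python
import re

# Abbreviated stand-in for the string printed above (assumption: the real
# output has the same "Ingredients Step N:" markers, just with longer lists).
raw = (
    "IngredientsStep 1: water/aqua/eau, glycolic acid, salicylic acid*"
    " (* - exfoliant)"
    "Ingredients Step 2: water/aqua/eau, sodium bicarbonate, retinol."
)


def split_steps(text):
    """Return {step_number: [ingredient, ...]} from the concatenated text."""
    # Split on every "Ingredients Step N:" marker; the capture group keeps N.
    parts = re.split(r"Ingredients\s*Step\s*(\d+):", text)
    steps = {}
    # parts alternates after the first element: number, body, number, body...
    for number, body in zip(parts[1::2], parts[2::2]):
        body = body.strip().rstrip(".")
        steps[int(number)] = [item.strip() for item in body.split(",")]
    return steps


steps = split_steps(raw)
for number, items in steps.items():
    print(f"Step {number}: {items}")
```

Note that a footnote such as "(* - exfoliant)" stays attached to the last ingredient of its step; strip it separately if you need clean names only.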

Posted by huangapple on July 24, 2023 at 19:50:55
Source: https://go.coder-hub.com/76754206.html