Python BeautifulSoup Span Scraping

Question

I am trying to scrape a value from a span that is identified by its ID, but getting it is not as simple as calling find and taking the span's text.

Below is the HTML from the webpage.
HTML

I am trying to print "B0C4YKLXPQ".

This gets me the

Below are all attempts that failed.

- page_soup.find("div", {"id": "twisterContainer"}).find_all("data-asin")

- page_soup.find("div", {"id": "twisterContainer"}).find("span", {"id": "fitRecommendationsSection"}).span["data-asin"]

- page_soup.find("div", {"id": "twisterContainer"}).find("span", {"id": "fitRecommendationsSection"}).find_all("data-asin")

- page_soup.find("div", {"id": "twisterContainer"}).find_all("data-asin")

- page_soup.find("div", {"id": "twisterContainer"}).find_all(["data-asin"])

Answer 1

Score: 1

The following code has a good chance of working, unless Amazon has blocked your IP for some reason, such as too many scraping attempts:

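import requests
from bs4 import BeautifulSoup as bs

# Send a browser-like User-Agent so the request resembles one from a regular browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

url = 'https://www.amazon.com/dp/B002G9UDYG'

# Fetch the product page and parse the returned HTML.
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')

# data-asin is an attribute of the span itself, so read it with .get()
# instead of searching for a tag with that name.
item = soup.select_one('span[id="fitRecommendationsSection"]').get('data-asin')
print(item)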

Result in terminal:

B0C4YKLXPQ
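
As for why the attempts in the question came back empty: the first argument of find_all is a tag name, so find_all("data-asin") looks for <data-asin> elements rather than for an attribute. The sketch below shows the difference on a small, made-up HTML snippet. The markup is an assumption that only mimics the IDs mentioned in the question (the real page is much larger and may be structured differently), and the nested span is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reduced to the parts that matter here.
html = """
<div id="twisterContainer">
  <span id="fitRecommendationsSection" data-asin="B0C4YKLXPQ">
    <span class="inner">Fit: as expected</span>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "twisterContainer"})

# find_all("data-asin") searches for <data-asin> tags, so it matches nothing.
by_tag_name = container.find_all("data-asin")

# attrs={"data-asin": True} matches any tag that carries the attribute.
by_attribute = container.find_all(attrs={"data-asin": True})
asins = [tag["data-asin"] for tag in by_attribute]

# In this snippet the attribute sits on the span with the ID itself,
# not on a span nested inside it, so index the span directly.
span = container.find("span", {"id": "fitRecommendationsSection"})
asin = span["data-asin"]

print(by_tag_name)
print(asins)
print(asin)
```

If the attribute lives on the identified span, as it does in this snippet, then .span["data-asin"] from the second attempt looks one level too deep: it reads the first nested span, which has no such attribute.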

BeautifulSoup documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
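
If the request is blocked, the span will most likely be missing from the response, and select_one returns None in that case, so chaining .get() onto it raises an AttributeError. Here is a minimal sketch of a more defensive lookup. The helper name extract_asin and both sample pages are made up for illustration; the "blocked" page is only a placeholder, not Amazon's actual response.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_asin(html: str, span_id: str = "fitRecommendationsSection") -> Optional[str]:
    """Return the data-asin of the span with the given id, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.select_one(f"span#{span_id}")
    if span is None:
        # The element is not in the page (for example, a block or captcha page).
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return span.get("data-asin")


# Hypothetical inputs: one page that contains the span and one that does not.
product_page = '<span id="fitRecommendationsSection" data-asin="B0C4YKLXPQ"></span>'
blocked_page = "<html><body>Please verify that you are not a robot</body></html>"

print(extract_asin(product_page))
print(extract_asin(blocked_page))
```

With the code above you would call extract_asin(r.text) and treat a None result as a sign that the page did not contain the expected element.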

huangapple
  • Posted on June 5, 2023, 20:37:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76406496.html