Scraping three different texts from three identical HTML tags

Question

In this code I only want to scrape each movie's duration, but it also scrapes the movie's release date. I have posted the HTML as well to show the problem more clearly.

    import openpyxl as opx
    from bs4 import BeautifulSoup
    import requests

    # Create the workbook and write the header row.
    wb = opx.Workbook()
    ws = wb.active
    ws.title = "Movies"
    header_row = ["Name", "Date", "Rate", "duration"]
    ws.append(header_row)

    url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    }
    response = requests.get(url, headers=header)
    html_content = response.content
    soup = BeautifulSoup(html_content, "html.parser")

    # Each <li> holds one movie.
    movies = soup.find_all(
        "li", class_="ipc-metadata-list-summary-item sc-bca49391-0 eypSaE cli-parent")
    for movie in movies:
        name = movie.find("h3", class_="ipc-title__text").text.strip()
        date = movie.find(
            "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
        rate = movie.find(
            "span", class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating")
        # NOTE: `duration` is not assigned anywhere in this snippet as posted.
        print(duration.text)
    # wb.save("sample.xlsx")

Answer 1

Score: 0

There are several ways to do this. One is to use find_all, which returns a list of all matches; the duration is the second matching span, at index 1.

    ...
    duration = movie.find_all('span', class_='sc-14dd939d-6 kHVqMR cli-title-metadata-item')[1]
    ...
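
To see why find returns the year while find_all(...)[1] returns the duration, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for one list item on the IMDb page (the real markup may differ, and its generated class names change over time); only the class names are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question: three sibling spans share one class.
SAMPLE_HTML = """
<li class="cli-parent">
  <h3 class="ipc-title__text">1. The Shawshank Redemption</h3>
  <div>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">2h 22m</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
  </div>
</li>
"""

METADATA_CLASS = "sc-14dd939d-6 kHVqMR cli-title-metadata-item"

movie = BeautifulSoup(SAMPLE_HTML, "html.parser").find("li", class_="cli-parent")

# find() stops at the first match, which here is the release year.
first_match = movie.find("span", class_=METADATA_CLASS).text.strip()

# find_all() returns every match in document order: year, duration, rating.
metadata = [span.text.strip() for span in movie.find_all("span", class_=METADATA_CLASS)]
duration = metadata[1]

print(first_match)
print(metadata)
print(duration)
```

Indexing by position is simple but depends on the spans always appearing in the same order, which is an assumption about the page rather than something BeautifulSoup guarantees.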

Another is to use select or select_one.

    duration = movie.select_one('span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)')

select and select_one use CSS selectors. The code above looks inside the movie element for a span that has the classes sc-14dd939d-6, kHVqMR and cli-title-metadata-item and is the second child of its parent. Because it is a CSS selector, it matches any span carrying those three classes, even if the span has additional classes.
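
As a small offline check of the selector above, the sketch below runs it against a hypothetical fragment (the markup is an assumption, not IMDb's actual HTML). It suggests two things: :nth-child(2) picks the span that is the second child of its parent, and a CSS class selector still matches a span that carries an extra class, whereas class_ with a multi-class string compares against the exact attribute value.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second span carries an extra class ("extra").
SAMPLE_HTML = """
<div class="cli-title-metadata">
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item extra">2h 22m</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
</div>
"""

SELECTOR = "span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS class selectors only require the listed classes, so the extra class is fine.
css_match = soup.select_one(SELECTOR)

# class_ with a multi-class string matches the exact attribute value,
# so the span with the extra class is skipped here.
exact_matches = soup.find_all(
    "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")

print(css_match.text)
print([span.text for span in exact_matches])
```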

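To check whether the element was found before reading its text (select_one returns None when nothing matches, whereas indexing the find_all result raises IndexError if the second span is missing):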

    duration = duration.text.strip() if duration else '-'
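
The guard above only helps when the lookup returns None rather than raising. Below is a minimal sketch of a length check for the find_all variant; the helper name extract_duration and the sample markup are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

METADATA_CLASS = "sc-14dd939d-6 kHVqMR cli-title-metadata-item"


def extract_duration(movie, default="-"):
    """Return the text of the second metadata span, or `default` if absent.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    spans = movie.find_all("span", class_=METADATA_CLASS)
    # Indexing spans[1] directly would raise IndexError when fewer than
    # two spans exist, so check the length first.
    return spans[1].text.strip() if len(spans) > 1 else default


# Hypothetical markup: the second item has only a year.
SAMPLE_HTML = """
<ul>
  <li><span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
      <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">2h 22m</span></li>
  <li><span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1972</span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
durations = [extract_duration(li) for li in soup.find_all("li")]
print(durations)
```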

You can find more in the BeautifulSoup documentation, which is well written and has numerous examples.

huangapple
  • Posted on 2023-07-07 00:27:17
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76630841.html