Scraping three different texts from three HTML elements with the same class
Question
In this code I only want to scrape each movie's duration, but it scrapes the release date as well. I have posted the HTML too, to show the problem more clearly.
import openpyxl as opx
from bs4 import BeautifulSoup
import requests

# Workbook intended to hold the scraped rows
wb = opx.Workbook()
ws = wb.active
ws.title = "Movies"
header_row = ["Name", "Date", "Rate", "duration"]
ws.append(header_row)

url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
# Browser-like User-Agent sent with the request
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
response = requests.get(url, headers=header)
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")

# One <li> per movie in the chart
movies = soup.find_all(
    "li", class_="ipc-metadata-list-summary-item sc-bca49391-0 eypSaE cli-parent")
for movie in movies:
    name = movie.find("h3", class_="ipc-title__text").text.strip()
    date = movie.find(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    rate = movie.find(
        "span", class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating")
    # Assumption: the line assigning `duration` is missing from the snippet as posted.
    # This reconstruction reuses the `date` selector, since both spans share that class;
    # find() returns only the first match, so it yields the release year.
    duration = movie.find(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    print(duration.text)

# wb.save("sample.xlsx")
Answer 1
Score: 0
There are several ways to do this. One is to use find_all, which returns a list of all matches, and take the second item (index 1):
...
duration = movie.find_all('span', class_='sc-14dd939d-6 kHVqMR cli-title-metadata-item')[1]
...
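As a minimal sketch of the find_all approach, the snippet below runs against a small hand-written HTML fragment rather than the live IMDb page. The fragment, its values, and the wrapper div are assumptions made for illustration; only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: three sibling spans that
# share one class attribute. The values are made up for illustration.
html = """
<li class="cli-parent">
  <h3 class="ipc-title__text">1. Example Movie</h3>
  <div class="cli-title-metadata">
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">2h 22m</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
  </div>
</li>
"""

movie = BeautifulSoup(html, "html.parser").find("li", class_="cli-parent")

# find_all returns every matching span in document order
items = movie.find_all(
    "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
texts = [item.text.strip() for item in items]

date = texts[0]      # first span: release year
duration = texts[1]  # second span: running time
print(date, duration)
```

Indexing with [1] raises IndexError when a movie has fewer than two such spans, so checking len(items) first is safer on real pages.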
Another is to use select or select_one:
duration = movie.select_one('span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)')
select and select_one use CSS selectors. The selector above matches a span inside the movie element that has the classes sc-14dd939d-6, kHVqMR and cli-title-metadata-item and is the second child of its parent. Note that, because this is CSS, it matches any span that has those three classes, even if the span has additional classes.
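The difference between CSS class matching and an exact class string can be sketched on a made-up fragment. The HTML below, including the class named extra, is an assumption for illustration. The selector is also shortened to the single class cli-title-metadata-item, on the guess that generated names such as sc-14dd939d-6 are more likely to change.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle span carries one additional class.
html = """
<div class="cli-title-metadata">
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item extra">2h 22m</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
</div>
"""
movie = BeautifulSoup(html, "html.parser")

# CSS class selectors test each class independently, so the extra class
# does not prevent a match; :nth-child(2) picks the second child element.
duration_tag = movie.select_one("span.cli-title-metadata-item:nth-child(2)")
duration = duration_tag.text.strip()
print(duration)

# Passing a multi-class string to class_ compares against the whole class
# attribute, so the span with the additional class is not matched here.
exact = movie.find_all(
    "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
exact_texts = [tag.text.strip() for tag in exact]
print(exact_texts)
```

Note that :nth-child(2) counts all child elements of the parent, not only spans with that class, so it relies on the duration being the second child.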
To check that the second element exists before reading its text (select_one returns None when nothing matches):
duration = duration.text.strip() if duration else '-'
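A small sketch of that check, run once on a fragment that has a second span and once on one that does not. The helper name text_or_default, the class m, and both HTML strings are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default="-"):
    """Return the stripped text of a tag, or a default when the tag is None."""
    return tag.text.strip() if tag is not None else default


# Hypothetical fragments: one with a second span, one without.
html_full = '<div><span class="m">1994</span><span class="m">2h 22m</span></div>'
html_short = '<div><span class="m">1994</span></div>'

results = []
for html in (html_full, html_short):
    movie = BeautifulSoup(html, "html.parser")

    # select_one returns None when nothing matches
    via_select = text_or_default(movie.select_one("span.m:nth-child(2)"))

    # find_all returns a list, so check its length before indexing
    spans = movie.find_all("span", class_="m")
    via_find_all = spans[1].text.strip() if len(spans) > 1 else "-"

    results.append((via_select, via_find_all))

print(results)
```

The same default is used on both paths so that rows written to a spreadsheet stay aligned even when a value is missing.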
You can find more in the BeautifulSoup documentation, which is well written and has numerous examples.