July 7, 2023, 00:27:17

Scraping three different texts from three identical HTML elements

Question


In this code I only want to scrape each movie's duration, but it scrapes the release date as well. I am posting the HTML too, to show the problem more clearly.

import openpyxl as opx
from bs4 import BeautifulSoup
import requests
wb = opx.Workbook()
ws = wb.active
ws.title = "Movies"
header_row = ["Name", "Date", "Rate", "duration"]
ws.append(header_row)
url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
response = requests.get(url, headers=header)
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")
movies = soup.find_all(
    "li", class_="ipc-metadata-list-summary-item sc-bca49391-0 eypSaE cli-parent")
for movie in movies:
    name = movie.find("h3", class_="ipc-title__text").text.strip()
    date = movie.find(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    rate = movie.find(
        "span", class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating")
    print(duration.text)
# wb.save("sample.xlsx")


Answer 1

Score: 0


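There are several ways to do this. One is to use find_all, which returns a list of all matches; index 1 picks the second matching span, which here is the duration rather than the release date.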

...
duration = movie.find_all('span', class_='sc-14dd939d-6 kHVqMR cli-title-metadata-item')[1]
...
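
As a minimal sketch of the find_all approach, the snippet below runs against a small hand-written HTML fragment instead of the live page. The fragment, its values, and the assumption that the metadata spans appear in the order year, duration, rating are illustrative only; the real IMDb markup may differ and its generated class names can change.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one list item; not copied from IMDb.
SAMPLE_HTML = """
<li class="cli-parent">
  <h3 class="ipc-title__text">1. The Shawshank Redemption</h3>
  <div>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">2h 22m</span>
    <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
  </div>
</li>
"""

movie = BeautifulSoup(SAMPLE_HTML, "html.parser").find("li")

# find_all returns every matching span in document order.
items = movie.find_all(
    "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")

date = items[0].text.strip()      # first match: release year
duration = items[1].text.strip()  # second match: running time
print(date, duration)
```

Calling find_all once and indexing the result also avoids searching the same element twice for the date and the duration.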

Another is to use select or select_one.

duration = movie.select_one('span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)')

select and select_one use CSS selectors. The code above looks inside the movie element for a span that has the classes sc-14dd939d-6, kHVqMR and cli-title-metadata-item and that is the second child of its parent. Note that because this is a CSS selector, it matches any span that has these three classes, even if the span carries additional classes.
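
The difference in class matching can be seen in a short sketch. The fragment below is hypothetical, and the extra class named "extra" is added purely for illustration: the CSS selector still finds the second span, while the exact class-string match passed to find_all skips it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second span carries an additional class.
SAMPLE_HTML = """
<div>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">1994</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item extra">2h 22m</span>
  <span class="sc-14dd939d-6 kHVqMR cli-title-metadata-item">R</span>
</div>
"""

movie = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS matching: the span needs these three classes and must be the
# second child of its parent; additional classes do not matter.
by_css = movie.select_one(
    "span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)")

# A class_ value containing spaces is compared against the whole class
# attribute string, so the span with the extra class is not matched.
by_exact = movie.find_all(
    "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")

print(by_css.text.strip())
print([span.text for span in by_exact])
```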

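To check that the element was found before reading its text (select_one returns None when nothing matches, whereas indexing the find_all result with [1] raises IndexError if fewer than two spans match):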

duration = duration.text.strip() if duration else '-'
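
A sketch of that check applied in a loop follows. The helper name text_or_default, the simplified single class name, and the sample fragment (where the second item has no duration span) are assumptions made for the example.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default="-"):
    """Return the stripped text of tag, or default when tag is None."""
    return tag.text.strip() if tag is not None else default


# Hypothetical fragment: the second item has only a year span.
SAMPLE_HTML = """
<ul>
  <li><div><span class="cli-title-metadata-item">1994</span><span class="cli-title-metadata-item">2h 22m</span></div></li>
  <li><div><span class="cli-title-metadata-item">2023</span></div></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

durations = []
for movie in soup.find_all("li"):
    # select_one returns None when no element matches the selector.
    duration = movie.select_one("span.cli-title-metadata-item:nth-child(2)")
    durations.append(text_or_default(duration))

print(durations)  # ['2h 22m', '-']
```

Comparing against None explicitly keeps the intent clear and does not depend on how a Tag object evaluates in a boolean context.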

You can find more in the BeautifulSoup documentation, which is well written and has numerous examples.
