Question

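I would like to extract the votes and gross revenue for the list of movies shown on this page: https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating. I am new to Python, so any help is appreciated. Note that not all movies have a gross revenue figure, which makes this tricky for me. Also, the `span` element with `name="nv"` is used for other fields as well. Please let me know the correct approach, using either a regex or child/next-element navigation.

I was hoping to find a way to first search for a `span` whose text is "Gross:" and, if it is found, capture the next `span` element. If it is not found, add an N/A entry for that row.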

```py
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

movie_name = []
votes = []
gross = []

movie_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})

for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)
```

P.S. I tried a few snippets, but they all raised one error or another.


</details>


# Answer 1
**Score**: 0

To get the *gross revenue*, select the `<span>` that contains the text `"Gross:"` and then take its next sibling. If that tag doesn't exist, assign a default value (`-`) to the gross revenue:

```py
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
# Ask for the English version of the page so titles are not localized
headers = {'Accept-Language': 'en-US,en;q=0.5'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for item in soup.select('.lister-item-content'):
    title = item.h3.get_text(strip=True, separator=' ')
    rating = item.select_one('.ratings-imdb-rating')['data-value']
    # The <span> immediately following the one whose text contains "Gross:"
    revenue = item.select_one('span:-soup-contains("Gross:") + span')
    # select_one() returns None when a movie has no gross figure
    revenue = revenue.text if revenue else '-'
    print('{:<50} {:<5} {:<10}'.format(title[:50], rating, revenue))

```

Prints:

```
1. The Shawshank Redemption (1994)                 9.3   $28.34M
2. The Godfather (1972)                            9.2   $134.97M
3. The Dark Knight (2008)                          9     $534.86M
4. The Godfather Part II (1974)                    9     $57.30M
5. Schindler's List (1993)                         9     $96.90M
6. 12 Angry Men (1957)                             9     $4.36M
7. The Lord of the Rings: The Return of the King ( 9     $377.85M
8. Pulp Fiction (1994)                             8.9   $107.93M
9. 777 Charlie (2022)                              8.9   -
10. The Lord of the Rings: The Fellowship of the R 8.8   $315.54M
11. Inception (2010)                               8.8   $292.58M
```

...and so on.
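
Since the question also asks for the votes and for an N/A entry when the gross is missing, the same sibling selector can be wrapped in a small helper. The following is only a minimal sketch and has not been run against the live page: the inline HTML is a simplified, made-up stand-in for the page markup (the titles and numbers are invented), and `value_after_label` and `parse_movies` are hypothetical helper names. It assumes a Beautiful Soup install whose Soup Sieve version supports `:-soup-contains`.

```python
from bs4 import BeautifulSoup

# Simplified, invented markup imitating the structure described in the
# question: a "Votes:" label and an optional "Gross:" label, each followed
# by a <span name="nv"> that holds the value. This is NOT real page data.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="#">Example Movie A</a></h3>
  <p>
    <span>Votes:</span> <span name="nv">1,234</span>
    <span>|</span>
    <span>Gross:</span> <span name="nv">$1.23M</span>
  </p>
</div>
<div class="lister-item-content">
  <h3><a href="#">Example Movie B</a></h3>
  <p>
    <span>Votes:</span> <span name="nv">567</span>
  </p>
</div>
"""


def value_after_label(item, label, default="N/A"):
    """Return the text of the <span> right after the <span> containing label."""
    tag = item.select_one(f'span:-soup-contains("{label}") + span')
    # select_one() returns None when the label is absent for this movie
    return tag.get_text(strip=True) if tag else default


def parse_movies(html):
    """Collect title, votes and gross for every movie block in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        rows.append({
            "title": item.h3.a.get_text(strip=True),
            "votes": value_after_label(item, "Votes:"),
            "gross": value_after_label(item, "Gross:"),
        })
    return rows


rows = parse_movies(SAMPLE_HTML)
for row in rows:
    print(row)
```

Looking values up by their label avoids relying on the position of the `name="nv"` spans, which shifts when the gross is missing. If a table is needed, the list of dicts can be passed to `pandas.DataFrame(rows)`.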

