Finding the correct tag element with regex or next_child_element (BeautifulSoup)
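# Question
I would like to extract the votes and gross revenue for the movies listed on this page:
https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating
I am new to Python, so any help is appreciated. Note that not all movies have a gross revenue figure, which makes this tricky for me. Also, the `span` element with `name='nv'` is used for other fields as well. Please let me know the correct approach, using either regex or child/next-element navigation.
I was hoping to find a way to first search for a span with the text "Gross:" and, if it is found, capture the span element that follows it. If it is not found, an N/A entry should be added for that row.
```py
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

movie_name = []
votes = []
gross = []

movie_data = soup.findAll('div', attrs={'class': 'lister-item mode-advanced'})
for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)
```
P.S. I tried a few pieces of code, but they all raised one error or another.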
</details>
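# Answer 1
**Score**: 0
To get the *gross revenue*, select the `<span>` that contains the text `"Gross:"` and then take its next sibling. If that tag doesn't exist, assign a default value (`-`) to the gross revenue: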
```py
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
# Ask for the English version of the page so the "Gross:" label is in English
headers = {'Accept-Language': 'en-US,en;q=0.5'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for item in soup.select('.lister-item-content'):
    title = item.h3.get_text(strip=True, separator=' ')
    rating = item.select_one('.ratings-imdb-rating')['data-value']
    # The <span> immediately after the one containing "Gross:" holds the value
    revenue = item.select_one('span:-soup-contains("Gross:") + span')
    revenue = revenue.text if revenue else '-'
    print('{:<50} {:<5} {:<10}'.format(title[:50], rating, revenue))
```
Prints:
```text
1. The Shawshank Redemption (1994)                 9.3   $28.34M
2. The Godfather (1972)                            9.2   $134.97M
3. The Dark Knight (2008)                          9     $534.86M
4. The Godfather Part II (1974)                    9     $57.30M
5. Schindler's List (1993)                         9     $96.90M
6. 12 Angry Men (1957)                             9     $4.36M
7. The Lord of the Rings: The Return of the King ( 9     $377.85M
8. Pulp Fiction (1994)                             8.9   $107.93M
9. 777 Charlie (2022)                              8.9   -
10. The Lord of the Rings: The Fellowship of the R 8.8   $315.54M
11. Inception (2010)                               8.8   $292.58M
```
...and so on.
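The question also asks for the vote count and for an `N/A` entry when a value is missing. Below is a minimal offline sketch of how the same "label span, then next sibling" idea might cover both fields. It assumes the votes are marked up the same way as the gross figure, that is, a `Votes:` span followed by the value span. The sample markup, the titles "Movie A" and "Movie B", the vote counts, and the helper names `labelled_value`, `labelled_value_re` and `parse_rows` are all hypothetical and only there for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the vote counts are placeholders.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><span>1.</span> <a href="#">Movie A</a> <span>(1994)</span></h3>
  <p>
    <span class="text-muted">Votes:</span>
    <span name="nv">1,234</span>
    <span class="ghost">|</span>
    <span class="text-muted">Gross:</span>
    <span name="nv">$28.34M</span>
  </p>
</div>
<div class="lister-item-content">
  <h3><span>2.</span> <a href="#">Movie B</a> <span>(2022)</span></h3>
  <p>
    <span class="text-muted">Votes:</span>
    <span name="nv">5,678</span>
  </p>
</div>
"""


def labelled_value(item, label, default="N/A"):
    """CSS variant: text of the <span> right after the <span> containing `label`."""
    tag = item.select_one(f'span:-soup-contains("{label}") + span')
    return tag.get_text(strip=True) if tag else default


def labelled_value_re(item, label, default="N/A"):
    """Regex variant: find the label span by its text, then step to the next <span>."""
    marker = item.find("span", string=re.compile(re.escape(label)))
    sibling = marker.find_next_sibling("span") if marker else None
    return sibling.get_text(strip=True) if sibling else default


def parse_rows(html):
    """Collect one dict per movie; missing fields fall back to 'N/A'."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        rows.append({
            "title": item.h3.a.get_text(strip=True),
            "votes": labelled_value(item, "Votes:"),
            "gross": labelled_value(item, "Gross:"),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because every movie yields exactly one dict, the columns stay aligned even when the gross figure is absent, and the resulting list can be passed straight to `pd.DataFrame(rows)`. The regex variant is closer to what the question describes and should behave the same way on markup of this shape.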