Finding the correct tag element with regex or next_child_element (Beautiful Soup)

Question

I would like to extract the votes and gross revenue for the list of movies shown on this page:

https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating

I am new to Python, so your help is appreciated. Note that not all movies have a gross revenue figure, which makes this tricky for me. Also, the span element with name='nv' is used for other fields as well. Please let me know the correct approach, using either regex or a child/next element.

I was hoping to find a way to first search for a span with the text "Gross:" and, if it is found, capture the span element that follows it. If it is not found, add an N/A entry for that row.

import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

movie_name = []
votes = []
gross = []

# Each movie on the page sits in its own "lister-item" container
movie_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})

for store in movie_data:
    # The title is the link text inside the item's <h3> header
    name = store.h3.a.text
    movie_name.append(name)

P.S.: I tried a few snippets, but they all raised one error or another.

</details>


# Answer 1
**Score**: 0

To get the *gross revenue*, select the `<span>` that contains the text `"Gross:"` and then take its next sibling. If that tag doesn't exist, assign a default value (`-`) to the gross revenue:

```py
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
# Ask for the English version of the page
headers = {'Accept-Language': 'en-US,en;q=0.5'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for item in soup.select('.lister-item-content'):
    title = item.h3.get_text(strip=True, separator=' ')
    rating = item.select_one('.ratings-imdb-rating')['data-value']
    # Adjacent-sibling selector: the <span> right after the one containing "Gross:"
    revenue = item.select_one('span:-soup-contains("Gross:") + span')
    # select_one returns None when the movie has no gross figure
    revenue = revenue.text if revenue else '-'
    print('{:<50} {:<5} {:<10}'.format(title[:50], rating, revenue))
```

Prints:

```
1. The Shawshank Redemption (1994)                 9.3   $28.34M   
2. The Godfather (1972)                            9.2   $134.97M  
3. The Dark Knight (2008)                          9     $534.86M  
4. The Godfather Part II (1974)                    9     $57.30M   
5. Schindler's List (1993)                         9     $96.90M   
6. 12 Angry Men (1957)                             9     $4.36M    
7. The Lord of the Rings: The Return of the King ( 9     $377.85M  
8. Pulp Fiction (1994)                             8.9   $107.93M  
9. 777 Charlie (2022)                              8.9   -         
10. The Lord of the Rings: The Fellowship of the R 8.8   $315.54M  
11. Inception (2010)                               8.8   $292.58M  
```

...and so on.
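
The question also asks for the vote count and for an explicit N/A when a field is missing. A minimal sketch of the label-then-next-sibling approach follows: it uses `find` with a regex to locate the label `<span>`, then `find_next_sibling` to take the value after it. The inline `SAMPLE_HTML`, the helper names `value_after_label` and `extract_rows`, and the vote numbers are illustrative assumptions modeled on the markup the question describes. They are not taken from the live page, whose structure may have changed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption): two list items shaped like the ones
# described in the question. The vote counts are placeholders, not real data.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3><span>1.</span> <a href="#">The Shawshank Redemption</a> <span>(1994)</span></h3>
    <p>
      <span class="text-muted">Votes:</span>
      <span name="nv">1,234</span>
      <span class="ghost">|</span>
      <span class="text-muted">Gross:</span>
      <span name="nv">$28.34M</span>
    </p>
  </div>
</div>
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3><span>9.</span> <a href="#">777 Charlie</a> <span>(2022)</span></h3>
    <p>
      <span class="text-muted">Votes:</span>
      <span name="nv">567</span>
    </p>
  </div>
</div>
"""


def value_after_label(item, label, default="N/A"):
    """Return the text of the <span> following the <span> whose text is `label`."""
    pattern = re.compile(rf"^\s*{re.escape(label)}\s*$")
    label_span = item.find("span", string=pattern)
    if label_span is None:
        return default  # the label is missing for this movie
    value_span = label_span.find_next_sibling("span")
    return value_span.get_text(strip=True) if value_span else default


def extract_rows(html):
    """Build one dict per movie with its name, votes, and gross."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="lister-item"):
        rows.append({
            "name": item.h3.a.get_text(strip=True),
            "votes": value_after_label(item, "Votes:"),
            "gross": value_after_label(item, "Gross:"),
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Looking up each value through its label avoids relying on the position of the `name="nv"` spans, which shifts when the Gross field is absent. To try it on the real page, pass the downloaded HTML to `extract_rows` in place of the sample, provided the page still uses this markup.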

Posted by huangapple on 2023-02-27 06:53:15
Source: https://go.coder-hub.com/75575484.html