Posted: 2023-04-04 14:42:57

Why is R Web scraping code to pick all cast members and directors on the IMDB website not working?

Question

I want to scrape data from multiple pages of the IMDB website to get information on the top Nigerian movies by popularity. I have been able to get the title, year, synopsis, genre, and certificate successfully. However, I am having trouble doing the same for the cast members and directors.

This is the main IMDB link:
https://www.imdb.com/search/title/?country_of_origin=NG&start=1&ref_=adv_prv

From there, I want to go to the page of each individual movie and pull out the full list of cast members and the main directors.

For example, the first movie on the list is "The Trade". I want to go to this page: https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm and extract the full names of all the cast members and directors.
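
The step from a search-result link to the full-credits page can be sketched as a plain string substitution. This is only a minimal illustration in base R: it assumes each result's href ends in `?ref_=adv_li_tt` (the suffix that the code later in this question replaces), and `to_fullcredits_url` is a hypothetical helper name, not something from the original code.

```r
# Hypothetical helper: turn a relative title link from the search results
# into the absolute URL of that title's full-credits page.
to_fullcredits_url <- function(href, base = "https://www.imdb.com") {
  # fixed = TRUE makes "?" a literal character rather than a regex quantifier
  path <- sub("?ref_=adv_li_tt", "fullcredits/?ref_=tt_cl_sm", href, fixed = TRUE)
  paste0(base, path)
}

# Assumed href shape for "The Trade" (tt8803398)
to_fullcredits_url("/title/tt8803398/?ref_=adv_li_tt")
# "https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm"
```

If the href carries a different suffix, the substitution leaves it unchanged, so it is worth printing a few generated URLs before requesting them.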

This is what I did to get the title, year, synopsis, genre, and certificate:

library(rvest)
library(tidyverse)
movies6 = data.frame()
for(page_result in seq(from = 1, to = 201, by = 50)){

  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")

  page <- read_html(link)
  df <- page %>%
    html_nodes(".mode-advanced") %>%
    map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 year = html_nodes(.x, '.text-muted.unbold') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 genre = html_nodes(.x, '.genre') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 certificate = html_nodes(.x, '.certificate') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 rating = html_nodes(.x, '.ratings-imdb-rating strong') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .},
                 synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>%
                   html_text() %>%
                   {if(length(.) == 0) NA else .}))

  movies6 = rbind(movies6, df)
  print(paste("Page:", page_result))
}

It worked well, and this was the result:

(https://i.stack.imgur.com/xEUpo.jpg)

Then this is what I tried in order to get the complete cast list for each movie:

library(rvest)
library(tidyverse)
library(stringr)
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  movie_cast = movie_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(movie_cast)
}
movies5 = data.frame()
for(page_result in seq(from = 1, to = 151, by = 50)){

  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")

  page <- read_html(link)

  movie_links = page %>% html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")

  cast = sapply(movie_links, FUN = get_cast, USE.NAMES = FALSE)
  movies5 = rbind(movies5, data.frame(cast = ifelse(length(cast)==0,NA,cast)))
  print(paste("Page:", page_result))
}

But this is the result I am getting: only the cast of the first movie on each page populates the list, and the cast of the remaining 49 movies on each page is missing. I then modified the code to get the complete list of directors, but strangely it returns the cast instead, with the same issue as before.

(https://i.stack.imgur.com/XCMaJ.jpg)

I would really appreciate it if someone could assist me on what exactly to do regarding scraping data on the cast and directors. I have tried so many things that didn't work.

[tag:rvest] [tag:web-scraping] [tag:R] [tag:IMDB] [tag:stringr] [tag:web-scraping-multiple-pages]
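
The "first movie per page only" symptom is consistent with how `ifelse()` behaves in the line `data.frame(cast = ifelse(length(cast)==0,NA,cast))`: the result of `ifelse()` has the same length as its test, and the test here is a single TRUE/FALSE. The sketch below reproduces this in base R without any scraping; the `cast` vector is an invented stand-in for what `sapply(movie_links, get_cast)` would return for one page.

```r
# Stand-in for one page of results: one comma-separated cast string per film
# (names are invented for illustration).
cast <- c("Actor A,Actor B", "Actor C", "Actor D,Actor E")

# ifelse() returns a value shaped like its test. The test has length 1,
# so only the first element of `cast` survives.
first_only <- data.frame(cast = ifelse(length(cast) == 0, NA, cast))
nrow(first_only)   # 1

# A plain if/else returns the whole vector, giving one row per film.
all_films <- data.frame(cast = if (length(cast) == 0) NA else cast)
nrow(all_films)    # 3
```

Replacing the vectorised `ifelse()` with a scalar `if (...) ... else ...` keeps the empty-page fallback while preserving every film on the page. This does not by itself explain why the directors variant returned cast names, since that code is not shown.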

Answer 1

Score: 0

What about this inside the second loop?

cast <- page %>%
  html_nodes(".lister-item-content") %>%
  html_nodes("p:nth-child(5)") %>%
  html_text() %>%
  stringr::str_remove_all("\n") %>%
  stringr::str_extract("(?<=Stars:).*") %>%
  str_squish()

Here's what it looks like:

> head(cast, 15)
 [1] "Alexander Abolore, Anthony Abraham, Toyin Abraham, Lateef Adedimeji"
 [2] "Rosie Afuwape, Onyii Alex, Nancy Chibuike, Ray Emodi"
 [3] "Etochi Asiegbu, Oge Asiegbu, Femi Branch, Monalisa Chinda"
 [4] "Ayo Adesanya Hassan, Chinonso Arubayi, Alex Ayalogu, Rita Edward"
 [5] "Melat Abera, Toba Aboyeji, Adunni Ade, Adebowale Adedayo"
 [6] "Gbenga Titiloye, Elvina Ibru, Osas Ighodaro, Sharon Ooja"
 [7] "Victor Agbu, Cynthia Ifeoma Amadiude, Norbert Asikhia, Roseanne Chikwendu"
 [8] "Winston Ajaelo, Kerry Amadi, Tchidi Chikere, Ikenna Ezeh"
 [9] "Ayenuro Ademola, Margaret Adewunmi, Rahila Ahmed, Abiola Atanda"
[10] "Osas Ighodaro, Bolanle Ninalowo, Paul Utomi, Adunni Ade"
[11] "Kelvin Boateng, Nadia Buari, Pascaline Edwards, Jason El-Agha"
[12] "Uzor Arukwe, Shalewa Ashafa, Tobi Bakre, Demi Banwo"
[13] "Mary Ann Apollo, Anita Enoyi, Joy Igbanugo, Desmond A. Ken"
[14] "Nana Abdulmalik, Bimbo Ademoye, Ajakaya Aliyah, Monsuru Amodu"
[15] "Regina Askia, Sola Fosudo, Pete Edochie, Dolly Unachukwu"
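
The text-cleaning part of this pipeline (strip newlines, keep everything after "Stars:", squash whitespace) can be tried on its own. The sketch below is a base R approximation rather than the answer's stringr code; `extract_stars` is a hypothetical name and the sample strings are invented to imitate a credits paragraph.

```r
# Hypothetical helper mirroring str_remove_all / str_extract / str_squish
# with base R functions.
extract_stars <- function(x) {
  # Remove newlines first: "." does not match a newline in a Perl-style regex
  x <- gsub("\n", "", x, fixed = TRUE)
  # Lookbehind: match whatever follows "Stars:" without including the label
  m <- regexpr("(?<=Stars:).*", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  # Collapse runs of whitespace and trim both ends; NA stays NA
  trimws(gsub("\\s+", " ", out))
}

sample_text <- c(
  "\n    Director:\nJane Doe\n| \n    Stars:\nA One, \nB Two\n",
  "No credits listed"
)
extract_stars(sample_text)
# "A One, B Two" NA
```

Because this approach reads the search-results page, it returns only the names listed after "Stars:" there (four per film in the output above), not the full credits page the question asks about.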

Answer 2

Score: 0

I was able to get it working with the following:

library(rvest)
library(tidyverse)
# Scrape one full-credits page and return a one-row data frame
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  cast = movie_page %>%
    html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
    html_text() %>% paste(collapse = ",")
  directors = movie_page %>%
    html_nodes("h4:contains('Directed by') + table a") %>%
    html_text() %>% paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}
movies2 = data.frame()
for(page_result in seq(from = 1, to = 951, by = 50)){
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  # Rewrite each result link so it points at the film's full-credits page
  movie_links = page %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  # One data frame per film, stacked into a single page-level data frame
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies2 = rbind(movies2, df)
  print(paste("Page:", page_result))
}
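
The two selectors in `get_cast` can be checked offline against a small HTML fragment before running the full loop. The sketch below assumes rvest 1.0 or later (for `minimal_html()`); the markup is invented and only imitates the structure the selectors expect, so the real page may differ, and `parse_credits` is a hypothetical name.

```r
library(rvest)

# Invented markup: a "Directed by" heading followed by a table, then a
# cast table whose first row is a header row.
html <- '
<h4>Directed by</h4>
<table>
  <tr><td><a href="#">Dana Director</a></td></tr>
</table>
<table class="cast_list">
  <tr><td colspan="2">Cast</td></tr>
  <tr><td class="primary_photo"><a href="#">photo</a></td><td><a href="#"> Ada Actor </a></td></tr>
  <tr><td class="primary_photo"><a href="#">photo</a></td><td><a href="#"> Ben Player </a></td></tr>
</table>'

# Hypothetical helper: same selectors as get_cast(), but it takes an already
# parsed document, so no network request is needed.
parse_credits <- function(page) {
  cast <- page %>%
    html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
    html_text() %>% trimws() %>% paste(collapse = ",")
  directors <- page %>%
    html_nodes("h4:contains('Directed by') + table a") %>%
    html_text() %>% trimws() %>% paste(collapse = ",")
  data.frame(cast = cast, directors = directors)
}

credits <- parse_credits(minimal_html(html))
credits
```

Each call yields exactly one row, which is why `bind_rows()` over the per-film results produces one row per film. Note that `:contains()` is a non-standard selector extension that rvest accepts; it is not part of the CSS specification.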
