Why is R web scraping code to pick all cast members and directors on the IMDB website not working?

Question

I want to scrape multiple pages of the IMDB website to get information on the top Nigerian movies by popularity. I have been able to get the title, year, synopsis, genre, and certificate successfully. However, I am having trouble doing the same for the cast members and directors.

This is the main IMDB link:
https://www.imdb.com/search/title/?country_of_origin=NG&start=1&ref_=adv_prv

From there, I want to go into the page of each individual movie and pull out the full list of cast members and main directors.

For example, the first movie on the list is "The Trade". I want to go to this page: https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm and extract the full names of all the cast members and directors.

This is what I did to get the title, year, synopsis, genre, and certificate:

library(rvest)
library(tidyverse)

movies6 = data.frame()

for(page_result in seq(from = 1, to = 201, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)

  df <- page %>% 
  html_nodes(".mode-advanced") %>% 
  map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               year = html_nodes(.x, '.text-muted.unbold') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               genre = html_nodes(.x, '.genre') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               certificate = html_nodes(.x, '.certificate') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               rating = html_nodes(.x, '.ratings-imdb-rating strong') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .}))
              

movies6 = rbind(movies6, df)
print(paste("Page:", page_result))

}

This worked well and gave the following result:

(https://i.stack.imgur.com/xEUpo.jpg)

Then this is what I attempted in order to get the complete cast list for each movie:

library(rvest)
library(tidyverse)
library(stringr)


get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  movie_cast = movie_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(movie_cast)
}

movies5 = data.frame()

for(page_result in seq(from = 1, to = 151, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)
  
  movie_links = page %>% html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  

  cast = sapply(movie_links, FUN = get_cast, USE.NAMES = FALSE)

  movies5 = rbind(movies5, data.frame(cast = ifelse(length(cast)==0,NA,cast)))


print(paste("Page:", page_result))

}

But this is the result I am getting: only the cast of the first movie on each page populates the list, and the cast of the remaining 49 movies on each page is missing. I also modified the code to get the complete list of directors, but strangely it returns the cast instead, with the same issue as before.

(https://i.stack.imgur.com/XCMaJ.jpg)

I would really appreciate it if someone could assist me on what exactly to do regarding scraping data on the cast and directors. I have tried so many things that didn't work.

[tag:rvest] [tag:web-scraping] [tag:R] [tag:IMDB] [tag:stringr] [tag:web-scraping-multiple-pages]

Answer 1

Score: 0

What about this inside the second loop? It takes the names listed after "Stars:" for each movie directly from the search results page.

cast <- page %>% 
  html_nodes(".lister-item-content") %>% 
  html_nodes("p:nth-child(5)") %>% 
  html_text() %>% 
  stringr::str_remove_all("\n") %>% 
  stringr::str_extract("(?<=Stars:).*") %>%
  str_squish()

Here's what it looks like:

> head(cast, 15)
 [1] "Alexander Abolore, Anthony Abraham, Toyin Abraham, Lateef Adedimeji"      
 [2] "Rosie Afuwape, Onyii Alex, Nancy Chibuike, Ray Emodi"                     
 [3] "Etochi Asiegbu, Oge Asiegbu, Femi Branch, Monalisa Chinda"                
 [4] "Ayo Adesanya Hassan, Chinonso Arubayi, Alex Ayalogu, Rita Edward"         
 [5] "Melat Abera, Toba Aboyeji, Adunni Ade, Adebowale Adedayo"                 
 [6] "Gbenga Titiloye, Elvina Ibru, Osas Ighodaro, Sharon Ooja"                 
 [7] "Victor Agbu, Cynthia Ifeoma Amadiude, Norbert Asikhia, Roseanne Chikwendu"
 [8] "Winston Ajaelo, Kerry Amadi, Tchidi Chikere, Ikenna Ezeh"                 
 [9] "Ayenuro Ademola, Margaret Adewunmi, Rahila Ahmed, Abiola Atanda"          
[10] "Osas Ighodaro, Bolanle Ninalowo, Paul Utomi, Adunni Ade"                  
[11] "Kelvin Boateng, Nadia Buari, Pascaline Edwards, Jason El-Agha"            
[12] "Uzor Arukwe, Shalewa Ashafa, Tobi Bakre, Demi Banwo"                      
[13] "Mary Ann Apollo, Anita Enoyi, Joy Igbanugo, Desmond A. Ken"               
[14] "Nana Abdulmalik, Bimbo Ademoye, Ajakaya Aliyah, Monsuru Amodu"            
[15] "Regina Askia, Sola Fosudo, Pete Edochie, Dolly Unachukwu"       
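
To see what the regular expression is doing without hitting the network, here is a minimal sketch that runs the same three string steps on a made-up paragraph. The text in `raw` is hypothetical, shaped roughly like what `html_text()` might return for one search result; the real markup may differ. Note that the search page appears to list only the top-billed stars (four per movie in the output above), so this is not the full cast from the `fullcredits` page.

```r
library(stringr)

# Hypothetical text for one search result (an assumption, not copied from IMDB)
raw <- "\n    Director:\nJane Doe\n    | \n    Stars:\nActor One, \nActor Two, \nActor Three\n"

flat  <- str_remove_all(raw, "\n")           # drop line breaks so the names sit on one line
stars <- str_extract(flat, "(?<=Stars:).*")  # lookbehind: keep only what follows "Stars:"
cast  <- str_squish(stars)                   # trim ends and collapse repeated spaces

print(cast)
```

When a result has no "Stars:" label, `str_extract()` returns `NA` for that movie, so the vector stays aligned with the other columns.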

Answer 2

Score: 0

I was able to get it working with this:

library(rvest)    # read_html(), html_nodes(), html_text(), html_attr()
library(stringr)  # str_replace(), fixed()
library(dplyr)    # bind_rows()

# Scrape one full-credits page and return a one-row data frame
get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  # Cast names: second cell of every cast-table row except the first
  cast = movie_page %>%
    html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
    html_text() %>%
    paste(collapse = ",")
  # Directors: links in the table right after the "Directed by" heading
  directors = movie_page %>%
    html_nodes("h4:contains('Directed by') + table a") %>%
    html_text() %>%
    paste(collapse = ",")
  return(data.frame(cast = cast, directors = directors))
}

movies2 = data.frame()

for (page_result in seq(from = 1, to = 951, by = 50)) {
  link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  page <- read_html(link)
  # Turn each title link into the link of its full-credits page
  movie_links = page %>%
    html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = "fullcredits/?ref_=tt_cl_sm") %>%
    paste("http://www.imdb.com", ., sep = "")
  # One data frame row per movie, so every movie on the page is kept
  movie_data = lapply(movie_links, get_cast)
  df = bind_rows(movie_data)
  movies2 = rbind(movies2, df)

  print(paste("Page:", page_result))
}
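
A likely reason the loop in the question kept only one movie per page is its `ifelse()` call. `ifelse()` returns a value with the same length as its test, and `length(cast) == 0` is a single `TRUE`/`FALSE`, so only the first element of `cast` survives. Building one data frame row per movie with `lapply()` and `bind_rows()` sidesteps this. A minimal sketch of the difference, using made-up cast strings (hypothetical values, not scraped data):

```r
# Made-up stand-ins for the per-movie cast strings returned by sapply()
cast <- c("Actor A,Actor B", "Actor C,Actor D", "Actor E")

# Pattern from the question: the test has length 1, so ifelse() returns one value
kept_by_ifelse <- ifelse(length(cast) == 0, NA, cast)

# Plain if/else returns the whole vector
kept_by_if <- if (length(cast) == 0) NA_character_ else cast

print(length(kept_by_ifelse))
print(length(kept_by_if))
print(nrow(data.frame(cast = kept_by_if)))
```

As for the directors attempt returning cast names: if the `.primary_photo+ td a` selector was left unchanged, it would still match cells in the cast table, which would explain that result. The `h4:contains('Directed by') + table a` selector targets the table that follows the "Directed by" heading instead.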

huangapple, published April 4, 2023, 14:42:57