Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

Question

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.

This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f

But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png

Why?

This is what I tried:

require 'nokogiri'
require 'open-uri'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read

html = Nokogiri::HTML.parse(serialized_html)

title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
# Reads the src attribute of the first matching img element.
poster = html.search('.film-poster img').attribute('src').value

{
  title: title,
  overview: overview,
  poster_url: poster,
}

Answer 1

Score: 2

It has nothing to do with your Ruby code.

If you run something like this in your terminal:

curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/

you will see that the HTML output does not contain the images you are looking for. You can see them in your browser because, after the initial load, some JavaScript runs and loads more resources.

The Ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c

Play with your browser's network inspector and you will be able to identify the different parts of the page and how each one loads.
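
The Ajax request mentioned above can also be made directly from Ruby. This is only a minimal sketch: it assumes the endpoint path follows the pattern of the URL above and that the response is an HTML fragment. The helper names poster_ajax_url and fetch_poster_fragment are hypothetical, and since the meaning of the k parameter is not known, it is simply passed through when supplied.

```ruby
require 'open-uri'
require 'uri'

# Hypothetical helper: builds the Ajax poster URL for a film slug, following
# the pattern of the URL seen in the network inspector. The `k` value is
# copied from the browser as-is; its meaning is an unknown here.
def poster_ajax_url(slug, width: 500, height: 750, k: nil)
  url = "https://letterboxd.com/ajax/poster/film/#{slug}/std/#{width}x#{height}/"
  url += "?#{URI.encode_www_form(k: k)}" if k
  url
end

# Hypothetical helper: downloads whatever the endpoint returns (assumed to be
# an HTML fragment) as a string. Performs a network request when called.
def fetch_poster_fragment(slug, **options)
  URI.open(poster_ajax_url(slug, **options)).read
end
```

Calling fetch_poster_fragment("glass-onion-a-knives-out-mystery", k: "0c10a16c") and handing the result to Nokogiri::HTML.fragment should then let you look for the img element, provided the endpoint still answers a plain request the way it answers the browser.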

Answer 2

Score: 0

Nokogiri does not execute JavaScript. However, the link has to be somewhere in the page, or at least there has to be a link to some API that returns it.

The first place I would look is the data attributes of the image element or its parent. In this case, however, it was hidden in an inline script along with some other interesting data about the movie.

First download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know should be in the file. I searched for ce7ed2a83f, part of the image URL, and found the JSON.
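
The same search can be done from Ruby instead of a text editor. The following is a rough sketch: lines_containing is a hypothetical helper, and sample_html is a made-up stand-in for the downloaded page, not the real markup.

```ruby
# Hypothetical helper: returns [line_number, line] pairs for every line of
# `html` that contains `needle`, so you can see where a known value lives.
def lines_containing(html, needle)
  html.each_line.with_index(1).filter_map do |line, number|
    [number, line.strip] if line.include?(needle)
  end
end

# Stand-in for the downloaded page (an assumption, not the real markup).
sample_html = <<~HTML
  <img src="https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png">
  <script type="application/ld+json">
  {"image":"https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f"}
  </script>
HTML

# Look for a fragment of the URL we expect to find somewhere in the page.
hits = lines_containing(sample_html, 'ce7ed2a83f')
hits.each { |number, line| puts "#{number}: #{line}" }
```

For the real page, pass the string returned by URI.open(url).read in place of sample_html.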

Then the data can be extracted like this:

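require 'nokogiri'
require 'open-uri'
require 'json'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)

# The film's metadata is embedded as JSON-LD in an inline script.
script = html.at_css('script[type="application/ld+json"]')
# Keep only the span from the first "{" to the last "}" so that any
# non-JSON text around the object is dropped before parsing.
data_str = script.text[/\{.*\}/m]
data = JSON.parse(data_str)
puts data['image']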
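
The extraction above pulls the JSON object out of the script text with a regular expression and will raise if the script is missing or the JSON is malformed. A more defensive variant might look like the sketch below. The helper name parse_json_ld is hypothetical, and the sample text, including the comment wrapper around the JSON, is an assumption about what such a script body can look like.

```ruby
require 'json'

# Hypothetical helper: takes the raw text of a JSON-LD <script> element and
# returns the parsed Hash, or nil when no JSON object can be recovered.
def parse_json_ld(script_text)
  # Keep everything from the first "{" to the last "}", which drops any
  # comment wrapper placed around the JSON.
  json = script_text[/\{.*\}/m]
  return nil unless json

  JSON.parse(json)
rescue JSON::ParserError
  nil
end

# Stand-in for the script contents; the wrapper comments are an assumption.
sample = <<~TEXT
  /* <![CDATA[ */
  {"@type":"Movie","name":"Glass Onion","image":"https://example.com/poster.jpg"}
  /* ]]> */
TEXT

data = parse_json_ld(sample)
puts data && data['image']
```

With Nokogiri, the argument would be the text of the matched script element, and a nil result tells you the page layout has changed rather than crashing the scraper.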

huangapple
  • Published on February 19, 2023 at 13:23:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75498170.html