Trying to scrape an image using Nokogiri but it returns a link that I was not expecting
Question
I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
require 'nokogiri'
require 'open-uri'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
# This is the value that comes back as the empty-poster placeholder
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
Answer 1
Score: 2
It has nothing to do with your Ruby code.
If you run something like this in your terminal:
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
you can see that the returned HTML does not contain the image you are looking for. You see it in your browser because, after the initial load, some JavaScript runs and loads more resources.
The Ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with your browser's network inspector and you will be able to identify the different parts of the website and how each one loads.
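One way to notice this situation from inside a scraper is to check whether the scraped `src` is the static placeholder. The sketch below is only an illustration: `placeholder_poster?` is a hypothetical helper, and the assumption that placeholder file names start with `empty-poster` is taken solely from the URL quoted in the question.

```ruby
require 'uri'

# Hypothetical helper: returns true when a scraped poster URL looks like
# the static placeholder rather than a real film poster.
# Assumption: placeholder file names start with "empty-poster", as in the
# URL quoted in the question.
def placeholder_poster?(url)
  File.basename(URI.parse(url).path).start_with?('empty-poster')
end

placeholder = 'https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png'
real = 'https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f'

puts placeholder_poster?(placeholder) # true
puts placeholder_poster?(real)        # false
```

If the check returns true, the real image is most likely loaded later by JavaScript, and the URL has to be found elsewhere in the page or in a separate request.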
Answer 2
Score: 0
Nokogiri does not execute JavaScript. However, the link has to be in the page somewhere, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent. In this case, however, it was hidden in an inline script, along with some other interesting data about the movie.
First download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know should be in the file: I searched for ce7ed2a83f, part of the image URL, and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
# The film metadata sits in an inline JSON-LD <script> tag
script = html.at_css('script[type="application/ld+json"]')
# Keep only the {...} span, dropping anything wrapped around the JSON
data_str = script.text[/\{.*\}/m]
data = JSON.parse(data_str)
data['image'] # the poster URL
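The regex step can be isolated into a small helper, which makes it easy to try without hitting the network. This is a minimal sketch: `extract_json_ld` is a hypothetical name, and the sample script body (including the comment markers wrapped around the JSON) is made up for illustration, not copied from the real page.

```ruby
require 'json'

# Hypothetical helper: takes the text of a <script> tag, keeps the span from
# the first "{" to the last "}", and parses it as JSON.
# Returns nil when no such span is found.
def extract_json_ld(script_text)
  json = script_text[/\{.*\}/m]
  return nil if json.nil?

  JSON.parse(json)
end

# Made-up script body; the real page's content will differ.
sample = <<~SCRIPT
  /* <![CDATA[ */
  {"@type": "Movie", "name": "Example Film", "image": "https://example.com/poster.jpg"}
  /* ]]> */
SCRIPT

data = extract_json_ld(sample)
puts data['image']
```

Returning nil instead of raising when no braces are found lets the caller decide how to handle a page whose layout has changed.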