February 19, 2023, 13:23:05

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

Question

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.

This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f

But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png

Why?

This is what I tried:

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
  title: title,
  overview: overview,
  poster_url: poster,
}

Answer 1

Score: 2

It has nothing to do with your Ruby code.

If you run something like this in your terminal:

curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/

you can see that the output HTML does not contain the image you are looking for. You see it in your browser because, after the initial load, some JavaScript runs and loads more resources.

The Ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c

Play with your browser's network inspector and you will be able to identify the different parts of the website and how each one loads.

Answer 2

Score: 0

Nokogiri does not execute JavaScript. However, the link has to be somewhere in the response, or at least there has to be a link to some API that returns it.

The first place I would look is the data attributes of the image element or its parent. In this case, however, it was hidden in an inline script, along with some other interesting data about the movie.

First, download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know should be in the file: I searched for ce7ed2a83f, part of the image URL, and found the JSON.

The data can then be extracted like this:

require 'nokogiri'
require 'open-uri'
require 'json'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
# Serialize the first JSON-LD <script> tag, drop newlines, and keep only the {...} object
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n", '').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image'] # the poster URL
