June 2, 2023 12:43:16

Beautiful Soup Img Src Scrape

Question

Problem: I am trying to scrape the image source locations for pictures on a website, but I cannot get Beautiful Soup to scrape them successfully.

Details:

Here is the website
The three images I want have the following HTML tags:
- <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg" style="display: none;">
- <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg" style="display: none;">
- <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg" style="display: none;">

Code I've Tried:

soup.find_all('img')
soup.select('#imageFlicker')
soup.select('#imageFlicker > div')
soup.select('#imageFlicker > div > img:nth-child(1)')
soup.find_all('div', {'class':'exercise-post__step-image-wrap'})
soup.find_all('div', attrs={'id': 'imageFlicker'})
soup.select_all('#imageFlicker > div > img:nth-child(1)')

The very first query, soup.find_all('img'), returns every image on the page except the three I want. I've also tried inspecting the children and grandchildren of each of the results above, and none of that works either.

What am I missing here? I suspect some JavaScript is toggling the CSS display property between block and none so that the three images play like a GIF rather than appearing as three separate images. Is that interfering in a way I don't understand? Thank you!

Answer 1

Score: 2

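The images are added to the page dynamically by JavaScript. requests only downloads the static HTML and does not execute scripts the way a browser does, so the three <img> tags are not in the response that Beautiful Soup parses.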

However, you can search for the JavaScript variable:

var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};

with the regex function re.search(), then parse the captured string with json.loads() into a Python dict so that the image URLs are easy to access.

Example

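import json
import re

import requests

url = 'https://www.acefitness.org/resources/everyone/exercise-library/14/bird-dog/'

# The image URLs sit in an inline script as `var data = {...};`, so capture the object literal.
html = requests.get(url).text
images = json.loads(re.search(r'var data = (.*?);', html).group(1))['images']
print(images)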
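
The same idea can be tried offline, which makes it easier to see what happens when the variable is missing or malformed. The sketch below is only an illustration under a few assumptions: SAMPLE_HTML is a hypothetical stand-in for the downloaded page (only the var data line comes from the answer above), and extract_image_urls is a made-up helper name. Instead of a non-greedy regex up to the first semicolon, it uses json.JSONDecoder().raw_decode(), which parses one JSON value from a given position and ignores the rest of the string.

```python
import json
import re

# Hypothetical stand-in for the downloaded page: static HTML without the three
# <img> tags, plus an inline script carrying the `var data = {...};` assignment.
SAMPLE_HTML = """
<div id="imageFlicker"><div class="exercise-post__step-image-wrap"></div></div>
<script>
var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};
</script>
"""


def extract_image_urls(html):
    """Return the "images" list from the first `var data = {...}` in html, or []."""
    # The trailing \s* ensures match.end() points directly at the JSON value.
    match = re.search(r'var\s+data\s*=\s*', html)
    if match is None:
        return []
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it, so the trailing `;` needs no handling.
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return []
    return data.get("images", []) if isinstance(data, dict) else []


image_urls = extract_image_urls(SAMPLE_HTML)
for image_url in image_urls:
    print(image_url)
```

On the real page, the string returned by requests.get(url).text would be passed in place of SAMPLE_HTML. Returning an empty list rather than raising keeps the caller simple if the site later renames or removes the variable.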
