Beautiful Soup Img Src Scrape

Question

Problem: I am trying to scrape the image source locations for pictures on a website, but I cannot get Beautiful Soup to scrape them successfully.

Details:

  • Here is the website

  • The three images I want have the following HTML tags:

    • <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg" style="display: none;">
    • <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg" style="display: none;">
    • <img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg" style="display: none;">

Code I've Tried:

  • soup.find_all('img')
  • soup.select('#imageFlicker')
  • soup.select('#imageFlicker > div')
  • soup.select('#imageFlicker > div > img:nth-child(1)')
  • soup.find_all('div', {'class':'exercise-post__step-image-wrap'})
  • soup.find_all('div', attrs={'id': 'imageFlicker'})
  • soup.select_all('#imageFlicker > div > img:nth-child(1)')

The very first query, soup.find_all('img'), returns every image on the page except the three I want. I've also tried walking the children and grandchildren of each of the results above, and none of that works either.

What am I missing here? I think there may be JavaScript that toggles the CSS display property between block and none, so the three images look like a GIF instead of three separate images. Is that interfering in a way I don't understand? Thank you!

Answer 1

Score: 2

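The images are added to the page dynamically by JavaScript. requests only downloads the raw HTML and does not execute scripts the way a browser does, so those <img> tags are never in the response that Beautiful Soup parses.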
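As a rough illustration of that point, here is a minimal sketch that uses only the standard library. RAW_HTML is a made-up, trimmed-down stand-in for what requests would receive (the example.com URLs and the logo image are assumptions, not the real page), and ImgSrcCollector is a hypothetical helper that does roughly what find_all('img') does on static markup.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw HTML returned by requests.
# The slideshow container is empty; the image URLs live only in the script.
RAW_HTML = """
<html><body>
  <img src="/static/logo.png">
  <div id="imageFlicker"></div>
  <script>
    var data = {"images":["https://example.com/14-1.jpg","https://example.com/14-2.jpg"],"interval":600};
  </script>
</body></html>
"""


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in the static markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = ImgSrcCollector()
collector.feed(RAW_HTML)

static_sources = collector.sources
print(static_sources)          # only the logo exists as a real <img> tag
print("14-1.jpg" in RAW_HTML)  # the slideshow URL is still present as script text
```

No parser can find <img> tags that are not in the markup it is given, so in this sketch the slideshow URLs exist only as text inside the script.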

However, you can search for the JavaScript variable:

var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};

with re.search() and parse the captured JSON string with json.loads() into a Python dictionary, so that you can access its contents easily.

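Example:

import json
import re

import requests

url = 'https://www.acefitness.org/resources/everyone/exercise-library/14/bird-dog/'

# Capture the object literal assigned to `var data` in the page's inline script
match = re.search(r'var data = (.*?);', requests.get(url).text)
# Parse it as JSON and print the list of image URLs
print(json.loads(match.group(1))['images'])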
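The non-greedy pattern above stops at the first semicolon, which is fine for this page but would cut the match short if a URL ever contained ';'. A slightly more defensive variant, sketched below under stated assumptions, locates the assignment with a regex and then lets json.JSONDecoder.raw_decode() read exactly one JSON value from that position. The helper name extract_js_object and the sample_html string are hypothetical; in practice the input would be requests.get(url).text. It also assumes the assigned value is valid JSON, not an arbitrary JavaScript object literal.

```python
import json
import re


def extract_js_object(html, var_name="data"):
    """Return the JSON value assigned to `var <var_name> = ...` in html, or None."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (here, the trailing ';').
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# Hypothetical sample input; note the ';' inside the first URL.
sample_html = (
    '<script>var data = {"images":["https://example.com/a;1.jpg",'
    '"https://example.com/b.jpg"],"interval":600};</script>'
)

data = extract_js_object(sample_html)
images = data["images"] if data else []
print(images)
```

Returning None instead of raising keeps the caller's handling simple when the site changes its markup and the variable is missing or no longer valid JSON.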

huangapple
  • Posted on June 2, 2023 at 12:43:16
  • Source link (please keep it when reposting): https://go.coder-hub.com/76387171.html