Beautiful Soup Img Src Scrape
Problem: I am trying to scrape the image source locations for pictures on a website, but I cannot get Beautiful Soup to scrape them successfully.
Details:
The three images I want have the following HTML tags:
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg" style="display: none;">
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg" style="display: none;">
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg" style="display: none;">
Code I've Tried:
soup.find_all('img')
soup.select('#imageFlicker')
soup.select('#imageFlicker > div')
soup.select('#imageFlicker > div > img:nth-child(1)')
soup.find_all('div', {'class':'exercise-post__step-image-wrap'})
soup.find_all('div', attrs={'id': 'imageFlicker'})
soup.select_all('#imageFlicker > div > img:nth-child(1)')
The very first query, soup.find_all('img'), gets every image on the page except the three images I want. I've tried looking at the children and sub-children of each of the above, and none of that works either.
What am I missing here? I think there may be JavaScript that is changing the CSS display property from block to none and back, so the three images look like a GIF instead of three different images. Is that messing things up in a way I'm not understanding? Thank you!
Answer 1
Score: 2
The content is provided dynamically via JavaScript, and requests does not execute JavaScript the way a browser does, so the img tags never appear in the HTML that Beautiful Soup parses.
However, you can search the page source for the JavaScript variable:
var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};
Extract it with re.search() and parse the captured string with json.loads() into a Python dict, so that you can access its contents easily.
Example
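import json
import re
import requests
url = 'https://www.acefitness.org/resources/everyone/exercise-library/14/bird-dog/'
html = requests.get(url).text
# Capture the object literal assigned to `var data` and parse it as JSON
images = json.loads(re.search(r'var data = (.*?);', html).group(1))['images']
print(images)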
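The one-liner above assumes the variable is always present and that no semicolon occurs inside the JSON itself. A slightly more defensive variant might look like the sketch below. It is not part of the original answer: the function name extract_image_urls and the SAMPLE_HTML string are made up for illustration (the sample only mimics the `var data` line quoted above, and the real page source may differ), and it uses the standard library's json.JSONDecoder.raw_decode instead of a non-greedy regex capture.

```python
import json
import re

# Hypothetical stand-in for the page source; only the `var data` line is
# taken from the answer above, the surrounding markup is an assumption.
SAMPLE_HTML = """
<html><body>
<div id="imageFlicker"></div>
<script>
var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};
</script>
</body></html>
"""


def extract_image_urls(html):
    """Return the "images" list from a `var data = {...}` assignment, or []."""
    match = re.search(r"var\s+data\s*=\s*", html)
    if match is None:
        return []
    try:
        # raw_decode parses exactly one JSON value starting at the given
        # index and ignores whatever follows it (here, the trailing ";").
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    return data.get("images", [])


image_urls = extract_image_urls(SAMPLE_HTML)
print(image_urls)
```

To use it against the live page, pass requests.get(url).text in place of SAMPLE_HTML. Returning an empty list when the variable is missing keeps the caller from crashing on an AttributeError if the site changes its markup.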