Why does my Python-Requests script keep downloading the same page when using a list of URLs?

Question

I am trying to download the contents of a site that I legally own; specifically, I want to download the pictures from it.
So far I have collected the URLs of the individual pages in a text file, urls.txt:

http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=p36Pao
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zZB9ez
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zAKnqp
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227420
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227378
--snip--

I then wrote secondary_page.py to download the images on each page:

import requests
import bs4
import time

with open("urls.txt") as file_object:
    # strip the trailing newline that each line otherwise keeps
    urls = [line.strip() for line in file_object]

for url in urls:
    print(f"URL={url}")
    resp = requests.get(url)
    txt = resp.text
    sp = bs4.BeautifulSoup(txt, "html.parser")
    names = sp.select("div[class='field-value xingming']")
    imgs = sp.select("img[data-src]")
    # download the first two images on the page
    for i in range(0, 2):
        img = imgs[i]
        link = img.get("data-src")
        res2 = requests.get(link)
        with open(f"caches/{names[0].text}_{i + 1}.jpg", "wb") as img_file:
            for chunk in res2.iter_content(100000):
                img_file.write(chunk)
        print(f"DOWNLOADING {names[0].text}_{i + 1}.jpg")
    time.sleep(1)

but for some reason it won't work. The problem is that it keeps downloading the same page (specifically, the first page) over and over, even though each iteration requests a different URL. My debugging is included in the code: the print statement shows that the URLs themselves are correct, since a different URL is printed on every iteration. I also added time.sleep(1) to make sure the website was not blocking me, but to no avail.
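
A more direct check than printing the URLs is to hash each response body: if the digests are identical across URLs, the server itself is returning the same HTML every time (for example, because the real content is rendered client-side by JavaScript). A minimal sketch of that check, reusing urls.txt from above:

import hashlib
import requests

with open("urls.txt") as file_object:
    urls = [line.strip() for line in file_object]

for url in urls:
    resp = requests.get(url)
    # identical digests for different URLs mean the server sends the
    # same HTML shell regardless of the query parameters
    digest = hashlib.md5(resp.text.encode("utf-8")).hexdigest()
    print(url, digest)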

Answer 1

Score: 0

This is slightly different in that the actual image file names are used for the output files.

It also uses multithreading for better performance:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor as TPE
import os

TARGET_DIRECTORY = '/Volumes/G-Drive/jpgs'
URL_LIST = '/Volumes/G-Drive/urls.txt'
image_set = set()

def process(url):
    # stream each image to disk, using the last path segment as the file name
    print('Processing', url)
    filename = url.split('/')[-1]
    with open(os.path.join(TARGET_DIRECTORY, filename), 'wb') as jpg:
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_content(16*1024):
                jpg.write(chunk)

with TPE() as tpe:
    with open(URL_LIST) as file:
        for url in map(str.strip, file):
            with requests.get(url) as response:
                response.raise_for_status()
                soup = BS(response.text, 'lxml')
                for div in soup.find_all('div', class_='content-photos content-item'):
                    for img in div.select('img[data-src]'):
                        image = img.get('data-src')
                        # skip images already queued for download
                        if image not in image_set:
                            image_set.add(image)
                            tpe.submit(process, image)
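
Note that this assumes every data-src attribute already holds an absolute URL. If the site emits relative paths instead, they would need to be resolved against the page URL first, for example with urllib.parse.urljoin (the path below is a made-up illustration):

from urllib.parse import urljoin

page_url = 'http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227420'
data_src = '/uploads/photo_1.jpg'  # hypothetical relative path

# urljoin leaves absolute URLs untouched and resolves relative
# ones against the page they appeared on
image_url = urljoin(page_url, data_src)
print(image_url)  # -> http://www.x10.com.cn/uploads/photo_1.jpg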
