Why does my Python-Requests script keep downloading the same page when using a list of URLs?

Question

I am trying to download the contents of a site that I legally own; specifically, I want to download the pictures from it.
So far I have collected the URLs of the individual pages in a text file, urls.txt:

http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=p36Pao
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zZB9ez
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zAKnqp
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227420
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227378
--snip--

I then wrote secondary_page.py to download the images on each page:

import requests
import bs4
import time

with open("urls.txt") as file_object:
    # strip the trailing newline that each line otherwise keeps
    urls = [line.strip() for line in file_object]

for url in urls:
    print(f"URL={url}")
    resp = requests.get(url)
    txt = resp.text
    sp = bs4.BeautifulSoup(txt, "html.parser")
    names = sp.select("div[class='field-value xingming']")
    imgs = sp.select("img[data-src]")
    # download the first two images on the page
    for i in range(0, 2):
        img = imgs[i]
        link = img.get("data-src")
        res2 = requests.get(link)
        with open(f"caches/{names[0].text}_{i + 1}.jpg", "wb") as img_file:
            for chunk in res2.iter_content(100000):
                img_file.write(chunk)
        print(f"DOWNLOADING {names[0].text}_{i + 1}.jpg")
    time.sleep(1)

but for some reason it won't work. The problem is that it keeps downloading the same page (specifically, the first page) over and over, even though each iteration requests a different URL. My debugging is included in the code: the print statement shows that the URLs themselves are correct, since a different URL is printed on every iteration. I also added time.sleep(1) to make sure the website was not blocking me, but to no avail.
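
A more direct check than printing the URLs is to hash each response body: if the digests are identical across URLs, the server itself is returning the same HTML every time (for example, because the real content is rendered client-side by JavaScript). A minimal sketch of that check, reusing urls.txt from above:

import hashlib
import requests

with open("urls.txt") as file_object:
    urls = [line.strip() for line in file_object]

for url in urls:
    resp = requests.get(url)
    # identical digests for different URLs mean the server sends the
    # same HTML shell regardless of the query parameters
    digest = hashlib.md5(resp.text.encode("utf-8")).hexdigest()
    print(url, digest)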

Answer 1

Score: 0

This is slightly different in that the actual image file names are used for the output files.

It also uses multithreading for better performance:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor as TPE
import os

TARGET_DIRECTORY = '/Volumes/G-Drive/jpgs'
URL_LIST = '/Volumes/G-Drive/urls.txt'
image_set = set()

def process(url):
    # stream each image to disk, using the last path segment as the file name
    print('Processing', url)
    filename = url.split('/')[-1]
    with open(os.path.join(TARGET_DIRECTORY, filename), 'wb') as jpg:
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_content(16*1024):
                jpg.write(chunk)

with TPE() as tpe:
    with open(URL_LIST) as file:
        for url in map(str.strip, file):
            with requests.get(url) as response:
                response.raise_for_status()
                soup = BS(response.text, 'lxml')
                for div in soup.find_all('div', class_='content-photos content-item'):
                    for img in div.select('img[data-src]'):
                        image = img.get('data-src')
                        # skip images already queued for download
                        if image not in image_set:
                            image_set.add(image)
                            tpe.submit(process, image)
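
Note that this assumes every data-src attribute already holds an absolute URL. If the site emits relative paths instead, they would need to be resolved against the page URL first, for example with urllib.parse.urljoin (the path below is a made-up illustration):

from urllib.parse import urljoin

page_url = 'http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227420'
data_src = '/uploads/photo_1.jpg'  # hypothetical relative path

# urljoin leaves absolute URLs untouched and resolves relative
# ones against the page they appeared on
image_url = urljoin(page_url, data_src)
print(image_url)  # -> http://www.x10.com.cn/uploads/photo_1.jpg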
