Why does my Python-Requests script keep downloading the same page when using a list of URLs?
Question
I am trying to download the contents of a site; I am the legal owner of the site, and I want to download the pictures from it.
So far I have obtained the URLs of the individual pages in this text document, urls.txt:
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=p36Pao
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zZB9ez
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=zAKnqp
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227420
http://www.x10.com.cn/front/active-view-content?code=pRqKeo&active_role_id=p36Pao&content_id=227378
--snip--
And I have written secondary_page.py to download the images on each page:
import requests
import bs4
import time

with open("urls.txt") as file_object:
    urls = file_object.readlines()

for url in urls:
    print(f"URL={url}")
    resp = requests.get(url)
    txt = resp.text
    sp = bs4.BeautifulSoup(txt)
    names = sp.select("div[class='field-value xingming']")
    imgs = sp.select("img[data-src]")
    for i in range(0, 2):
        img = imgs[i]
        link = img.get("data-src")
        res2 = requests.get(link)
        with open(f"caches/{names[0].text}_{i + 1}.jpg", "wb") as file_object:
            for chunk in res2.iter_content(100000):
                file_object.write(chunk)
        print(f"DOWNLOADING {names[0]}_{i + 1}.jpg")
    time.sleep(1)
but for some reason it won't work. The problem is that it keeps downloading the same page (specifically, the first page) over and over, even though every iteration requests a different URL. I have included my debug method in the code: the print statement verifies the URLs, and it prints a different URL on every iteration, so the URLs themselves are right. I then added time.sleep(1) to make sure the website was not blocking me, but to no avail.
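A minimal sanity check, not part of the original post: readlines() keeps the trailing newline on every URL, so each request may carry a stray "\n" at the end of its query string (note that the answer below reads the file with map(str.strip, file) instead). Stripping the lines and fingerprinting each response body makes it obvious whether the server is really returning the identical page:

import hashlib

import requests

with open("urls.txt") as f:
    urls = [line.strip() for line in f]  # drop the "\n" that readlines() keeps

for url in urls:
    resp = requests.get(url)
    # Identical digests mean the server sent the same body regardless of the
    # query string; resp.url shows the URL after any redirect.
    digest = hashlib.sha256(resp.content).hexdigest()[:12]
    print(f"{digest}  {resp.url}")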
Answer 1
Score: 0
This is slightly different inasmuch as the actual image file names are used as the target output.
It also uses multithreading for enhanced performance:
import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor as TPE
import os

TARGET_DIRECTORY = '/Volumes/G-Drive/jpgs'
URL_LIST = '/Volumes/G-Drive/urls.txt'

image_set = set()

def process(url):
    # Download one image, streaming it to disk under its original file name.
    print('Processing', url)
    filename = url.split('/')[-1]
    with open(os.path.join(TARGET_DIRECTORY, filename), 'wb') as jpg:
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_content(16 * 1024):
                jpg.write(chunk)

with TPE() as tpe:
    with open(URL_LIST) as file:
        # str.strip removes the trailing newline from each line of the file.
        for url in map(str.strip, file):
            with requests.get(url) as response:
                response.raise_for_status()
                soup = BS(response.text, 'lxml')
                for div in soup.find_all('div', class_='content-photos content-item'):
                    for img in div.select('img[data-src]'):
                        image = img.get('data-src')
                        if image not in image_set:
                            # Deduplicate so each image is submitted only once.
                            image_set.add(image)
                            tpe.submit(process, image)
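A side note on why this may also cure the original symptom (my observation, not part of the answer): map(str.strip, file) removes the trailing newline that readlines() left on every URL in the question's code, and the image_set membership test keeps the same data-src from being fetched twice. Since image_set is only modified from the main thread while the worker threads merely download, no locking is needed.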