How to determine the file extension of a download in Python?
# Question
This is my source code for downloading JPG images:
```python
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import urllib.request
import os
import shutil
from mimetypes import guess_extension

img_folder = "c:/test"
# Start from an empty download folder
if os.path.exists(img_folder):
    shutil.rmtree(img_folder)

path = r"C:\Users\qpslt\Desktop\py\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(path)
site_url = "https://gall.dcinside.com/board/view/?id=baseball_new8&no=10131338&exception_mode=recommend&page=1"
driver.get(site_url)
images = driver.find_elements_by_xpath('//div[@class="writing_view_box"]//img')

for i, img in enumerate(images, 1):
    img_url = img.get_attribute('src')
    print(i, img_url)
    r = requests.get(img_url, headers={'Referer': site_url})
    try:  # create the folder
        if not os.path.exists(img_folder):
            os.makedirs(img_folder)
    except Exception as er:
        print("An error occurred: {}".format(er))
        break
    # The extension is hard-coded here, which is the problem
    with open("c:/test/{}.jpg".format(i), 'wb') as f:
        f.write(r.content)
```
I don't always know the extension of the file being downloaded.
How can I find out the extension of the file I am downloading?
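# Answer 1
**Score**: 5
If the image link has no extension (for example, if the image is generated dynamically by a PHP script), you can map the `Content-Type` header of the image response to a file extension using [`mimetypes.guess_extension()`](https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_extension).
For example: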
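```python
import mimetypes
...
r = requests.get(img_url, headers={'Referer': site_url})
# Drop any parameters such as "; charset=..." before looking up the media type
extension = mimetypes.guess_extension(r.headers.get('content-type', '').split(';')[0])
...
with open("c:/test/{}{}".format(i, extension or '.jpg'), 'wb') as f:
    f.write(r.content)
```
The example above uses the mapped extension when one exists and falls back to `.jpg` when there is no mapping (for example, if the `Content-Type` header is missing or specifies an unknown type).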
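To try the header-to-extension mapping without making a request, the lookup can be wrapped in a small helper. This is only a sketch: `extension_for` and its `default` parameter are names made up for illustration and are not part of `requests` or `mimetypes`. The extension chosen for a given type (for example `image/jpeg`) can also depend on the Python version and the system's MIME tables.

```python
import mimetypes


def extension_for(content_type, default='.jpg'):
    """Map a Content-Type header value to a file extension, or return `default`.

    `extension_for` is a hypothetical helper, not a library function.
    """
    # Keep only the media type: "image/png; charset=binary" -> "image/png"
    media_type = (content_type or '').split(';')[0].strip()
    # guess_extension() returns None for unknown or empty types
    return mimetypes.guess_extension(media_type) or default


print(extension_for('image/png'))                      # .png
print(extension_for('image/gif; charset=binary'))      # .gif
print(extension_for(None))                             # .jpg (fallback)
print(extension_for('application/x-not-a-real-type'))  # .jpg (fallback)
```

In the question's loop this would presumably be called as `extension_for(r.headers.get('content-type'))` just before opening the output file.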
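# Answer 2
**Score**: 1
I ran into the same problem and decided to look through all the response headers manually (for example with `r.headers.get()`, where `r` is the `requests` response). One of them actually contained a filename with the extension.
In my case it was `'Content-Disposition': 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'`.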
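A minimal sketch of pulling the extension out of such a header is shown here. The helper name `extension_from_content_disposition` is hypothetical, and the parsing is deliberately simple: it assumes a plain `filename=` parameter (optionally quoted and percent-encoded, as in the header above), does not handle the extended `filename*=` form, and would be confused by a semicolon inside a quoted filename.

```python
import os
from urllib.parse import unquote


def extension_from_content_disposition(value):
    """Return the extension (e.g. '.doc') from a Content-Disposition value, or None.

    Hypothetical helper for illustration; only the plain `filename=` form is handled.
    """
    if not value:
        return None
    # Skip the disposition type ("attachment", "inline") and scan the parameters
    for part in value.split(';')[1:]:
        name, sep, raw = part.strip().partition('=')
        if sep and name.strip().lower() == 'filename':
            # Remove optional quotes, then decode percent-escapes such as %20
            filename = unquote(raw.strip().strip('"'))
            return os.path.splitext(filename)[1] or None
    return None


header = 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'
print(extension_from_content_disposition(header))  # .doc
```

With `requests`, the value would presumably come from `r.headers.get('Content-Disposition')`, which is `None` when the server does not send the header.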
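# Answer 3
**Score**: 0
I would first split on `/` to get only the last part of the URL, which I think is the filename (including the extension) of the picture you are downloading, and then split on `.` to separate the filename from the extension.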
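```python
img_url = 'path/to/your/picture.jpg'
split1 = img_url.split('/')  # returns ['path', 'to', 'your', 'picture.jpg']
file = split1[-1]
# Split on the last dot only, so a name such as 'my.picture.jpg' still unpacks into two parts
filename, extension = file.rsplit('.', 1)  # filename is 'picture' and extension is 'jpg'
```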
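Plain string splitting breaks down when the URL carries a query string or has no dot in its last segment. A slightly more defensive sketch of the same idea, using only the standard library, is given here. The helper name `extension_from_url` and the `example.com` URLs are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def extension_from_url(url):
    """Return the extension of the URL's last path segment (e.g. '.jpg'), or ''.

    Hypothetical helper for illustration.
    """
    # urlparse() separates the path from the query string and fragment
    path = unquote(urlparse(url).path)
    # URL paths always use '/', so posixpath is used regardless of the OS
    return posixpath.splitext(posixpath.basename(path))[1]


print(extension_from_url('https://example.com/path/to/picture.jpg?size=large'))  # .jpg
print(extension_from_url('https://example.com/board/view/?id=1'))                # (empty string)
print(extension_from_url('https://example.com/viewimage.php?id=abc'))            # .php
```

The last call shows the limit of this approach: for dynamically generated images the URL describes the script, not the image, so the response headers are usually a better source for the extension.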