January 3, 2020, 16:55:43

How to know download file extension in python?

# Question

This is the source code I use to download jpg images:

    from bs4 import BeautifulSoup
    import requests
    from selenium import webdriver
    import urllib.request
    import os
    import shutil
    from mimetypes import guess_extension

    img_folder = "c:/test"
    if os.path.exists(img_folder):
        shutil.rmtree(img_folder)

    path = r"C:\Users\qpslt\Desktop\py\chromedriver_win32\chromedriver.exe"
    driver = webdriver.Chrome(path)
    site_url = "https://gall.dcinside.com/board/view/?id=baseball_new8&no=10131338&exception_mode=recommend&page=1"
    driver.get(site_url)
    images = driver.find_elements_by_xpath('//div[@class="writing_view_box"]//img')

    for i, img in enumerate(images, 1):
        img_url = img.get_attribute('src')
        print(i, img_url)
        r = requests.get(img_url, headers={'Referer': site_url})
        try:  # create the folder
            if not os.path.exists(img_folder):
                os.makedirs(img_folder)
        except Exception as er:
            print("An error occurred: {}".format(er))
            break
        # The extension is hard-coded as .jpg here.
        with open("c:/test/{}.jpg".format(i), 'wb') as f:
            f.write(r.content)

I don't always know the extension of the file being downloaded.
How can I find out the extension of the file I am downloading?
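# Answer 1
**Score**: 5
If the image link has no extension (for example, if the image is generated dynamically by a PHP script), you can use [`mimetypes.guess_extension()`](https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_extension) to map the `content-type` header of the image response to a file extension.
For example:
```python
import mimetypes
...
r = requests.get(img_url, headers={'Referer': site_url})
# Drop any parameters such as "; charset=..." before looking up the MIME type.
extension = mimetypes.guess_extension(r.headers.get('content-type', '').split(';')[0])
...
with open("c:/test/{}{}".format(i, extension or '.jpg'), 'wb') as f:
    f.write(r.content)
```

The example above uses the mapped extension when one exists and falls back to `.jpg` when there is no mapping (for example, if the `content-type` header is missing or specifies an unknown type).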
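
As a slightly fuller sketch of the same idea, the lookup can be wrapped in a small helper so that it can be checked without a network connection. The helper name `extension_from_content_type` and its `default` parameter are hypothetical and introduced only for illustration; only `mimetypes.guess_extension()` comes from the answer.

```python
import mimetypes


def extension_from_content_type(content_type, default='.jpg'):
    """Map a Content-Type header value to a file extension.

    Hypothetical helper for illustration; not part of requests or mimetypes.
    """
    if not content_type:
        # Header missing or empty: use the fallback extension.
        return default
    # Keep only the media type: "image/png; charset=binary" -> "image/png".
    mime_type = content_type.split(';')[0].strip().lower()
    # guess_extension() returns None for unknown types, so fall back then too.
    return mimetypes.guess_extension(mime_type) or default


if __name__ == '__main__':
    for header in ('image/png', 'image/gif; charset=binary', None):
        print(repr(header), '->', extension_from_content_type(header))
```

In the download loop this would be called as `extension_from_content_type(r.headers.get('content-type'))`. The extension returned for some types (for instance `image/jpeg`) may differ between Python versions and platforms.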

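# Answer 2

**Score**: 1

I ran into the same problem and decided to inspect all the response headers manually (with `headers.get()` on the response, for example `r.headers.get()` in the question's code). One of them actually contained a filename with its extension.

In my case it was `'Content-Disposition': 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'`.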
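
To turn such a header into an extension, the filename has to be pulled out of the header value and percent-decoded. The following is a minimal sketch, assuming the simple `filename=` form shown above. The helper name `filename_from_content_disposition` is hypothetical, and the extended `filename*=` form is deliberately not handled.

```python
import os
from urllib.parse import unquote


def filename_from_content_disposition(value):
    """Return the percent-decoded filename from a Content-Disposition value, or None.

    Hypothetical helper; handles only the plain `filename=` parameter.
    """
    if not value:
        return None
    # The first part is the disposition type ("attachment"); parameters follow.
    for part in value.split(';')[1:]:
        name, _, raw = part.strip().partition('=')
        if name.strip().lower() == 'filename':
            # Remove optional surrounding quotes, then decode %XX escapes.
            return unquote(raw.strip().strip('"'))
    return None


header = 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'
filename = filename_from_content_disposition(header)
extension = os.path.splitext(filename)[1] if filename else ''
print(extension)  # prints ".doc"
```

Not every server sends this header, so it is probably best treated as one source among several, with the `content-type` header or the URL as fallbacks.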

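# Answer 3

**Score**: 0

I would first split on `/` to get only the last part, which I think is the filename (including the extension) of the picture you are downloading, and then split on `.` to separate the filename from the extension.

    img_url = 'path/to/your/picture.jpg'
    split1 = img_url.split('/')  # returns ['path', 'to', 'your', 'picture.jpg']
    file = split1[-1]
    # Split on the last dot only, so names like 'my.picture.jpg' still unpack into two parts.
    filename, extension = file.rsplit('.', 1)  # filename is 'picture' and extension is 'jpg'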
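
A plain split assumes the URL ends in a filename that contains a dot. As a hedged alternative sketch, the standard library can parse the URL first, so that a query string is ignored and a missing extension yields an empty string instead of an error. The helper name `extension_from_url` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def extension_from_url(url):
    """Return the extension of the last path segment of a URL ('' if there is none)."""
    path = urlparse(url).path  # the path only, without query string or fragment
    return posixpath.splitext(posixpath.basename(path))[1]


print(extension_from_url('https://example.com/path/to/picture.jpg?size=large'))  # prints ".jpg"
print(extension_from_url('https://example.com/board/view/'))  # prints an empty line
```

When the URL points at a script (for example a path ending in `.php`), this returns the script's extension, not the image's. In that case the response headers are the more reliable source.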
