How can I find out the extension of a downloaded file in Python?

Question

This is my source code for downloading JPG images:

    from bs4 import BeautifulSoup
    import requests
    from selenium import webdriver
    import urllib.request
    import os
    import shutil
    from mimetypes import guess_extension

    img_folder = "c:/test"

    # Start from an empty download folder
    if os.path.exists(img_folder):
        shutil.rmtree(img_folder)

    path = r"C:\Users\qpslt\Desktop\py\chromedriver_win32\chromedriver.exe"
    driver = webdriver.Chrome(path)

    site_url = "https://gall.dcinside.com/board/view/?id=baseball_new8&no=10131338&exception_mode=recommend&page=1"
    driver.get(site_url)

    images = driver.find_elements_by_xpath('//div[@class="writing_view_box"]//img')

    for i, img in enumerate(images, 1):
        img_url = img.get_attribute('src')
        print(i, img_url)

        r = requests.get(img_url, headers={'Referer': site_url})

        try:  # create the folder
            if not os.path.exists(img_folder):
                os.makedirs(img_folder)
        except Exception as er:
            print("An error occurred: {}".format(er))
            break

        # The extension is hard-coded to .jpg here
        with open("c:/test/{}.jpg".format(i), 'wb') as f:
            f.write(r.content)

I don't always know the extension of the image.

How can I find out the extension of the file I am downloading?
Answer 1

Score: 5

If the image link has no extension (e.g. if the image is generated dynamically by a PHP script), you can map the `content-type` header of the image response to a file extension using [`mimetypes.guess_extension()`](https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_extension).

For example:

```python
import mimetypes
...
r = requests.get(img_url, headers={'Referer': site_url})
# Drop parameters such as "; charset=..." before looking up the type
extension = mimetypes.guess_extension(r.headers.get('content-type', '').split(';')[0])
...
with open("c:/test/{}{}".format(i, extension or '.jpg'), 'wb') as f:
    f.write(r.content)
```

The example above uses the mapped extension when one exists and falls back to `.jpg` when there is no mapping (e.g. if the `content-type` header is missing or specifies an unknown type).
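
The header-to-extension mapping can also be pulled into a small helper so it can be tried without any network access. This is only a sketch: the function name `extension_from_content_type` and its `default` parameter are made up for illustration, and the `.jpg` fallback simply mirrors the one used in the answer.

```python
import mimetypes


def extension_from_content_type(content_type, default=".jpg"):
    """Map a Content-Type header value to a file extension.

    Hypothetical helper for illustration; `default` is returned when
    the header is missing or the type is unknown.
    """
    if not content_type:
        return default
    # "image/png; charset=binary" -> "image/png"
    mime_type = content_type.split(";")[0].strip().lower()
    # guess_extension() returns None when it has no mapping for the type
    return mimetypes.guess_extension(mime_type) or default


if __name__ == "__main__":
    samples = ("image/png", "image/gif; charset=binary", None,
               "application/x-not-a-real-type")
    for header in samples:
        print(repr(header), "->", extension_from_content_type(header))
```

In the download loop it would be called as `extension_from_content_type(r.headers.get('content-type'))`. The exact extension returned for some types (for example `image/jpeg`) can vary between Python versions and systems, so treat the result as a best guess.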

Answer 2

Score: 1

I ran into the same problem and decided to read all the response headers manually (for example with `r.headers.get()`, where `r` is the `requests` response). One of them actually contained a filename with the extension.

In my case it was `'Content-Disposition': 'attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%20%D0%A4%D0%97.doc'`.
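
A rough sketch of how such a `Content-Disposition` value could be turned into a filename and extension using only the standard library. The helper names are hypothetical, and it assumes the server percent-encodes the filename as in the header quoted above; a name that legitimately contains a literal `%` would be decoded incorrectly.

```python
import os
from email.message import Message
from urllib.parse import unquote


def filename_from_content_disposition(header_value):
    """Return the filename from a Content-Disposition value, or None.

    Hypothetical helper for illustration only.
    """
    if not header_value:
        return None
    # Let the email package parse the header parameters
    msg = Message()
    msg["Content-Disposition"] = header_value
    filename = msg.get_filename()
    if not filename:
        return None
    # Assumption: the name is percent-encoded, as in the quoted header
    return unquote(filename)


def extension_from_content_disposition(header_value, default=""):
    """Return the extension (with leading dot), or `default` if none."""
    filename = filename_from_content_disposition(header_value)
    if not filename:
        return default
    return os.path.splitext(filename)[1] or default


if __name__ == "__main__":
    header = ("attachment;filename=%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82"
              "%20%D0%A4%D0%97.doc")
    print(filename_from_content_disposition(header))
    print(extension_from_content_disposition(header))
```

With `requests`, the header value would come from `r.headers.get('Content-Disposition')`. Many responses do not send this header at all, so a fallback is still needed.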

Answer 3

Score: 0

I would first split on `/` to get only the last part, which I think is the filename (including the extension) of the picture you've downloaded, and then split on `.` to separate the filename from the extension. Note that this only works when the URL actually ends with a filename that contains an extension.

    img_url = 'path/to/your/picture.jpg'
    split1 = img_url.split('/')  # returns ['path', 'to', 'your', 'picture.jpg']
    file = split1[-1]
    # Split on the last dot only, so a name with several dots still unpacks into two parts
    filename, extension = file.rsplit('.', 1)  # filename is 'picture' and extension is 'jpg'
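
A slightly more defensive variant of the same idea, sketched with the standard library: parse the URL first so that query strings and fragments are ignored, and return an empty string when the path has no extension. The function name `extension_from_url` and the `example.com` URLs are placeholders for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def extension_from_url(url):
    """Return the extension of the URL's path (with leading dot), or ''.

    Hypothetical helper for illustration only.
    """
    # Keep only the path; the query string and fragment are dropped
    path = unquote(urlsplit(url).path)
    # splitext() splits on the last dot of the final path component
    return posixpath.splitext(posixpath.basename(path))[1]


if __name__ == "__main__":
    print(extension_from_url("https://example.com/path/to/your/picture.jpg?size=large"))
    print(repr(extension_from_url("https://example.com/board/view/?id=1&page=1")))
```

An empty result is the signal to fall back to another source, such as the response headers.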

huangapple
  • Posted on January 3, 2020 at 16:55:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59575587.html