Downloading a Zip file and extracting its content in Python
Question
I have this code to download a zip file from a URL and extract the contents.
But the name of the Excel file changes every month, so duplicates would be created. It is also not possible to predict the name each time new data is published at the URL.
zip_file_url = "https://www.insee.fr/en/statistiques/series/xlsx/famille/102391902"
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
In the end I need to load the spreadsheet in pandas. Can that be done without knowing the name of the spreadsheet within the zip file?
Is it possible to rename the spreadsheet every time so that the file is overwritten and no duplicates are created? Also, how can I load the spreadsheet in pandas in the future without knowing the file name?
So the best way would be to extract the file, save it under the same file name each time, and overwrite the previous version. That way we also know the name of the spreadsheet to load in pandas.
Answer 1
Score: 1
import io
import os
import zipfile

import requests

zip_file_url = "https://www.insee.fr/en/statistiques/series/xlsx/famille/102391902"
r = requests.get(zip_file_url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

# Get the list of file names in the zip archive
zip_file_names = z.namelist()

# List the .xlsx files now present in the current directory
xlsx_files = [f for f in os.listdir() if f.endswith(".xlsx")]

# Two .xlsx files means an older spreadsheet sits next to the new one;
# the older one is the file that did not come from the archive
xlsx_file_name = None
if len(xlsx_files) == 2:
    for name in xlsx_files:
        if name not in zip_file_names:
            xlsx_file_name = name

# Overwrite the existing .xlsx file with the file from the zip archive
if xlsx_file_name and len(zip_file_names) == 1 and zip_file_names[0].endswith(".xlsx"):
    zip_xlsx_file_name = zip_file_names[0]
    os.replace(zip_xlsx_file_name, xlsx_file_name)
    print("File overwritten successfully.")
The idea is that there is exactly one existing .xlsx file in the current directory, so you can get its name by listing the .xlsx files there. There is also exactly one .xlsx file in the zip archive, so you know which file to replace with which.
The check
if len(xlsx_files) == 2:
is useful because the first time you run the script there will be only one .xlsx file in the directory (the one just extracted), so there is nothing to overwrite yet.
I hope this is clear. You may need to adapt this code to your use case.
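As an alternative to renaming after extraction, the spreadsheet can be read out of the archive in memory and written straight to a fixed file name. The following is a minimal sketch, not the approach the answer itself uses. It assumes the archive contains exactly one .xlsx member; the helper name extract_single_xlsx and the default target name latest.xlsx are hypothetical and chosen only for illustration.

```python
import io
import zipfile
from pathlib import Path


def extract_single_xlsx(zip_bytes: bytes, target: str = "latest.xlsx") -> Path:
    """Write the only .xlsx member of a zip archive to a fixed path.

    `target` is a hypothetical fixed file name; any existing file at that
    path is overwritten, so no duplicates accumulate between runs.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Find the spreadsheet without knowing its name in advance
        members = [n for n in archive.namelist() if n.lower().endswith(".xlsx")]
        if len(members) != 1:
            raise ValueError(f"expected exactly one .xlsx file, found {members}")
        data = archive.read(members[0])

    path = Path(target)
    path.write_bytes(data)  # replaces the previous version, if any
    return path
```

With the question's code, this could be called as extract_single_xlsx(r.content), after which pandas.read_excel("latest.xlsx") loads the spreadsheet under a name that never changes (reading .xlsx files in pandas typically requires the openpyxl package to be installed). Because the member is read in memory, the monthly file name from the archive never reaches the disk.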