Downloading a Zip file and extracting its content in Python

Question

I have this code to download a zip file from a URL and extract its contents.
However, the name of the Excel file changes every month, so each run leaves another copy behind and duplicates accumulate. The name cannot be predicted each time new data is published at the URL.

import io
import zipfile

import requests

zip_file_url = "https://www.insee.fr/en/statistiques/series/xlsx/famille/102391902"

r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))

z.extractall()

In the end I need to load the spreadsheet in pandas. Can that be done without knowing the name of the spreadsheet inside the zip file?

Is it possible to rename the spreadsheet on every run, so that the file is overwritten and no duplicates are created? And how can the spreadsheet be loaded in pandas in the future without knowing its file name?

The best approach would therefore be to extract the file, save it under the same file name each time, and overwrite the previous version. That way the name of the spreadsheet to load in pandas is also known.

Answer 1

Score: 1

import io
import os
import zipfile

import requests

zip_file_url = "https://www.insee.fr/en/statistiques/series/xlsx/famille/102391902"

# Download the archive and open it in memory
r = requests.get(zip_file_url)
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))

# Extract the archive contents into the current directory
z.extractall()

# Get the list of file names in the zip archive
zip_file_names = z.namelist()

# List the .xlsx files in the current directory: the previously saved one
# (if any) plus the one that was just extracted
xlsx_files = [f for f in os.listdir() if f.endswith(".xlsx")]
if len(xlsx_files) == 2:
    # The file that did not come from the archive is the previous version
    xlsx_file_name = None
    for name in xlsx_files:
        if name not in zip_file_names:
            xlsx_file_name = name
    # Overwrite the existing .xlsx file with the file from the zip archive
    if (xlsx_file_name is not None and len(zip_file_names) == 1
            and zip_file_names[0].endswith(".xlsx")):
        zip_xlsx_file_name = zip_file_names[0]
        os.replace(zip_xlsx_file_name, xlsx_file_name)
        print("File overwritten successfully.")

The idea is that you know there is exactly one .xlsx file in the current directory before extraction, so you can get its name by listing the .xlsx files in that directory (this is what xlsx_files holds).

You also know that there is exactly one .xlsx file in the zip archive, so you know which file to replace with which.

The check

if len(xlsx_files) == 2:

is useful because the first time you run the script there will be only one .xlsx file in the directory (the one just extracted), so there is nothing to overwrite yet.
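If a fixed output name is acceptable, the first-run special case can be avoided entirely. The following is only a sketch of that variant, not the code from the answer above: it assumes the archive contains exactly one .xlsx member, and the function name save_single_xlsx and the target name latest.xlsx are hypothetical choices made for illustration.

```python
import io
import zipfile
from pathlib import Path


def save_single_xlsx(zip_bytes: bytes, target: Path) -> str:
    """Write the only .xlsx member of the archive to `target`.

    Returns the member's original name inside the archive.
    Assumption: the archive holds exactly one .xlsx file.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        xlsx_names = [n for n in archive.namelist() if n.lower().endswith(".xlsx")]
        if len(xlsx_names) != 1:
            raise ValueError(
                f"expected exactly one .xlsx member, found {len(xlsx_names)}"
            )
        # write_bytes replaces any previous version of the target file,
        # so no duplicates accumulate between runs
        target.write_bytes(archive.read(xlsx_names[0]))
        return xlsx_names[0]
```

With the variables from the question this could be called as save_single_xlsx(r.content, Path("latest.xlsx")), after which pandas can always be pointed at latest.xlsx regardless of the name used inside the archive.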

I hope this is clear. You may need to adapt this code to your use case.
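On the question of loading the spreadsheet in pandas without knowing its name: one possible approach, sketched here under assumptions, is to skip extraction and read the archive member straight from memory. The helper names read_single_xlsx and load_dataframe are hypothetical, the archive is assumed to contain exactly one .xlsx file, and reading .xlsx with pandas.read_excel requires an Excel engine such as openpyxl to be installed.

```python
import io
import zipfile


def read_single_xlsx(zip_bytes: bytes) -> io.BytesIO:
    """Return the only .xlsx member of the archive as an in-memory buffer.

    Assumption: the archive holds exactly one .xlsx file.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        xlsx_names = [n for n in archive.namelist() if n.lower().endswith(".xlsx")]
        if len(xlsx_names) != 1:
            raise ValueError(
                f"expected exactly one .xlsx member, found {len(xlsx_names)}"
            )
        return io.BytesIO(archive.read(xlsx_names[0]))


def load_dataframe(zip_bytes: bytes):
    """Load the spreadsheet into a pandas DataFrame without touching the disk."""
    # Imported here so read_single_xlsx has no third-party dependency
    import pandas as pd

    # read_excel accepts file-like objects, so the member name is never needed
    return pd.read_excel(read_single_xlsx(zip_bytes))
```

With the variables from the question, df = load_dataframe(r.content) would give the first sheet as a DataFrame. Nothing is written to disk, so there are no files to rename or overwrite.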

huangapple
  • Posted on June 1, 2023 at 18:13:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76380853.html