Minimize the size of a file while saving a pandas dataframe
Question
I want to write a pandas dataframe to a file. I have about 200MB of CSV data. Which file extension should I write to so that the file size is the minimum?
I am open to writing in binary as well, since I will only be working with the dataframe.
UPDATE: In my case the compressed zip format worked the best (storage-wise). But run-time-wise, the pickle format (.pkl) was read and saved the fastest. I have not tried parquet and feather due to the additional dependencies they require.
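For reference, a minimal sketch of how the read/save times mentioned above can be measured (the file names are placeholders; pandas is assumed to be imported and df to already hold the data):
import time
import pandas as pd

def time_io(df):
    # Zipped CSV: smallest on disk in this case
    t0 = time.perf_counter()
    df.to_csv('data.csv.zip', compression='zip')
    t1 = time.perf_counter()
    pd.read_csv('data.csv.zip')
    t2 = time.perf_counter()
    # Pickle: fastest to save and load in this case
    df.to_pickle('data.pkl')
    t3 = time.perf_counter()
    pd.read_pickle('data.pkl')
    t4 = time.perf_counter()
    print(f'zip csv  write {t1 - t0:.2f}s  read {t2 - t1:.2f}s')
    print(f'pickle   write {t3 - t2:.2f}s  read {t4 - t3:.2f}s')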
Answer 1
Score: 1
You can simply compress your CSV by using the .zip extension instead of .csv:
# A zip archive with only one file
df.to_csv('export.zip')
# Or to get more control
df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})
# You can read the file with
df = pd.read_csv('export.zip')
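To see how much the zip actually saves on disk, a quick size check along these lines can be used (a sketch; 'export.csv' is just an uncompressed counterpart of the file above):
import os

df.to_csv('export.csv')                     # uncompressed baseline
df.to_csv('export.zip', compression='zip')  # zipped variant from above
for path in ('export.csv', 'export.zip'):
    print(path, os.path.getsize(path), 'bytes')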
Answer 2
Score: 0
Writing to a Parquet file may be a good option. It requires either the pyarrow or fastparquet library. See the pandas documentation.
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet.gzip', compression='gzip')
pd.read_parquet('df.parquet.gzip')
Parquet files can achieve high compression rates.
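If size matters most, it may be worth comparing the compression codecs that to_parquet accepts ('snappy', 'gzip', 'brotli'); a rough sketch, assuming pyarrow is installed, df holds your data, and brotli support depends on the pyarrow build:
import os

for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')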
Answer 3
Score: 0
If you are saving your data to CSV files, then pandas already has a built-in compression keyword (see the to_csv docs). You can use it like this:
df.to_csv("my_data.csv.zip", compression="zip")
Answer 4
Score: 0
I created a test data frame which has a pseudo-panel-like format. Obviously, the extent of your compression etc. will always depend on your data. If your data are literally the same thing repeated over and over again, compression ratios will be high. If your data never repeat, compression ratios will be low.
To get answers for your data, take a sample of it with df.sample(10_000) (or something like that) and execute code like mine below, which saves it in different formats. Then compare the sizes.
import random
import pandas as pd

df = pd.DataFrame({
    'd': range(0, 10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(0, 10_000)],
    'i': [random.randint(0, 1000) for _ in range(0, 10_000)]
})
I then queried the size on disk of the following save formats.
from os.path import getsize

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']
cr.sort_values('ratio', inplace=True)
That yielded the following table:
            name    size     ratio
7    temp.pkl.xz   22532  0.110395
8   temp.pkl.bz2   23752  0.116372
13   temp.dta.xz   39276  0.192431
6    temp.pkl.gz   40619  0.199011
2    temp.csv.xz   42332  0.207404
14  temp.dta.bz2   51694  0.253273
3   temp.csv.bz2   54801  0.268495
12   temp.dta.gz   57513  0.281783
1    temp.csv.gz   70219  0.344035
4   temp.csv.zip   70837  0.347063
11      temp.dta  170912  0.837377
5       temp.pkl  180865  0.886141
0       temp.csv  204104  1.000000
10     temp.xlsx  216828  1.062341
9       temp.xls  711168  3.484341
I did not try to_parquet or to_feather because they require the pyarrow dependency, which is non-standard in Anaconda.
Running the export to Excel 2003's format threw a warning that xlwt is no longer maintained and will be removed. Given how huge that format's files are, that is no major loss.
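If pyarrow does happen to be installed, the same loop can be extended to include parquet and feather before building the cr table; a sketch, guarded so it is simply skipped when the dependency is missing:
try:
    import pyarrow  # only to check availability
    df.to_parquet('temp.parquet', compression='gzip')
    l.append({'name': 'temp.parquet', 'size': getsize('temp.parquet')})
    df.to_feather('temp.feather')
    l.append({'name': 'temp.feather', 'size': getsize('temp.feather')})
except ImportError:
    pass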
Answer 5
Score: -1
Using the standard Pandas library, pickle binary is the way to go. For more detail, you might find the following video useful:
https://www.youtube.com/watch?v=u4rsA5ZiTls&t=150s
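For completeness, a minimal pickle round trip looks like this (the file names are placeholders; to_pickle can also compress based on the extension):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_pickle('data.pkl')        # plain binary pickle
df.to_pickle('data.pkl.xz')     # compression inferred from the extension
restored = pd.read_pickle('data.pkl')
print(restored.equals(df))      # True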