Minimize the size of a file while saving a pandas dataframe
Question
I want to write a pandas dataframe to a file. I have about 200MB of CSV data. Which file extension should I write to so that the file size is the minimum?
I am open to writing in binary as well, since I will only be working with the dataframe.
UPDATE: In my case the compressed zip format worked the best (storage-wise). But run-time-wise, the pickle format (.pkl) was read and saved the fastest. I have not tried parquet and feather due to the additional dependencies they require.
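For reference, a minimal sketch of how the read/save times mentioned above can be measured (the file names are placeholders; pandas is assumed to be imported and df to already hold the data):
import time
import pandas as pd

def time_io(df):
    # Zipped CSV: smallest on disk in this case
    t0 = time.perf_counter()
    df.to_csv('data.csv.zip', compression='zip')
    t1 = time.perf_counter()
    pd.read_csv('data.csv.zip')
    t2 = time.perf_counter()
    # Pickle: fastest to save and load in this case
    df.to_pickle('data.pkl')
    t3 = time.perf_counter()
    pd.read_pickle('data.pkl')
    t4 = time.perf_counter()
    print(f'zip csv  write {t1 - t0:.2f}s  read {t2 - t1:.2f}s')
    print(f'pickle   write {t3 - t2:.2f}s  read {t4 - t3:.2f}s')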
Answer 1
Score: 1
You can simply compress your CSV by using the .zip extension instead of .csv:
# A zip archive with only one file
df.to_csv('export.zip')
# Or to get more control
df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})
# You can read the file with
df = pd.read_csv('export.zip')
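To see how much the zip actually saves on disk, a quick size check along these lines can be used (a sketch; 'export.csv' is just an uncompressed counterpart of the file above):
import os

df.to_csv('export.csv')                     # uncompressed baseline
df.to_csv('export.zip', compression='zip')  # zipped variant from above
for path in ('export.csv', 'export.zip'):
    print(path, os.path.getsize(path), 'bytes')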
Answer 2
Score: 0
Writing to a Parquet file may be a good option. It requires either the pyarrow or fastparquet library. See the pandas documentation.
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet.gzip', compression='gzip')
pd.read_parquet('df.parquet.gzip')
Parquet files can achieve high compression rates.
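If size matters most, it may be worth comparing the compression codecs that to_parquet accepts ('snappy', 'gzip', 'brotli'); a rough sketch, assuming pyarrow is installed, df holds your data, and brotli support depends on the pyarrow build:
import os

for codec in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{codec}'
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), 'bytes')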
Answer 3
Score: 0
If you are saving your data to CSV files, then pandas already has a built-in compression keyword (see the to_csv docs). You can use it like this:
df.to_csv("my_data.csv.zip", compression="zip")
Answer 4
Score: 0
I created a test data frame which has a pseudo-panel-like format. Obviously, the extent of your compression etc. will always depend on your data. If your data are literally the same thing repeated over and over again, compression ratios will be high. If your data never repeat, compression ratios will be low.
To get answers for your data, take a sample of it with df.sample(10_000) (or something like that) and execute code like mine below, which saves it in different formats. Then compare the sizes.
import random
import pandas as pd

df = pd.DataFrame({
    'd': range(0, 10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(0, 10_000)],
    'i': [random.randint(0, 1000) for _ in range(0, 10_000)]
})
I then queried the size on disk of the following save formats.
from os.path import getsize

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})
for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']
cr.sort_values('ratio', inplace=True)
That yielded the following table:
            name    size     ratio
7    temp.pkl.xz   22532  0.110395
8   temp.pkl.bz2   23752  0.116372
13   temp.dta.xz   39276  0.192431
6    temp.pkl.gz   40619  0.199011
2    temp.csv.xz   42332  0.207404
14  temp.dta.bz2   51694  0.253273
3   temp.csv.bz2   54801  0.268495
12   temp.dta.gz   57513  0.281783
1    temp.csv.gz   70219  0.344035
4   temp.csv.zip   70837  0.347063
11      temp.dta  170912  0.837377
5       temp.pkl  180865  0.886141
0       temp.csv  204104  1.000000
10     temp.xlsx  216828  1.062341
9       temp.xls  711168  3.484341
I did not try to_parquet or to_feather because they require the pyarrow dependency, which is non-standard in Anaconda.
Running the export to Excel 2003's format threw a warning that xlwt is no longer maintained and will be removed. Given how huge that format's files are, that is no major loss.
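If pyarrow does happen to be installed, the same loop can be extended to include parquet and feather before building the cr table; a sketch, guarded so it is simply skipped when the dependency is missing:
try:
    import pyarrow  # only to check availability
    df.to_parquet('temp.parquet', compression='gzip')
    l.append({'name': 'temp.parquet', 'size': getsize('temp.parquet')})
    df.to_feather('temp.feather')
    l.append({'name': 'temp.feather', 'size': getsize('temp.feather')})
except ImportError:
    pass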
Answer 5
Score: -1
Using the standard Pandas library, pickle binary is the way to go. For more detail, you might find the following video useful:
https://www.youtube.com/watch?v=u4rsA5ZiTls&t=150s
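For completeness, a minimal pickle round trip looks like this (the file names are placeholders; to_pickle can also compress based on the extension):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_pickle('data.pkl')        # plain binary pickle
df.to_pickle('data.pkl.xz')     # compression inferred from the extension
restored = pd.read_pickle('data.pkl')
print(restored.equals(df))      # True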