Minimize the size of a file while saving a pandas dataframe

Question

I want to write a pandas dataframe to a file. I have about 200 MB of CSV data. Which file extension should I write to so that the file size is minimal?

I am open to writing in binary as well, since I will only be using the dataframe for my own work.

UPDATE: In my case the compressed zip format worked best storage-wise. Run-time-wise, however, the pickle format (.pkl) was the fastest to read and save. I have not tried parquet or feather because of the additional dependencies they require.
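For concreteness, a minimal sketch of the two options mentioned in the update; the sample dataframe and file names are placeholders, not part of the original question:

    import pandas as pd

    # Hypothetical sample data standing in for the real ~200 MB CSV
    df = pd.DataFrame({'id': range(100_000), 'label': ['a', 'b', 'c', 'd'] * 25_000})

    # Smallest file in the tests above: zipped CSV (compression inferred from the extension)
    df.to_csv('data.csv.zip')

    # Fastest round trip in the tests above: pickle
    df.to_pickle('data.pkl')
    df = pd.read_pickle('data.pkl')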

Answer 1

Score: 1

You can simply compress your CSV by using the .zip extension instead of .csv:

# A zip archive with only one file
df.to_csv('export.zip')

# Or, for more control
df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})

# You can read the file back with
df = pd.read_csv('export.zip')
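To see how much the zip actually saves, a quick size check with os.path.getsize is enough; a minimal sketch using made-up sample data, not part of the original answer:

    import os
    import pandas as pd

    df = pd.DataFrame({'x': range(100_000), 'y': ['some repeated text'] * 100_000})

    df.to_csv('export.csv')
    df.to_csv('export.zip', compression={'method': 'zip', 'compresslevel': 9})

    print(os.path.getsize('export.csv'), os.path.getsize('export.zip'))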

Answer 2

Score: 0

Writing to a parquet file may be a good option. It requires either the pyarrow or the fastparquet library. See the documentation here.

    df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
    df.to_parquet('df.parquet.gzip',
                  compression='gzip')
    pd.read_parquet('df.parquet.gzip')

Parquet files can achieve high compression rates.
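If both backends are installed, the backend can also be selected explicitly; a minimal sketch, assuming pyarrow is available, and not part of the original answer:

    import pandas as pd

    df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

    # 'engine' picks the backend ('pyarrow' or 'fastparquet');
    # 'snappy' is the default codec, 'gzip' usually produces smaller but slower files
    df.to_parquet('df.parquet', engine='pyarrow', compression='gzip')
    df2 = pd.read_parquet('df.parquet', engine='pyarrow')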

Answer 3

Score: 0

If you are saving your data to CSV files, pandas already has a built-in compression keyword (doc).

You can use it like this:

df.to_csv("my_data.csv.zip", compression="zip")

Answer 4

Score: 0

I created a test data frame which has a pseudo-panel-like format. Obviously, the extent of your compression etc. will always depend on your data. If your data are literally the same thing repeated over and over again, compression ratios will be high. If your data never repeat, compression ratios will be low.

To get answers for your data, take a sample of it with df.sample(10_000) (or something like that), save it in different formats with code like mine below, and then compare the sizes.

import random
import pandas as pd
from os.path import getsize

df = pd.DataFrame({
    'd': range(0, 10_000),
    's': [random.choice(['alpha', 'beta', 'gamma', 'delta'])
          for _ in range(0, 10_000)],
    'i': [random.randint(0, 1000) for _ in range(0, 10_000)]
})

I then queried the size of each of the following save formats.

l = []
for p in ['.csv', '.csv.gz', '.csv.xz', '.csv.bz2', '.csv.zip']:
    df.to_csv('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

for p in ['.pkl', '.pkl.gz', '.pkl.xz', '.pkl.bz2']:
    df.to_pickle('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

for p in ['.xls', '.xlsx']:
    df.to_excel('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

for p in ['.dta', '.dta.gz', '.dta.xz', '.dta.bz2']:
    df.to_stata('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})

cr = pd.DataFrame(l)
cr['ratio'] = cr['size'] / cr.loc[0, 'size']
cr.sort_values('ratio', inplace=True)

That yielded the following table:

            name    size     ratio
7    temp.pkl.xz   22532  0.110395
8   temp.pkl.bz2   23752  0.116372
13   temp.dta.xz   39276  0.192431
6    temp.pkl.gz   40619  0.199011
2    temp.csv.xz   42332  0.207404
14  temp.dta.bz2   51694  0.253273
3   temp.csv.bz2   54801  0.268495
12   temp.dta.gz   57513  0.281783
1    temp.csv.gz   70219  0.344035
4   temp.csv.zip   70837  0.347063
11      temp.dta  170912  0.837377
5       temp.pkl  180865  0.886141
0       temp.csv  204104  1.000000
10     temp.xlsx  216828  1.062341
9       temp.xls  711168  3.484341

I did not try to_parquet or to_feather because they require the pyarrow dependency, which is not part of a standard Anaconda installation.

Exporting to the Excel 2003 format threw a warning that xlwt is no longer maintained and will be removed. Given how huge the resulting file is, that is no great loss.
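If pyarrow does happen to be installed, the same loop pattern could be extended to the formats skipped above; a hedged sketch that reuses df, l, and getsize from the code above and was not part of the original benchmark:

# Not run in the benchmark above; requires pyarrow
for p, writer in [('.parquet', df.to_parquet), ('.feather', df.to_feather)]:
    writer('temp' + p)
    l.append({'name': 'temp' + p, 'size': getsize('temp' + p)})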

Answer 5

Score: -1

Using the standard pandas library, a pickle binary file is the way to go. For more detail, you might find the following video useful:

https://www.youtube.com/watch?v=u4rsA5ZiTls&t=150s
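A minimal sketch of the pickle round trip this answer refers to; the compressed variant is an addition here, not something the answer mentions:

    import pandas as pd

    df = pd.DataFrame({'a': range(1000), 'b': ['x', 'y'] * 500})

    # Plain pickle (fast, but not compressed)
    df.to_pickle('data.pkl')

    # Pickle plus xz compression, inferred from the extension
    df.to_pickle('data.pkl.xz')

    restored = pd.read_pickle('data.pkl.xz')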
