Merge small Parquet files into a single large Parquet file

Question

I have been trying to merge small Parquet files, each with about 10,000 rows; for each set the number of small files is 60-100, so the merged Parquet file ends up with at least around 600,000 rows.

I have been using pandas concat, and it works fine when merging around 10-15 small files.

But since a set may consist of 50-100 files, the Python script gets killed while running because the memory limit is exceeded.

So I am looking for a memory-efficient way to merge any number of small Parquet files, up to sets of around 100 files.

I used pandas read_parquet to read each individual DataFrame and combined them all with pd.concat.
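
Roughly, the current approach looks like this (the directory and file names are hypothetical):

import glob
import pandas as pd

# Read every small file fully into memory, then concatenate them all at
# once -- this keeps all ~600k rows (plus intermediate copies) in RAM.
files = sorted(glob.glob("small_files/*.parquet"))
merged = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
merged.to_parquet("merged.parquet")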

Is there a better library than pandas, or, if it is possible in pandas, how can this be done efficiently? Time is not a constraint; it can also run for quite a long time.


Answer 1

Score: 1

You can open the files one by one and append them to the output Parquet file. It is best to use pyarrow for this, since only one small file needs to be held in memory at a time.

import pyarrow.parquet as pq

# Small files to merge; all of them must share the same schema.
files = ["table1.parquet", "table2.parquet"]

# Take the Arrow schema from the first file and stream each file into a
# single output file, one table at a time.
with pq.ParquetWriter("output.parquet", schema=pq.ParquetFile(files[0]).schema_arrow) as writer:
    for file in files:
        writer.write_table(pq.read_table(file))
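
To merge a whole set of 60-100 files, the file list can be built with glob instead of listing names by hand (the directory name is hypothetical):

import glob

# Collect every small Parquet file of one set, in a stable order.
files = sorted(glob.glob("set_01/*.parquet"))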

Answer 2

Score: 0

For large data you should definitely use the PySpark library; split the data into smaller pieces if possible, and then use Pandas.
PySpark is very similar to Pandas.

link
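
A minimal PySpark sketch of this idea, assuming the small files sit in a directory named small_files/ (paths are hypothetical); Spark plans the read lazily and can spill to disk, so the merge does not have to fit in memory at once:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read all small files; Spark does not load everything into memory up front.
df = spark.read.parquet("small_files/*.parquet")

# coalesce(1) produces a single part file; the output is a directory
# containing one part-*.parquet file.
df.coalesce(1).write.mode("overwrite").parquet("merged_parquet")

spark.stop()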

