Merge small parquet files into a single large parquet file

Question

I have been trying to merge small Parquet files, each with 10,000 rows; for each set the number of small files will be 60-100, so the merged Parquet file ends up with at least around 600,000 rows.

I have been using pandas concat, and it works fine when merging around 10-15 small files.

But since a set may consist of 50-100 files, the process gets killed while running the Python script because the memory limit is breached.

So I am looking for a memory-efficient way to merge any number of small Parquet files, in the range of a 100-file set.

I used pandas read_parquet to read each individual DataFrame and combined them with pd.concat(all dataframes).

Is there a better library than pandas, or, if possible, how can this be done efficiently in pandas? Time is not a constraint; it can run for quite a long time as well.
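For reference, a minimal sketch of the approach described above (the directory and file names are hypothetical); every file is loaded into memory before the concat, which is what breaches the memory limit for 50-100 files:

import glob
import pandas as pd

# Hypothetical directory of small Parquet files
files = glob.glob("small_parquet/*.parquet")

# Each file is read fully into memory, then all DataFrames are concatenated at once
dfs = [pd.read_parquet(f) for f in files]
merged = pd.concat(dfs, ignore_index=True)
merged.to_parquet("merged.parquet")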

Answer 1

Score: 1

You can open the files one by one and append them to the Parquet file. It is best to use pyarrow for this.

import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]

# Reuse the schema of the first file and append each file's table to a single
# output file, so only one small file is held in memory at a time.
with pq.ParquetWriter("output.parquet", schema=pq.ParquetFile(files[0]).schema_arrow) as writer:
    for file in files:
        writer.write_table(pq.read_table(file))
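If memory is still tight, a variation on the same idea (an assumption beyond the original answer, requiring pyarrow 3.0+ and input files that share one schema; the output file name is illustrative) is to stream row groups instead of whole files:

import pyarrow as pa
import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]
schema = pq.ParquetFile(files[0]).schema_arrow

with pq.ParquetWriter("output_streamed.parquet", schema=schema) as writer:
    for path in files:
        pf = pq.ParquetFile(path)
        # iter_batches yields RecordBatches a chunk at a time, so only a small
        # slice of each input file is held in memory at once.
        for batch in pf.iter_batches():
            writer.write_table(pa.Table.from_batches([batch]))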

Answer 2

Score: 0

For large data you should definitely use the PySpark library; if possible, split the data into smaller chunks and then use pandas. PySpark is very similar to pandas.

link
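A hedged sketch of how that might look with PySpark (directory names are hypothetical; coalesce(1) forces a single output file and assumes the merged data fits in one task):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read every small Parquet file in the directory as one DataFrame
df = spark.read.parquet("small_parquet_dir/")

# coalesce(1) writes a single output file; drop it if several larger files are acceptable
df.coalesce(1).write.mode("overwrite").parquet("merged_output/")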
