Can you query a Parquet file using the Polars Python API?

Question

I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.

For example, I might have a column called "name" in that .parquet file and want to get back the first (or all) of the rows with a chosen name.

How can I query a parquet file like this in the Polars API, or possibly FastParquet (whichever is faster)?

I thought pl.scan_parquet might be helpful but realised it didn't seem so, or I just didn't understand it. Preferably, though it is not essential, we would not have to read the entire file into memory first, to reduce memory and CPU usage.

I thank you for your help.

Answer 1

Score: 1

Speaking for fastparquet...

Fastparquet is a library for quickly loading parquet data into a pandas dataframe. You didn't say what query you wanted to run on it, but that would be up to pandas (and probably quite fast). Fastparquet does allow a number of options at the loading stage, for instance to filter values, pick columns, or choose dtypes. These can all make a significant difference to load time, but they also affect what queries you can then run. Without knowing the latter, we cannot advise on the former (and polars would agree).
