How can I query parquet files with the Polars Python API?
Question
I have a `.parquet` file, and would like to use Python to quickly and efficiently query that file by a column.
For example, I might have a column `name` in that `.parquet` file and want to get back the first (or all of) the rows with a chosen name.
How can I query a parquet file like this in the Polars API, or possibly FastParquet (whichever is faster)?
I thought `pl.scan_parquet` might be helpful but realised it didn't seem so, or I just didn't understand it. Preferably, though it is not essential, we would not have to read the entire file into memory first, to reduce memory and CPU usage.
I thank you for your help.
Answer 1
Score: 1
Speaking for fastparquet...
Fastparquet is a library for quickly loading parquet data into a pandas dataframe. You didn't say what query you wanted to run on it, but that would be up to pandas (and probably quite fast). Fastparquet does allow a number of options in the loading stage, for instance to filter values, pick columns or choose dtypes. These can all make a significant difference to load time, but they will affect what queries you can then do. Without knowing the latter, we cannot advise on the former (and polars would agree).