Pandas/Dask read_parquet columns case insensitive

Question

Can I pass a columns argument to pd.read_parquet() that filters columns case-insensitively? I have files with the same columns, but some are camel case, some are all caps, and some are lowercase. It is a mess, and I can't read all the columns and filter afterwards, and sometimes I have to read directly into pandas.

I know read_csv has a usecols argument that can be a callable, so when the files are CSVs I can do this: pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)

But the read_parquet columns argument can't be a callable, so how can I do something similar?
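
For reference, here is a minimal sketch of the CSV version that already works (cols holds the lowercase target names; data.csv is just a placeholder path):

import pandas as pd

# lowercase names we want, regardless of how the file capitalizes them
cols = {"col1", "col3"}

# usecols accepts a callable, so the match can be made case-insensitive
df = pd.read_csv("data.csv", usecols=lambda col: col.lower() in cols)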

Answer 1

Score: 2

This is only a workaround, but what you can do is use dask to lazy-load the Parquet file, inspect the column list, pick the columns of interest, and then do the actual load (or continue in the lazy fashion).

Here's a rough sketch:

from dask.dataframe import read_parquet

# lowercase names of the columns we actually want
cols = {"col1", "col3"}

# lazy-load first, just to discover the column names
ddf = read_parquet("some_parquet")

# select columns case-insensitively
cols_of_interest = [c for c in ddf.columns if c.lower() in cols]

# continue with the dask.dataframe, reading only those columns
ddf = read_parquet("some_parquet", columns=cols_of_interest)

# or convert to pandas, if necessary
df = ddf.compute()
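
If this comes up often, the two-pass read can be wrapped in a small helper. This is a minimal sketch, not part of dask itself; the name read_parquet_ci and the wanted argument are made up here, and the wanted names are assumed to be lowercase:

import dask.dataframe as dd

def read_parquet_ci(path, wanted):
    """Read a Parquet file, keeping only columns whose lowercased name is in wanted."""
    wanted = {w.lower() for w in wanted}
    # first lazy pass: discover the column names without loading data
    all_cols = dd.read_parquet(path).columns
    keep = [c for c in all_cols if c.lower() in wanted]
    # second pass: load only the matching columns and materialize to pandas
    return dd.read_parquet(path, columns=keep).compute()

# usage (hypothetical file and column names)
# df = read_parquet_ci("some_parquet", ["col1", "col3"])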

Answer 2

Score: 2

You can use pyarrow:

import pandas as pd
import pyarrow.parquet as pq

# read only the file metadata, not the data itself
metadata = pq.read_metadata('data.parquet')

# lowercase names of the columns we want
cols = ['col1', 'col3']
cols = [c for c in metadata.schema.names if c.lower() in cols]

df = pd.read_parquet('data.parquet', columns=cols)

# normalize the column names afterwards
df.columns = df.columns.str.lower()

Output:

>>> metadata.schema.names
['COL1', 'col2', 'Col3']

>>> df
        col1      col3
0   9.451444  8.799611
1   3.805668  9.194838
2   1.643645  5.300303
3   4.782400  0.301559
4   8.264088  9.652009
..       ...       ...
95  0.248484  2.904245
96  3.572653  6.826785
97  3.063543  8.223073
98  2.060533  9.996808
99  5.724856  3.476133

[100 rows x 2 columns]
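
The same idea also works without the full metadata object: pyarrow.parquet.read_schema reads just the schema from the file footer. A minimal sketch along those lines, with data.parquet again standing in for one of the mixed-case files:

import pandas as pd
import pyarrow.parquet as pq

wanted = {'col1', 'col3'}

# read_schema only touches the file footer, so this stays cheap
schema = pq.read_schema('data.parquet')
cols = [name for name in schema.names if name.lower() in wanted]

df = pd.read_parquet('data.parquet', columns=cols)
df.columns = df.columns.str.lower()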
