Pandas/Dask read_parquet columns case insensitive

Question

Can I pass a columns argument to pd.read_parquet() that filters columns case-insensitively? I have files with the same columns, but some are camel case, some are all caps, and some are lowercase. It is a mess, and I can't read all the columns and filter afterwards, and sometimes I have to read directly into pandas.

I know read_csv has a usecols argument that can be a callable, so when the files are CSVs I can do this: pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)

But the read_parquet columns argument can't be a callable, so how can I do something similar?
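
For reference, here is a minimal sketch of the CSV version that already works (cols holds the lowercase target names; data.csv is just a placeholder path):

import pandas as pd

# lowercase names we want, regardless of how the file capitalizes them
cols = {"col1", "col3"}

# usecols accepts a callable, so the match can be made case-insensitive
df = pd.read_csv("data.csv", usecols=lambda col: col.lower() in cols)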

Answer 1

Score: 2

This is only a workaround, but what you can do is use dask to lazy-load the Parquet file, inspect the column list, pick the columns of interest, and then do the actual load (or continue in the lazy fashion).

Here's a rough sketch:

from dask.dataframe import read_parquet

# lowercase names of the columns we actually want
cols = {"col1", "col3"}

# lazy-load first, just to discover the column names
ddf = read_parquet("some_parquet")

# select columns case-insensitively
cols_of_interest = [c for c in ddf.columns if c.lower() in cols]

# continue with the dask.dataframe, reading only those columns
ddf = read_parquet("some_parquet", columns=cols_of_interest)

# or convert to pandas, if necessary
df = ddf.compute()
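
If this comes up often, the two-pass read can be wrapped in a small helper. This is a minimal sketch, not part of dask itself; the name read_parquet_ci and the wanted argument are made up here, and the wanted names are assumed to be lowercase:

import dask.dataframe as dd

def read_parquet_ci(path, wanted):
    """Read a Parquet file, keeping only columns whose lowercased name is in wanted."""
    wanted = {w.lower() for w in wanted}
    # first lazy pass: discover the column names without loading data
    all_cols = dd.read_parquet(path).columns
    keep = [c for c in all_cols if c.lower() in wanted]
    # second pass: load only the matching columns and materialize to pandas
    return dd.read_parquet(path, columns=keep).compute()

# usage (hypothetical file and column names)
# df = read_parquet_ci("some_parquet", ["col1", "col3"])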

Answer 2

Score: 2

You can use pyarrow:

import pandas as pd
import pyarrow.parquet as pq

# read only the file metadata, not the data itself
metadata = pq.read_metadata('data.parquet')

# lowercase names of the columns we want
cols = ['col1', 'col3']
cols = [c for c in metadata.schema.names if c.lower() in cols]

df = pd.read_parquet('data.parquet', columns=cols)

# normalize the column names afterwards
df.columns = df.columns.str.lower()

Output:

>>> metadata.schema.names
['COL1', 'col2', 'Col3']

>>> df
        col1      col3
0   9.451444  8.799611
1   3.805668  9.194838
2   1.643645  5.300303
3   4.782400  0.301559
4   8.264088  9.652009
..       ...       ...
95  0.248484  2.904245
96  3.572653  6.826785
97  3.063543  8.223073
98  2.060533  9.996808
99  5.724856  3.476133

[100 rows x 2 columns]
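
The same idea also works without the full metadata object: pyarrow.parquet.read_schema reads just the schema from the file footer. A minimal sketch along those lines, with data.parquet again standing in for one of the mixed-case files:

import pandas as pd
import pyarrow.parquet as pq

wanted = {'col1', 'col3'}

# read_schema only touches the file footer, so this stays cheap
schema = pq.read_schema('data.parquet')
cols = [name for name in schema.names if name.lower() in wanted]

df = pd.read_parquet('data.parquet', columns=cols)
df.columns = df.columns.str.lower()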
