Pandas/Dask read_parquet columns case insensitive
Question
Can I have a columns argument on pd.read_parquet() that filters columns but is case insensitive? I have files with the same columns, but some are camel case, some are all capitals, and some are lowercase. It is a mess, I can't read all the columns and filter afterwards, and sometimes I have to read directly into pandas.
I know read_csv has a usecols argument that can be a callable, so when the files are CSVs I can do this: pd.read_csv(filepath, usecols=lambda col: col.lower() in cols)
But the columns argument of read_parquet can't be a callable. How can I do something similar?
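For reference, a minimal runnable sketch of the CSV approach described above, assuming cols holds the lowercase names to keep and data.csv is a hypothetical file with mixed-case headers:

import pandas as pd

# hypothetical target column names, all lowercase
cols = ['col1', 'col3']

# usecols accepts a callable that gets each header and returns True to keep it,
# so casing differences in the file don't matter
df = pd.read_csv('data.csv', usecols=lambda col: col.lower() in cols)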
Answer 1
Score: 2
This is a workaround only, but what one can do is use dask to lazy-load the parquet file, inspect the column list, pick the columns of interest, and then do the actual load (or continue in the lazy fashion).
Here's the rough pseudocode:
from dask.dataframe import read_parquet

# lazy "read": only the metadata is loaded at this point
ddf = read_parquet("some_parquet")

# cols is the list of lowercase column names of interest (as in the question)
cols_of_interest = [c for c in ddf.columns if c.lower() in cols]

# continue with the dask.dataframe, now restricted to the selected columns
ddf = read_parquet("some_parquet", columns=cols_of_interest)

# or convert to pandas, if necessary
df = ddf.compute()
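If this has to happen for many files, the same two-step dask read can be wrapped in a small helper. A minimal sketch, assuming wanted is the lowercase list of names you care about (read_parquet_usecols is a hypothetical name, not part of dask's API):

import dask.dataframe as dd

def read_parquet_usecols(path, wanted):
    """Hypothetical helper: keep only columns whose lowercased name is in `wanted`."""
    # first pass only reads metadata, so it stays cheap
    all_cols = dd.read_parquet(path).columns
    keep = [c for c in all_cols if c.lower() in wanted]
    # second pass loads just the selected columns
    return dd.read_parquet(path, columns=keep)

# usage: materialize to pandas and normalize the casing
df = read_parquet_usecols("some_parquet", ["col1", "col3"]).compute()
df.columns = df.columns.str.lower()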
Answer 2
Score: 2
You can use pyarrow:
import pandas as pd
import pyarrow.parquet as pq

# read only the parquet metadata to discover the file's real column names
metadata = pq.read_metadata('data.parquet')

# keep the columns whose lowercased name is wanted, preserving the file's casing
cols = ['col1', 'col3']
cols = [c for c in metadata.schema.names if c.lower() in cols]

df = pd.read_parquet('data.parquet', columns=cols)
df.columns = df.columns.str.lower()
Output:
>>> metadata.schema.names
['COL1', 'col2', 'Col3']
>>> df
col1 col3
0 9.451444 8.799611
1 3.805668 9.194838
2 1.643645 5.300303
3 4.782400 0.301559
4 8.264088 9.652009
.. ... ...
95 0.248484 2.904245
96 3.572653 6.826785
97 3.063543 8.223073
98 2.060533 9.996808
99 5.724856 3.476133
[100 rows x 2 columns]
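The same idea can be packaged as a small reusable function. A minimal sketch, assuming the lowercase target names are passed in (read_parquet_ci is a hypothetical name; pq.read_schema only reads the file footer, so the extra pass is cheap):

import pandas as pd
import pyarrow.parquet as pq

def read_parquet_ci(path, wanted):
    """Hypothetical helper: case-insensitive column selection for a parquet file."""
    # inspect the schema to learn the actual column names
    names = pq.read_schema(path).names
    keep = [c for c in names if c.lower() in wanted]
    df = pd.read_parquet(path, columns=keep)
    # normalize the casing so downstream code sees consistent names
    df.columns = df.columns.str.lower()
    return df

df = read_parquet_ci('data.parquet', ['col1', 'col3'])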