Pandas: speed up dataframe column conversion?
Question
I am trying to write a utility function that converts the columns of a generic Pandas DataFrame based on column specification metadata.
def convert_pandas_columns(df: pd.DataFrame, metadata: dict | None) -> pd.DataFrame:
    """
    Convert a pandas DataFrame `df` columns starting from a metadata dict `cols`:
    ```
    {
        "columns": {
            "col1": {
                "type": "str"
            },
            "col2": {
                "type": "datetime"
            },
            "col3": {
                "type": "float"
            }
        }
    }
    ```
    """
    if metadata is None or 'columns' not in metadata:
        # Skip conversion if no metadata is available
        return df
    mappings = {
        'str': lambda x: x.astype(str, errors='ignore'),
        'int': lambda x: pd.to_numeric(x, downcast='integer', errors='coerce'),
        'float': lambda x: pd.to_numeric(x, downcast='float', errors='coerce'),
        'bool': lambda x: x.astype(bool, errors='ignore'),
        'datetime': lambda x: pd.to_datetime(x, errors='coerce'),
        'list': lambda x: json.loads(x) if x else [],  # always expect a json object
    }
    dtypes = {}
    for k, v in metadata['columns'].items():
        pd_type = mappings.get(v['type'], None)
        if pd_type is not None:
            dtypes[k] = pd_type
    return df.transform(
        {
            **{
                column: lambda x: x
                for column in df.columns
            },
            **dtypes,
        },
    )
This works, but it's extremely slow. Is there anything I could change to improve its performance? Previously I was using the `astype` function, but it's not as flexible as this approach.
Answer 1
Score: 2
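To transform your dataframe column by column, you can use `df.apply`, which returns a new DataFrame (assign the result back if you want to replace the original):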
import json

import pandas as pd


def convert_pandas_columns(df: pd.DataFrame, metadata: dict | None) -> pd.DataFrame:
    """
    Convert the columns of a pandas DataFrame `df` according to a metadata dict `metadata`.
    """
    if metadata is None or 'columns' not in metadata:
        # Skip conversion if no metadata is available
        return df

    # Each converter takes a whole column (Series) and returns the converted Series
    mappings = {
        'str': lambda x: x.astype(str, errors='ignore'),
        'int': lambda x: pd.to_numeric(x, downcast='integer', errors='coerce'),
        'float': lambda x: pd.to_numeric(x, downcast='float', errors='coerce'),
        'bool': lambda x: x.astype(bool, errors='ignore'),
        'datetime': lambda x: pd.to_datetime(x, errors='coerce'),
        # always expect a JSON string; parse per cell, since x is a whole column
        'list': lambda x: x.map(lambda v: json.loads(v) if v else []),
    }

    def create_apply_map(functions):
        def f(col):
            # Columns without a converter (e.g. col4 below) are returned unchanged
            convert = functions.get(col.name)
            return col if convert is None else convert(col)
        return f

    dtypes = {}
    for k, v in metadata['columns'].items():
        pd_type = mappings.get(v['type'], None)
        if pd_type is not None:
            dtypes[k] = pd_type
    return df.apply(create_apply_map(dtypes))
Output:
# sample
>>> df
    col1        col2 col3  col4
0  hello  2023-06-08  3.0  12.2
1  world  2023-06-09  5.0   4.3
# before
>>> df.dtypes
col1     object
col2     object
col3     object
col4    float64
dtype: object
# after
>>> df.dtypes
col1            object
col2    datetime64[ns]
col3           float32
col4           float64
dtype: object
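As a possible alternative to `df.apply`, here is a minimal sketch that reassigns only the columns named in the metadata, using a plain loop over a copy of the frame. The names `CONVERTERS`, `convert_listed_columns` and `_parse_json_list` are hypothetical and not from the original post. Whether this is actually faster than the `apply` version depends on the data, so it is worth timing on your own frame.

```python
import json

import pandas as pd


def _parse_json_list(value):
    """Parse one cell as JSON; empty or non-string cells become an empty list."""
    if isinstance(value, str) and value:
        return json.loads(value)
    return []


# Hypothetical converter table: each entry maps a whole Series to a Series.
CONVERTERS = {
    "str": lambda s: s.astype(str),
    "int": lambda s: pd.to_numeric(s, downcast="integer", errors="coerce"),
    "float": lambda s: pd.to_numeric(s, downcast="float", errors="coerce"),
    "bool": lambda s: s.astype(bool),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
    "list": lambda s: s.map(_parse_json_list),
}


def convert_listed_columns(df, metadata):
    """Return a copy of `df` with only the columns listed in `metadata` converted."""
    if not metadata or "columns" not in metadata:
        return df
    out = df.copy()
    for name, spec in metadata["columns"].items():
        convert = CONVERTERS.get(spec.get("type"))
        # Skip unknown types and columns that are not present in the frame.
        if convert is not None and name in out.columns:
            out[name] = convert(out[name])
    return out


# Sample frame mirroring the one shown in the answer's output.
df = pd.DataFrame(
    {
        "col1": ["hello", "world"],
        "col2": ["2023-06-08", "2023-06-09"],
        "col3": ["3.0", "5.0"],
        "col4": [12.2, 4.3],
    }
)
metadata = {
    "columns": {
        "col1": {"type": "str"},
        "col2": {"type": "datetime"},
        "col3": {"type": "float"},
    }
}
converted = convert_listed_columns(df, metadata)
print(converted.dtypes)
```

Because only the listed columns are reassigned, untouched columns such as `col4` keep their dtype without passing through an identity function, and the input frame is left unmodified.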