Pandas: speed up dataframe column conversion?

Question

I am trying to write a utility function that converts the columns of a generic pandas DataFrame based on column-specification metadata.


def convert_pandas_columns(df: pd.DataFrame, metadata: dict | None) -> pd.DataFrame:
    """
    Convert a pandas DataFrame `df` columns starting from a metadata dict `cols`:
    ```
    {
      "columns": {
        "col1": {
            "type": "str"
        },
        "col2": {
            "type": "datetime"
        },
        "col3": {
            "type": "float"
        }
      }
    }
    ```
    """

    if metadata is None or 'columns' not in metadata:
        # Skip conversion if no metadata is available
        return df

    mappings = {
        'str': lambda x: x.astype(str, errors='ignore'),
        'int': lambda x: pd.to_numeric(x, downcast='integer', errors='coerce'),
        'float': lambda x: pd.to_numeric(x, downcast='float', errors='coerce'),
        'bool': lambda x: x.astype(bool, errors='ignore'),
        'datetime': lambda x: pd.to_datetime(x, errors='coerce'),
        'list': lambda x: json.loads(x) if x else [],  # always expect a json object
    }

    dtypes = {}
    for k, v in metadata['columns'].items():
        pd_type = mappings.get(v['type'], None)
        if pd_type is not None:
            dtypes[k] = pd_type

    return df.transform(
        {
            **{
                column: lambda x: x
                for column in df.columns
            },
            **dtypes,
        },
    )

This works but it's extremely slow. Is there anything I could change to improve its performance? Previously I was using the astype function but it's not as flexible as this approach.
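
To make the flexibility gap concrete, here is a minimal sketch (the sample frame and its `n/a` value are made up for illustration): `astype` with a column-to-dtype mapping raises on a value it cannot parse, while `pd.to_numeric(..., errors='coerce')` turns that value into NaN.

```python
import pandas as pd

# Hypothetical sample data: one value in col3 cannot be parsed as a number.
df = pd.DataFrame({"col3": ["3.0", "5.0", "n/a"]})

# astype with a column -> dtype mapping is strict: the bad value raises.
try:
    df.astype({"col3": "float64"})
    astype_failed = False
except ValueError:
    astype_failed = True

# pd.to_numeric with errors="coerce" replaces the bad value with NaN instead.
coerced = pd.to_numeric(df["col3"], errors="coerce")
```

This is presumably why the per-type lambdas are used instead of a single `astype` call.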

Answer 1

Score: 2


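To convert the columns in a single `apply` call, you can use the following (note that `apply` returns a new DataFrame rather than modifying `df` in place):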

import json

import pandas as pd


def convert_pandas_columns(df: pd.DataFrame, metadata: dict | None) -> pd.DataFrame:
    """
    Convert the columns of a pandas DataFrame `df` according to a metadata dict.
    """

    if metadata is None or 'columns' not in metadata:
        # Skip conversion if no metadata is available
        return df

    mappings = {
        'str': lambda x: x.astype(str, errors='ignore'),
        'int': lambda x: pd.to_numeric(x, downcast='integer', errors='coerce'),
        'float': lambda x: pd.to_numeric(x, downcast='float', errors='coerce'),
        'bool': lambda x: x.astype(bool, errors='ignore'),
        'datetime': lambda x: pd.to_datetime(x, errors='coerce'),
        # Parse each cell separately; json.loads cannot take a whole Series
        'list': lambda x: x.map(lambda v: json.loads(v) if v else []),
    }

    def create_apply_map(functions):
        def f(col):
            # Columns without a converter are returned unchanged
            convert = functions.get(col.name)
            return col if convert is None else convert(col)
        return f

    dtypes = {}
    for k, v in metadata['columns'].items():
        pd_type = mappings.get(v['type'], None)
        if pd_type is not None:
            dtypes[k] = pd_type

    return df.apply(create_apply_map(dtypes))

Output:

# sample
>>> df
    col1       col2  col3  col4
0  hello 2023-06-08   3.0  12.2
1  world 2023-06-09   5.0   4.3

# before
>>> df.dtypes
col1     object
col2     object
col3     object
col4    float64
dtype: object

# after
>>> df.dtypes
col1            object
col2    datetime64[ns]
col3           float32
col4           float64
dtype: object
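
As an alternative sketch (not part of the answer above), the same conversion can be written as a plain loop that touches only the columns named in the metadata, so untouched columns never pass through a Python callable. The names `CONVERTERS` and `convert_listed_columns`, and the sample data, are hypothetical; whether this is actually faster depends on the data, so it is worth timing on your own frame.

```python
import json

import pandas as pd

# Hypothetical converter table, mirroring the `mappings` dict used above.
CONVERTERS = {
    "str": lambda s: s.astype(str),
    "int": lambda s: pd.to_numeric(s, downcast="integer", errors="coerce"),
    "float": lambda s: pd.to_numeric(s, downcast="float", errors="coerce"),
    "bool": lambda s: s.astype(bool),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
    "list": lambda s: s.map(lambda v: json.loads(v) if v else []),
}


def convert_listed_columns(df, metadata):
    """Convert only the columns named in metadata['columns']; copy the rest."""
    if not metadata or "columns" not in metadata:
        return df
    out = df.copy()  # leave the caller's frame untouched
    for name, spec in metadata["columns"].items():
        convert = CONVERTERS.get(spec.get("type"))
        # Skip unknown types and columns that are absent from the frame.
        if convert is not None and name in out.columns:
            out[name] = convert(out[name])
    return out


# Made-up sample resembling the frame shown in the output above.
sample = pd.DataFrame({
    "col1": ["hello", "world"],
    "col2": ["2023-06-08", "2023-06-09"],
    "col3": ["3.0", "5.0"],
    "col4": [12.2, 4.3],
})
meta = {
    "columns": {
        "col1": {"type": "str"},
        "col2": {"type": "datetime"},
        "col3": {"type": "float"},
    }
}
converted = convert_listed_columns(sample, meta)
```

Because the loop works on a copy, `sample` keeps its original values, and `col4`, which has no metadata entry, is carried over as-is.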

huangapple
  • Posted on June 8, 2023 at 16:52:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76430157.html