Pandas同时转换多列数据类型

huangapple go评论65阅读模式
英文:

Pandas convert multiple columns datatypes at once

问题

我需要更改我的代码中两个不同DataFrame的一些列的数据类型。它们最初是Spark DataFrame,所以当我将它们直接转换为pandas时,我没有设置为分类数据类型的选项(这就是为什么我在DF创建时不这样做的原因)。

这两个DF有大部分但不是全部列都相同。我决定创建一个函数来处理这两种情况。我创建了一个逻辑,将列名和数据类型存储为字典,然后遍历字典项以转换DataFrame的列。

这些DataFrame很大(大约500k行,30-40列)。到目前为止,我找到的最好方法是使用.apply来执行转换。

除了使用.apply和lambda之外,是否有更快的方法可以尝试?这里是我创建的一个简单示例函数:

def dtypes_preprocess(df):

  cols_cat = ['A', 'B', 'C', 'D','E']
  cols_float = ['F']
  cols_numerical_int32 = ['G', 'H']

  dict_dtypes = {'category' : cols_cat, 'float' : cols_float, 'int32' : cols_numerical_int32}

  for data_type, columns_dict in dict_dtypes.items():
    existing_cols = list(set(columns_dict).intersection(df.columns)) # 检查哪些列在DataFrame中存在
    df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors = 'ignore')) # 执行转换
    
  return df

我已经能够稍微改进了,我的第一次尝试是这样的:

for data_type, columns in dict_dtypes.items():
  for col in columns:
    if col in df.columns:
        df[col] = df[col].astype(data_type, errors = 'ignore')

欢迎任何建议。谢谢。

英文:

I need to change the dtypes of some columns of two different DF in my code. They are originally a Spark DF, so directly when I convert them to pandas I don't have the option to set to category dtype (that's why I am doing as a secondary step, not when the DF is created).

The DF's have most, not all, of the columns in common. I decided to create only one function to handle both cases. I created a logic where I store the columns names and the dtypes as a dictionary and then I loop through the dict items to convert the columns of the dataframe.

The dataframes are large (around 500k rows, 30-40 columns). The best way I found so far was using .apply to perform the transformation.

Would there be a faster way of doing this instead of using .apply and lambda that I could try? Here's a simple example with the function I created:

def dtypes_preprocess(df):

  cols_cat = ['A', 'B', 'C', 'D','E']
  cols_float = ['F']
  cols_numerical_int32 = ['G', 'H']

  dict_dtypes = {'category' : cols_cat, 'float' : cols_float, 'int32' : cols_numerical_int32}

  for data_type, columns_dict in dict_dtypes.items():
    existing_cols = list(set(columns_dict).intersection(df.columns)) #check which columns from the list exist in the dataframe
    df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors = 'ignore')) #perform the transformation
    
  return df

I was already able to improve a bit, my first attempt was like this:

for data_type, columns in dict_dtypes.items():
  for col in columns:
    if col in df.columns:
        df[col] = df[col].astype(data_type, errors = 'ignore')

Any suggestion is appreciated. Thank you.

答案1

得分: 1

我觉得你的尝试还不错。正如@jqurious在评论中提到的(非常感谢),可以通过将映射 name -> dtype 传递给 df.astype 来完成。

所以我们只需要将 dict_dtypes 进行“反转”,并选择仅在 df 中存在的列名。

我的想法:

def dtypes_preprocess(df):
  dict_dtypes = {
     'category' : ['A', 'B', 'C', 'D','E'],
     'float' : ['F'],
     'int32' : ['G', 'H']
  }
  dtypes = {col: dtype for dtype in dict_dtypes for col in dict_dtypes[dtype] if col in df}
  return df.astype(dtypes, errors='ignore')

字典推导看起来有点难以阅读,不过——对此欢迎任何建议。

(“反转” dict_dtypes 的想法来自于 这个问题。)

英文:

I think your attempt is not bad. As @jqurious mentioned in the comment (thanks a lot for that), it could be done by passing mapping name -> dtype to df.astype.

So we just need to "invert" the dict_dtypes and choose only the column names present in df

My idea:

def dtypes_preprocess(df):
  dict_dtypes = {
     'category' : ['A', 'B', 'C', 'D','E'],
     'float' : ['F'],
     'int32' : ['G', 'H']
  }
  dtypes = {col: dtype for dtype in dict_dtypes for col in dict_dtypes[dtype] if col in df}
  return df.astype(dtypes, errors='ignore')

The dictionary comprehension looks a bit harder to read, though -- any suggestions are welcome here.

(Idea for "inverting" the dict_dtypes taken from this question.)

huangapple
  • 本文由 发表于 2023年5月28日 21:45:59
  • 转载请务必保留本文链接:https://go.coder-hub.com/76351799.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定