Pandas convert multiple columns datatypes at once
Question
I need to change the dtypes of some columns of two different DataFrames in my code. They are originally Spark DataFrames, so when I convert them directly to pandas I don't have the option to set the category dtype (that's why I am doing this as a secondary step, not when the DataFrame is created).
The DataFrames have most, but not all, of their columns in common. I decided to create a single function to handle both cases. The logic stores the column names and dtypes in a dictionary and then loops through the dict items to convert the columns of the DataFrame.
The DataFrames are large (around 500k rows, 30-40 columns). The best way I have found so far is to use .apply to perform the transformation.
Is there a faster way of doing this, instead of using .apply with a lambda, that I could try? Here's a simple example with the function I created:
def dtypes_preprocess(df):
    cols_cat = ['A', 'B', 'C', 'D', 'E']
    cols_float = ['F']
    cols_numerical_int32 = ['G', 'H']
    dict_dtypes = {'category': cols_cat, 'float': cols_float, 'int32': cols_numerical_int32}
    for data_type, columns_dict in dict_dtypes.items():
        existing_cols = list(set(columns_dict).intersection(df.columns))  # check which columns from the list exist in the dataframe
        df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors='ignore'))  # perform the transformation
    return df
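For reference, here's a minimal usage sketch of the function above (the sample data and the dtypes noted in the comment are my own illustration, not part of the original code):

import pandas as pd

# Hypothetical sample frame; only 'A', 'F' and 'G' from the dtype lists exist here
df = pd.DataFrame({
    'A': ['x', 'y', 'x'],
    'F': ['1.5', '2.0', '3.0'],
    'G': [1, 2, 3],
    'Z': ['keep', 'me', 'as-is'],  # not listed in any dtype group, so it is left untouched
})

df = dtypes_preprocess(df)
print(df.dtypes)  # A -> category, F -> float64, G -> int32, Z -> object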
I was already able to improve a bit, my first attempt was like this:
for data_type, columns in dict_dtypes.items():
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype(data_type, errors='ignore')
Any suggestion is appreciated. Thank you.
Answer 1
Score: 1
I think your attempt is not bad. As @jqurious mentioned in the comments (thanks a lot for that), it could be done by passing a name -> dtype mapping to df.astype.
So we just need to "invert" the dict_dtypes and choose only the column names present in df.
My idea:
def dtypes_preprocess(df):
    dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
    }
    dtypes = {col: dtype for dtype in dict_dtypes for col in dict_dtypes[dtype] if col in df}
    return df.astype(dtypes, errors='ignore')
The dictionary comprehension looks a bit harder to read, though -- any suggestions are welcome here.
(The idea of "inverting" the dict_dtypes was taken from this question.)
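On the readability point above: an equivalent version of the same inversion written with explicit loops (just a sketch; as far as I can tell it behaves the same) could look like this:

def dtypes_preprocess(df):
    dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
    }
    # Build the column -> dtype mapping with plain loops instead of a comprehension
    dtypes = {}
    for dtype, columns in dict_dtypes.items():
        for col in columns:
            if col in df.columns:
                dtypes[col] = dtype
    # A single astype call with a dict mapping converts all listed columns at once
    return df.astype(dtypes, errors='ignore')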