Pandas convert multiple columns datatypes at once
Question
I need to change the dtypes of some columns of two different DataFrames in my code. They are originally Spark DataFrames, so when I convert them directly to pandas I don't have the option to set the category dtype (that's why I am doing this as a secondary step, not when the DataFrame is created).
The DataFrames have most, but not all, of their columns in common. I decided to create a single function to handle both cases. The logic stores the column names and dtypes in a dictionary and then loops through the dict items to convert the columns of the DataFrame.
The DataFrames are large (around 500k rows, 30-40 columns). The best way I have found so far is to use .apply to perform the transformation.
Is there a faster way of doing this, instead of using .apply with a lambda, that I could try? Here's a simple example with the function I created:
def dtypes_preprocess(df):
    cols_cat = ['A', 'B', 'C', 'D', 'E']
    cols_float = ['F']
    cols_numerical_int32 = ['G', 'H']
    dict_dtypes = {'category': cols_cat, 'float': cols_float, 'int32': cols_numerical_int32}
    for data_type, columns_dict in dict_dtypes.items():
        existing_cols = list(set(columns_dict).intersection(df.columns))  # check which columns from the list exist in the dataframe
        df[existing_cols] = df[existing_cols].apply(lambda x: x.astype(data_type, errors='ignore'))  # perform the transformation
    return df
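For reference, here's a minimal usage sketch of the function above (the sample data and the dtypes noted in the comment are my own illustration, not part of the original code):

import pandas as pd

# Hypothetical sample frame; only 'A', 'F' and 'G' from the dtype lists exist here
df = pd.DataFrame({
    'A': ['x', 'y', 'x'],
    'F': ['1.5', '2.0', '3.0'],
    'G': [1, 2, 3],
    'Z': ['keep', 'me', 'as-is'],  # not listed in any dtype group, so it is left untouched
})

df = dtypes_preprocess(df)
print(df.dtypes)  # A -> category, F -> float64, G -> int32, Z -> object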
I was already able to improve a bit, my first attempt was like this:
for data_type, columns in dict_dtypes.items():
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype(data_type, errors='ignore')
Any suggestion is appreciated. Thank you.
Answer 1
Score: 1
I think your attempt is not bad. As @jqurious mentioned in the comments (thanks a lot for that), it could be done by passing a name -> dtype mapping to df.astype.
So we just need to "invert" the dict_dtypes and choose only the column names present in df.
My idea:
def dtypes_preprocess(df):
    dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
    }
    dtypes = {col: dtype for dtype in dict_dtypes for col in dict_dtypes[dtype] if col in df}
    return df.astype(dtypes, errors='ignore')
The dictionary comprehension looks a bit harder to read, though -- any suggestions are welcome here.
(The idea of "inverting" the dict_dtypes was taken from this question.)
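On the readability point above: an equivalent version of the same inversion written with explicit loops (just a sketch; as far as I can tell it behaves the same) could look like this:

def dtypes_preprocess(df):
    dict_dtypes = {
        'category': ['A', 'B', 'C', 'D', 'E'],
        'float': ['F'],
        'int32': ['G', 'H']
    }
    # Build the column -> dtype mapping with plain loops instead of a comprehension
    dtypes = {}
    for dtype, columns in dict_dtypes.items():
        for col in columns:
            if col in df.columns:
                dtypes[col] = dtype
    # A single astype call with a dict mapping converts all listed columns at once
    return df.astype(dtypes, errors='ignore')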