Enforcing datatypes using pandas .astype not working as expected when followed by .replace
Question
I have a dataframe with the following columns:
Index(['tiername', 'month', 'network', 'specnewsmarket', 'year', 'adjeng',
'hev', 'subs', 'SN Market Grp', 'Region', 'State', 'CleanStnNm',
'Quarter', 'StnGrp', 'StnGrpOrder', 'CleanStnNm_AllStns',
'2021 YTD HEV', '2020 YTD HEV', 'yymm']
I'm trying to enforce these columns to be of a particular data type, as well as clean up some 'false' inputs before I do some manipulations. To do this, I read in a mapping from a CSV (pipe delimited) and then apply that mapping in a function like so:
import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    return df
My mapping file can be found here.
Along with some sample data
What I've discovered is that the .replace call somehow changes my dtypes (for my last column in particular). For example, if I exclude the .replace, it works as expected, with df.dtypes showing: yymm object
But after the .replace call, it somehow reverts to: yymm int64
I could probably just hardcode the dtype in there, but if someone can explain why this is happening, that would be great!
Answer 1
Score: 1
This is probably because the .replace() method tries to infer the best data type for each column after the replacement. Since you're replacing 'FALSE', False, and 'False' with 'False', pandas may decide that an object column should be an integer column if all of the values left in it can be represented as integers.
To work around this, you can enforce the desired data types again after the .replace() call. Here's an updated version of your force_columnDtypes() function:
import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    df = df.astype(dtypesMapping)  # Enforce data types again after replace
    return df
Let me know whether it works, because this may not be the only cause.
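If you want to try the idea without the mapping file, here is a small self-contained sketch. The DTYPES dict, the sample frame, and the helper name clean_false_values are made up for illustration and only stand in for the real dtypes.txt and data. Whether .replace() actually re-infers the dtype depends on the pandas version, so the sketch just re-applies the mapping at the end and checks the result.

```python
import pandas as pd

# Hypothetical stand-in for the mapping read from dtypes.txt (assumption).
DTYPES = {"month": "int64", "flag": "object", "yymm": "object"}


def clean_false_values(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Normalize 'false' inputs, then re-apply the dtype mapping."""
    df = df.astype(dtypes)
    df = df.replace(["FALSE", False, "False"], "False")
    # replace() may re-infer dtypes of object columns in some pandas
    # versions, so the mapping is applied a second time afterwards.
    return df.astype(dtypes)


# Made-up sample data: 'yymm' holds integers but should stay object.
sample = pd.DataFrame(
    {
        "month": [1.0, 2.0, 3.0],
        "flag": ["FALSE", False, "True"],
        "yymm": [2101, 2102, 2103],
    }
)

result = clean_false_values(sample, DTYPES)
print(result.dtypes)
```

Re-applying the mapping is cheap when the dtypes already match, so it should be safe to keep even on pandas versions where .replace() leaves the dtypes alone.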