Enforcing datatypes using pandas .astype not working as expected when followed by .replace

Question

I have a dataframe with the following columns:

Index(['tiername', 'month', 'network', 'specnewsmarket', 'year', 'adjeng',
       'hev', 'subs', 'SN Market Grp', 'Region', 'State', 'CleanStnNm',
       'Quarter', 'StnGrp', 'StnGrpOrder', 'CleanStnNm_AllStns',
       '2021 YTD HEV', '2020 YTD HEV', 'yymm']

I'm trying to enforce these columns to be of a particular data type, as well as clean up some 'false' inputs before I do some manipulations. To do this, I read in a mapping from a CSV (pipe delimited) and then apply that mapping in a function like so:

import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    return df

My mapping file can be found here, along with some sample data.
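
The mapping file itself is not reproduced in this post. As a rough sketch only, assuming it is a two-column pipe-delimited file whose value column is headed 0 (which is what the ['0'] lookup suggests), the mapping step would behave roughly like this. The file contents and the two-column frame below are made up for illustration:

```python
import io

import pandas as pd

# Assumed (hypothetical) contents of dtypes.txt: column name | dtype string,
# with a header row whose value column is named '0'.
mapping_text = "|0\nmonth|int64\nyymm|object\n"

# Same read_csv call as in the function, but reading from an in-memory buffer.
dtypes_mapping = pd.read_csv(io.StringIO(mapping_text), delimiter="|", index_col=0)["0"]

# Made-up frame with only the two columns named in the assumed mapping.
df = pd.DataFrame({"month": [1.0, 2.0], "yymm": [2101, 2102]})

# astype accepts a Series that maps column name -> dtype.
converted = df.astype(dtypes_mapping)
print(converted.dtypes)
```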

What I've discovered is that the .replace call changes my datatypes (for my last column in particular). For example, if I exclude the .replace, it works as expected and df.dtypes shows: yymm object

But after the .replace call, it somehow reverts to: yymm int64

I could probably just hardcode the dtype in there, but if someone can explain why this is happening, that would be great!

Answer 1

Score: 1

This is probably because the .replace() method tries to re-infer the dtype of object columns after the replacement. Your mapping casts yymm to object, but the values it holds are still integers, so once 'FALSE', False, and 'False' have been replaced with 'False', pandas may infer that the column can be stored as an integer dtype and convert it back.
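
To make that kind of inference concrete, here is a small sketch on a made-up series (not your data). It uses infer_objects(), which performs this sort of conversion explicitly. That .replace applies the same conversion internally is an assumption on my part, and it may depend on your pandas version, so the last dtype is printed rather than stated:

```python
import pandas as pd

# Hypothetical stand-in for the 'yymm' column: integers stored as object dtype.
yymm = pd.Series([2101, 2102], name="yymm").astype("object")
print(yymm.dtype)  # object

# infer_objects() is the explicit form of this kind of soft conversion:
# an object column holding only integers becomes int64.
inferred = yymm.infer_objects()
print(inferred.dtype)  # int64

# Whether .replace does the same implicitly depends on the pandas version,
# so the dtype is printed here rather than assumed.
replaced = yymm.replace(["FALSE", False, "False"], "False")
print(replaced.dtype)
```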

To solve this issue, you can enforce the desired data types again after the .replace() call. Here's an updated version of your force_columnDtypes() function:

import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    df = df.astype(dtypesMapping)  # Enforce data types again after replace
    return df

Let me know whether it works, because this may not be the only cause.
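
Another option, if only some columns can contain the 'false' spellings, is to run the replacement on just those columns so the others are never touched. This is only a sketch: replace_false_tokens and the flag column are hypothetical names, not part of your code.

```python
import pandas as pd


def replace_false_tokens(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Normalise 'false' spellings only in the named columns (hypothetical helper)."""
    out = df.copy()
    # Only the selected columns go through .replace; every other column is left as-is.
    out[columns] = out[columns].replace(["FALSE", False, "False"], "False")
    return out


# Made-up frame: 'yymm' holds integers stored as object, 'flag' holds the messy values.
df = pd.DataFrame({"yymm": [2101, 2102], "flag": ["FALSE", "True"]}).astype(
    {"yymm": "object", "flag": "object"}
)

cleaned = replace_false_tokens(df, ["flag"])
print(cleaned.dtypes)
```

Since yymm never passes through .replace here, its dtype stays whatever the mapping set it to.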
