June 9, 2023, 04:15:24

Enforcing datatypes using pandas .astype not working as expected when followed by .replace

Question

I have a dataframe with the following columns:

Index(['tiername', 'month', 'network', 'specnewsmarket', 'year', 'adjeng',
       'hev', 'subs', 'SN Market Grp', 'Region', 'State', 'CleanStnNm',
       'Quarter', 'StnGrp', 'StnGrpOrder', 'CleanStnNm_AllStns',
       '2021 YTD HEV', '2020 YTD HEV', 'yymm']

I'm trying to enforce these columns to be of a particular data type, as well as clean up some 'false' inputs before I do some manipulations. To do this, I read in a mapping from a CSV (pipe delimited) and then apply that mapping in a function like so:

import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    return df

My mapping file can be found here, along with some sample data.

What I've discovered is that the .replace call somehow messes with my datatypes (for my last column in particular). For example if I exclude the .replace it works as expected with df.dtypes resulting in: yymm object

But after the .replace call it somehow reverts it to: yymm int64

I could probably just hardcode the dtype in there, but if someone can explain to me why this is happening that would be great!

Answer 1

Score: 1

This is probably because the .replace() method tries to infer the best data type for each object column after the replacement. Note that .astype(object) changes only the column's dtype; the values in yymm are presumably still integers. When .replace() re-infers the dtype of that column, pandas sees that every remaining value can be represented as an integer and converts the column back to int64. The exact behavior may depend on your pandas version.
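The inference step described above can be made visible with a small sketch. It assumes yymm starts out as int64 and is mapped to object, which the question implies but does not show; the sample values are made up. infer_objects() is used here only to show what dtype inference does to such a column. Whether .replace() performs the same step internally depends on the pandas version, so treat that link as an assumption.

```python
import pandas as pd

# An int64 column cast to object still holds integers, not strings.
yymm = pd.Series([2101, 2102, 2103]).astype(object)
print(yymm.dtype)        # object

# Re-inferring the dtype of that object column brings int64 back.
reinferred = yymm.infer_objects()
print(reinferred.dtype)  # int64

# Casting to str stores real strings, so no integers are left to infer.
as_text = pd.Series([2101, 2102, 2103]).astype(str)
print(as_text.iloc[0])   # 2101 (now a string)
```

If yymm is meant to hold text, mapping it to str rather than object sidesteps the problem, because the values themselves are no longer integers.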

To solve this issue, you can enforce the desired data types again after the .replace() call. Here's an updated version of your force_columnDtypes() function:

import pandas as pd

def force_columnDtypes(df) -> pd.DataFrame:
    """Ensures datatypes are correct before pivoting."""
    dtypesMapping = pd.read_csv('references/Mappings/dtypes.txt', delimiter='|', index_col=0)['0']
    df['month'] = df['month'].astype(float)  # Change month to float before converting to int
    df = df.astype(dtypesMapping)
    df = df.replace(['FALSE', False, 'False'], 'False')
    df = df.astype(dtypesMapping)  # Enforce data types again after replace
    return df

Let me know whether it works, since this may not be the only cause.
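Another option, offered only as a sketch, is to run .replace() on just the columns that can contain the 'false' inputs, so that yymm is never touched. The helper name replace_false_like, its columns parameter, and the sample frame are hypothetical; the question does not say which columns hold those values.

```python
import pandas as pd

def replace_false_like(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Normalize 'false' inputs to the string 'False' in the given columns only."""
    out = df.copy()
    # Columns outside `columns` are not passed to replace, so their dtypes stay as they are.
    out[columns] = out[columns].replace(['FALSE', False, 'False'], 'False')
    return out

# Made-up sample data; both columns are forced to object, as a mapping might do.
df = pd.DataFrame({
    'State': ['NY', False, 'FALSE'],
    'yymm': [2101, 2102, 2103],
}).astype({'State': object, 'yymm': object})

cleaned = replace_false_like(df, ['State'])
print(cleaned['State'].tolist())  # ['NY', 'False', 'False']
print(cleaned['yymm'].dtype)      # object
```

This avoids casting twice, at the cost of having to list the affected columns explicitly.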
