June 8, 2023 16:20:28

QST: What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?

Question

I want to convert string data that is really boolean (or null), e.g. values such as y / n / NA, or true / false / NA, or even a mix of these.

When using pandas with the default numpy backend, conversion from string data to boolean works smoothly:

import pandas as pd

df = pd.DataFrame({"col1": ["true", None, "false"]})
assert df.dtypes["col1"] == "object", df.dtypes["col1"]
# convert to boolean (note: with numpy bool, None becomes False)
df["col1"] = df["col1"].replace({'true': True, 'false': False}).astype(bool)
assert df.dtypes["col1"] == bool, df.dtypes["col1"]

However, when using the pyarrow backend (in my use case I was actually using pd.read_parquet with dtype_backend, but I set the type explicitly in the example below):

df_pyarrow = pd.DataFrame(
    {"col1": ["true", None, "false"]}, dtype="string[pyarrow]"
)
assert df_pyarrow.dtypes["col1"] == "string", df_pyarrow.dtypes["col1"]
df_pyarrow["col1"] = (
    df_pyarrow["col1"]
    .replace({'true': True, 'false': False})  # fails at this step
    .astype(bool)
)
assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]

This fails at the .replace() step because pyarrow complains, rightly, that True and False are not valid values for a string[pyarrow] column: TypeError: Scalar must be NA or str.

I have found that this method works:

df_pyarrow["col1"] = df_pyarrow["col1"] == "true"

assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]

However:

- df_pyarrow.info() still says col1 is string[pyarrow]
- this method isn't as flexible: what if there were multiple values for True/False?

What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?

Answer 1

Score: 2

You should use map instead of replace:

df_pyarrow["col1"] = (
    df_pyarrow["col1"]
    .map({'true': True, 'false': False})
    .astype("bool[pyarrow]")
)

It works with numpy because an array of strings in numpy is actually an array of objects, so you can replace strings with booleans and they are still objects.
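A small sketch of that point, using a pandas Series backed by a numpy object array. Passing dtype=object explicitly is an assumption made here, since newer pandas versions may infer a dedicated string dtype instead:

```python
import pandas as pd

# Explicit object dtype: each element is an arbitrary Python object.
s = pd.Series(["true", None, "false"], dtype=object)

# Swapping a string for a boolean is accepted; the column simply stays object.
s.iloc[0] = True

print(s.dtype)     # object
print(s.tolist())  # [True, None, 'false']
```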

pandas with the pyarrow backend is stricter about types.

Answer 2

Score: 1

If I use:

df = (pd.read_parquet('data.parquet', dtype_backend='pyarrow')
        .astype({'col1': 'bool[pyarrow]'}))

I get:

>>> df
    col1
0   True
1   <NA>
2  False

>>> df.dtypes
col1    bool[pyarrow]
dtype: object

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype        
---  ------  --------------  -----        
 0   col1    2 non-null      bool[pyarrow]
dtypes: bool[pyarrow](1)
memory usage: 130.0 bytes

Minimal reproducible example (run this first to create data.parquet):

import pandas as pd

df = pd.DataFrame({'col1': ['true', None, 'false']})
df.to_parquet('data.parquet', engine='pyarrow')
