QST: What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?

Question

I want to convert string data that is really boolean (or null), e.g. values such as y / n / NA, or true / false / NA, or even a mix of these.

With pandas and the default numpy backend, converting string data to boolean works smoothly:

import pandas as pd

df = pd.DataFrame({"col1": ["true", None, "false"]})
assert df.dtypes["col1"] == "object", df.dtypes["col1"]
# convert to boolean
df["col1"] = df["col1"].replace({'true': True, 'false': False}).astype(bool)
assert df.dtypes["col1"] == bool, df.dtypes["col1"]
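
One caveat with the numpy-backed version above: numpy's bool dtype has no way to represent a missing value, so the None row cannot survive .astype(bool) as a null. A minimal sketch, assuming pandas' nullable "boolean" extension dtype is acceptable for your use case (the mapping dict and variable names are illustrative, not from the original post):

```python
import pandas as pd

# Object-dtype column, as in the numpy-backed example.
s = pd.Series(["true", None, "false"], dtype=object)

# Illustrative lookup table (an assumption, adjust to your data).
mapping = {"true": True, "false": False}

# map() leaves anything not found in the dict (including None) as NaN;
# the nullable "boolean" dtype then stores that row as <NA>.
nullable = s.map(mapping).astype("boolean")

print(nullable.dtype)
print(nullable.isna().tolist())
```

Whether a null should stay null or be treated as False is a data decision; this only shows how to keep it if you need to.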

With the pyarrow backend, however, the same approach does not work (in my use case I was actually using pd.read_parquet with dtype_backend, but I set the dtype explicitly in the example below):

df_pyarrow = pd.DataFrame(
    {"col1": ["true", None, "false"]}, dtype="string[pyarrow]"
)
assert df_pyarrow.dtypes["col1"] == "string", df_pyarrow.dtypes["col1"]
df_pyarrow["col1"] = (
    df_pyarrow["col1"]
    .replace({'true': True, 'false': False})  # fails at this step
    .astype(bool)
)
assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]

This fails at the .replace() step because pyarrow complains, rightly, that True and False are not valid values for a string[pyarrow] column: TypeError: Scalar must be NA or str.

I have found that this method works:

df_pyarrow["col1"] = df_pyarrow["col1"] == "true"

assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]

However:

  • df_pyarrow.info() still says col1 is string[pyarrow]
  • this method isn't as flexible: what if there were multiple values for True/False?

What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?

Answer 1

Score: 2

You should use map instead of replace:

df_pyarrow["col1"] = (
    df_pyarrow["col1"]
    .map({'true': True, 'false': False})
    .astype("bool[pyarrow]")
)

replace works with the numpy backend because a string column there is actually an array of Python objects. You can swap strings for booleans and they are still objects.

With the pyarrow backend, pandas is stricter about types: as the error message says, a string[pyarrow] column only accepts str or NA.

Answer 2

Score: 1

If I use:

df = (pd.read_parquet('data.parquet', dtype_backend='pyarrow')
        .astype({'col1': 'bool[pyarrow]'}))

I get:

>>> df
    col1
0   True
1   <NA>
2  False

>>> df.dtypes
col1    bool[pyarrow]
dtype: object

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype        
---  ------  --------------  -----        
 0   col1    2 non-null      bool[pyarrow]
dtypes: bool[pyarrow](1)
memory usage: 130.0 bytes

Minimal Reproducible Example:

df = pd.DataFrame({'col1': ['true', None, 'false']})
df.to_parquet('data.parquet', engine='pyarrow')

huangapple
  • Posted on June 8, 2023, 16:20:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76429921.html