English:
QST: What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?
Question
To convert a column of type string[pyarrow] to boolean within a pandas DataFrame, you can use the following canonical approach:
import pandas as pd

# Create the DataFrame with a pyarrow-backed string column
df_pyarrow = pd.DataFrame(
    {"col1": ["true", None, "false"]}, dtype="string[pyarrow]"
)

# Convert the string column to a boolean column with .eq()
df_pyarrow["col1"] = df_pyarrow["col1"].eq("true").astype("bool[pyarrow]")

# Verify that the dtype is now boolean
assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]"
这种方法使用 .eq("true")
来将字符串列的每个元素与字符串 "true" 进行比较,并将结果转换为布尔类型。这个方法会更灵活,因为它可以处理多个不同的字符串值,只要它们在逻辑上等同于 True
。此外,通过 .astype("bool[pyarrow]")
将数据类型明确设置为布尔类型。
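As a minimal sketch of that flexibility (the extra "y" / "yes" spellings and the df_mixed name are illustrative, not from the post), several "truthy" strings can be collapsed with Series.isin before casting; note that isin treats missing values as not contained, so <NA> becomes False here instead of staying null:
import pandas as pd

df_mixed = pd.DataFrame(
    {"col1": ["y", None, "false"]}, dtype="string[pyarrow]"
)

# Anything in `truthy` becomes True; every other string (and <NA>) becomes False.
truthy = ["true", "y", "yes", "1"]
df_mixed["col1"] = df_mixed["col1"].isin(truthy).astype("bool[pyarrow]")

assert df_mixed.dtypes["col1"] == "bool[pyarrow]"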
Note that if the comparison result is not cast explicitly, df_pyarrow.info() may still report the column as string[pyarrow] even though the values behave like booleans; the explicit .astype("bool[pyarrow]") above ensures that the reported dtype is boolean as well.
This is the canonical way to convert a column of type string[pyarrow] to boolean in pandas.
English:
I want to convert string data which is really boolean (or null), e.g. values are y / n / NA, or true / false / NA, or even a mix of these.
When using pandas with the default numpy backend, conversion from string data to boolean works smoothly:
import pandas as pd
df = pd.DataFrame({"col1": ["true", None, "false"]})
assert df.dtypes["col1"] == "object", df.dtypes["col1"]
# convert to boolean
df["col1"] = df["col1"].replace({'true': True, 'false': False}).astype(bool)
assert df.dtypes["col1"] == bool, df.dtypes["col1"]
However, when using the pyarrow backend (in my use case I was actually using pd.read_parquet with dtype_backend - but I set the type explicitly in the example below):
df_pyarrow = pd.DataFrame(
    {"col1": ["true", None, "false"]}, dtype="string[pyarrow]"
)
assert df_pyarrow.dtypes["col1"] == "string", df_pyarrow.dtypes["col1"]
df_pyarrow["col1"] = (
df_pyarrow["col1"]
.replace({'true': True, 'false': False}) # fails at this step
.astype(bool)
)
assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]
but this fails at the .replace() step because pyarrow complains, rightly, that True and False are not valid values for a string[pyarrow] column: TypeError: Scalar must be NA or str.
I have found that this method works:
df_pyarrow["col1"] = df_pyarrow["col1"] == "true"
assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]", df_pyarrow.dtypes["col1"]
However:
- df_pyarrow.info() still says col1 is string[pyarrow]
- this method isn't as flexible: what if there were multiple values for True/False?
What is the canonical way to convert a column of type string[pyarrow] to boolean within a pandas dataframe?
Answer 1
Score: 2
You should use map instead of replace.
df_pyarrow["col1"] = (
df_pyarrow["col1"]
.map({'true': True, 'false': False})
.astype("bool[pyarrow]")
)
This works with numpy because a string array in numpy is actually an array of objects, so you can replace strings with booleans and they are still objects.
With pandas and the pyarrow backend, the types are stricter.
English:
You should use map instead of replace.
df_pyarrow["col1"] = (
df_pyarrow["col1"]
.map({'true': True, 'false': False})
.astype("bool[pyarrow]")
)
It works with numpy because an array of strings in numpy is actually an array of objects, so you can replace strings with booleans and they are still objects.
With pandas and the pyarrow backend, the types are stricter.
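If several spellings map onto the same boolean (the y / n / true / false mix from the question), the same map-then-cast pattern extends naturally. This is a sketch building on the answer above; the extra dictionary keys and sample values are assumptions, not something the answer prescribes:
import pandas as pd

df_pyarrow = pd.DataFrame(
    {"col1": ["y", None, "false"]}, dtype="string[pyarrow]"
)

# Hypothetical mapping covering the mixed spellings mentioned in the question.
bool_map = {
    "true": True, "y": True, "yes": True,
    "false": False, "n": False, "no": False,
}

df_pyarrow["col1"] = (
    df_pyarrow["col1"]
    .map(bool_map)            # unmapped strings and <NA> become NaN here...
    .astype("bool[pyarrow]")  # ...and end up as <NA> after the cast
)

assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]"
Unlike the isin variant sketched earlier, this keeps nulls as <NA> instead of collapsing them to False.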
Answer 2
Score: 1
If I use:
df = (pd.read_parquet('data.parquet', dtype_backend='pyarrow')
      .astype({'col1': 'bool[pyarrow]'}))
I get:
>>> df
    col1
0   True
1   <NA>
2  False
>>> df.dtypes
col1    bool[pyarrow]
dtype: object
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -------------
 0   col1    2 non-null      bool[pyarrow]
dtypes: bool[pyarrow](1)
memory usage: 130.0 bytes
Minimal Reproducible Example:
df = pd.DataFrame({'col1': ['true', None, 'false']})
df.to_parquet('data.parquet', engine='pyarrow')
English:
If I use:
df = (pd.read_parquet('data.parquet', dtype_backend='pyarrow')
      .astype({'col1': 'bool[pyarrow]'}))
I get:
>>> df
    col1
0   True
1   <NA>
2  False
>>> df.dtypes
col1    bool[pyarrow]
dtype: object
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -------------
 0   col1    2 non-null      bool[pyarrow]
dtypes: bool[pyarrow](1)
memory usage: 130.0 bytes
Minimal Reproducible Example:
df = pd.DataFrame({'col1': ['true', None, 'false']})
df.to_parquet('data.parquet', engine='pyarrow')
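The parquet round trip here is mainly a reproducible setup; the step doing the work is the astype to bool[pyarrow], which relies on Arrow casting the strings 'true' / 'false' to booleans. A minimal sketch, assuming that cast is also available directly on an in-memory pyarrow-backed string column (not something this answer demonstrates):
import pandas as pd

df_pyarrow = pd.DataFrame(
    {"col1": ["true", None, "false"]}, dtype="string[pyarrow]"
)

# Assumes Arrow's string -> boolean cast: "true"/"false" parse to True/False,
# and nulls stay <NA>.
df_pyarrow["col1"] = df_pyarrow["col1"].astype("bool[pyarrow]")

assert df_pyarrow.dtypes["col1"] == "bool[pyarrow]"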