How to use categorical data type with pyarrow dtypes?

Question

I'm working with the Arrow dtypes in pandas, and my dataframe has a variable that should be categorical, but I can't figure out how to convert it to the pyarrow data type for categorical data (dictionary).

According to the Arrow documentation on pandas conversion (https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion), the arrow data type I should be using is dictionary.

Usually, if you want pandas to use a pyarrow dtype you just add [pyarrow] to the name of the pyarrow type, for example dtype='string[pyarrow]'. I tried using dtype='dictionary[pyarrow]', but that yields the error:

data type 'dictionary[pyarrow]' not understood

I also tried 'categorical[pyarrow]', or 'category[pyarrow]', pyarrow.dictionary, pyarrow.dictionary(pyarrow.int16(), pyarrow.string()), and they didn't work either.

How can I use dictionary dtype on a pandas series?
pd.Series(['Chocolate', 'Candy', 'Waffles'], dtype='what_to_put_here????')

Answer 1

Score: 4

I believe pd.ArrowDtype is required:

dtype=pd.ArrowDtype(pa.dictionary(pa.int16(), pa.string()))

Posted by huangapple on May 11, 2023 at 03:34:32
Source: https://go.coder-hub.com/76222022.html