In polars, is there a way to remove character accents from string columns?

Question

I want to remove character accents from a text column, e.g. convert "Piña" to "Pina".
This is how I would do it in pandas:

(names
 .str.normalize('NFKD')
 .str.encode('ascii', errors='ignore')
 .str.decode('utf-8'))

Polars has str.decode and str.encode, but they don't seem to be what I'm looking for.
Thanks!

Answer 1

Score: 3

To expand on @jqurious's comment, you can do one of two things:

  1. apply/lambda

like this:

from unicodedata import normalize
import polars as pl

# Element-wise: the lambda runs once per value in column 'a'.
df.with_columns(
    a=pl.col('a')
        .apply(lambda x: normalize('NFKD', x)
                        .encode('ascii', errors='ignore')
                        .decode('utf-8')))

  2. define function/map

like this:

from unicodedata import normalize

def custnorm(In_series):
    # Receives the whole column as a Series and overwrites changed values in place.
    for i, x in enumerate(In_series):
        newvalue = normalize('NFKD', x).encode('ascii', errors='ignore').decode('utf-8')
        if newvalue != x:
            In_series[i] = newvalue
    return In_series

Then pass it to map inside with_columns:

df.with_columns(a=pl.col('a').map(custnorm))
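Both variants rely on the same plain-Python chain. As a minimal sketch of what each step does, using only the standard library (the helper name `strip_accents` is hypothetical, not part of Polars or pandas):

```python
from unicodedata import normalize


def strip_accents(text):
    # Step 1: NFKD splits "ñ" into "n" plus a combining tilde (U+0303).
    decomposed = normalize("NFKD", text)
    # Step 2: encoding to ASCII with errors="ignore" drops every non-ASCII
    # code point, which includes the combining marks.
    ascii_bytes = decomposed.encode("ascii", errors="ignore")
    # Step 3: decode the remaining bytes back into a str.
    return ascii_bytes.decode("utf-8")


print(strip_accents("Piña"))  # Pina
```

Note that characters with no ASCII base letter are dropped entirely rather than transliterated, so "Straße" becomes "Strae".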

The difference between apply and map is that apply tells polars to loop over the column one value at a time, calling the function once per value, whereas map passes the whole column to the function as a single Series, and the function must then return a Series of the same size.

huangapple
  • Posted on June 13, 2023, 15:22:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76462548.html