How to encode a string column into integers in Polars?

Question

I would like a simple way to encode the values of a given column (strings, for instance) as arbitrary integer identifiers.

import polars as pl

df = pl.DataFrame({
    "animal": ['elephant', 'dog', 'cat', 'mouse'],
    "country": ['Mexico', 'Denmark', 'Mexico', 'France'],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})
print(df)

shape: (4, 3)
┌──────────┬─────────┬────────┐
│ animal   ┆ country ┆ cost   │
│ ---      ┆ ---     ┆ ---    │
│ str      ┆ str     ┆ f64    │
╞══════════╪═════════╪════════╡
│ elephant ┆ Mexico  ┆ 1000.0 │
│ dog      ┆ Denmark ┆ 20.0   │
│ cat      ┆ Mexico  ┆ 10.0   │
│ mouse    ┆ France  ┆ 120.0  │
└──────────┴─────────┴────────┘

I would like to encode the animal and country columns to get something like:

shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal   ┆ country ┆ cost   ┆ animal_encoded ┆ country_encoded │
│ ---      ┆ ---     ┆ ---    ┆ ---            ┆ ---             │
│ str      ┆ str     ┆ f64    ┆ i64            ┆ i64             │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico  ┆ 1000.0 ┆ 0              ┆ 0               │
│ dog      ┆ Denmark ┆ 20.0   ┆ 1              ┆ 1               │
│ cat      ┆ Mexico  ┆ 10.0   ┆ 2              ┆ 0               │
│ mouse    ┆ France  ┆ 120.0  ┆ 3              ┆ 2               │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘

I thought that numbering the rows of the unique values and then using over(...) to expand the result back to the original number of rows could work, but I can't manage to implement it.
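The idea described above (number the distinct values, then map every row back to its number) can be sketched in plain Python. This is only an illustration of the intended semantics, not Polars code; the helper name encode_by_first_appearance is made up for this sketch. It numbers values in order of first appearance, which reproduces the desired table.

```python
def encode_by_first_appearance(values):
    """Map each distinct value to an integer, numbered in order of first appearance."""
    codes = {}  # distinct value -> integer identifier
    # setdefault stores the next free integer only the first time a value is seen;
    # on later occurrences it returns the identifier stored earlier.
    return [codes.setdefault(value, len(codes)) for value in values]


animal = ["elephant", "dog", "cat", "mouse"]
country = ["Mexico", "Denmark", "Mexico", "France"]

animal_encoded = encode_by_first_appearance(animal)
country_encoded = encode_by_first_appearance(country)

print(animal_encoded)   # [0, 1, 2, 3]
print(country_encoded)  # [0, 1, 0, 2]
```

Both occurrences of Mexico receive the same identifier, which is the property the question asks for.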

Answer 1

Score: 1

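I think I've found a solution, based on the answer to this question: take the dense rank of each column. Equal strings receive the same rank, so the ranks can serve as identifiers. Note that they start at 1 and follow the sorted order of the values, not their order of appearance.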

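import polars as pl

df = pl.DataFrame({
    "animal": ['elephant', 'dog', 'cat', 'mouse'],
    "country": ['Mexico', 'Denmark', 'Mexico', 'France'],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})

# Dense rank: identical strings share a rank, and no rank is skipped.
# Older Polars releases spell .name.suffix(...) as .suffix(...).
df.with_columns([
    pl.col(i).rank('dense').cast(pl.Int64).name.suffix('_encoded')
    for i in ['animal', 'country']
])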

shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal   ┆ country ┆ cost   ┆ animal_encoded ┆ country_encoded │
│ ---      ┆ ---     ┆ ---    ┆ ---            ┆ ---             │
│ str      ┆ str     ┆ f64    ┆ i64            ┆ i64             │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico  ┆ 1000.0 ┆ 3              ┆ 3               │
│ dog      ┆ Denmark ┆ 20.0   ┆ 2              ┆ 1               │
│ cat      ┆ Mexico  ┆ 10.0   ┆ 1              ┆ 3               │
│ mouse    ┆ France  ┆ 120.0  ┆ 4              ┆ 2               │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘
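To see why the identifiers above come out as 3, 2, 1, 4 rather than 0, 1, 2, 3, here is a minimal plain-Python sketch of what a dense rank computes. It is not how Polars implements rank; the helper name dense_rank is invented for illustration, and it assumes plain ASCII strings so that Python's ordering matches the ordering seen in the output above.

```python
def dense_rank(values):
    """Return 1-based dense ranks: equal values share a rank and no rank is skipped."""
    # Sort the distinct values, then number them 1, 2, 3, ...
    rank_of = {value: rank for rank, value in enumerate(sorted(set(values)), start=1)}
    return [rank_of[value] for value in values]


animal = ["elephant", "dog", "cat", "mouse"]
country = ["Mexico", "Denmark", "Mexico", "France"]

animal_ranks = dense_rank(animal)
country_ranks = dense_rank(country)

print(animal_ranks)   # [3, 2, 1, 4]
print(country_ranks)  # [3, 1, 3, 2]

# Subtracting 1 gives zero-based identifiers, still in sorted order.
country_zero_based = [rank - 1 for rank in country_ranks]
print(country_zero_based)  # [2, 0, 2, 1]
```

The identifiers are valid encodings either way; they differ from the table in the question only because they follow alphabetical order instead of order of appearance.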


huangapple
  • Posted on June 19, 2023, 19:24:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76506149.html