How to encode a string column into integers in polars?
Question
I would like a simple way to "encode" the values of a given column (strings, for instance) as arbitrary integer identifiers.
import polars as pl

df = pl.DataFrame({
    "animal": ['elephant', 'dog', 'cat', 'mouse'],
    "country": ['Mexico', 'Denmark', 'Mexico', 'France'],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})
print(df)
shape: (4, 3)
┌──────────┬─────────┬────────┐
│ animal ┆ country ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════╪═════════╪════════╡
│ elephant ┆ Mexico ┆ 1000.0 │
│ dog ┆ Denmark ┆ 20.0 │
│ cat ┆ Mexico ┆ 10.0 │
│ mouse ┆ France ┆ 120.0 │
└──────────┴─────────┴────────┘
I would like to encode the animal and country columns to get something like:
shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal ┆ country ┆ cost ┆ animal_encoded ┆ country_encoded │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ i64 │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico ┆ 1000.0 ┆ 0 ┆ 0 │
│ dog ┆ Denmark ┆ 20.0 ┆ 1 ┆ 1 │
│ cat ┆ Mexico ┆ 10.0 ┆ 2 ┆ 0 │
│ mouse ┆ France ┆ 120.0 ┆ 3 ┆ 2 │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘
I thought that doing some sort of row indexing from a unique'd context and then using over to expand back to the original number of rows could work, but I can't manage to implement it.
Answer 1
Score: 1
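I think I've found a solution, based on the answer from this question: take a dense rank of each column and cast it to Int64. Note that the resulting identifiers start at 1 and follow the sorted order of the values, not their order of appearance.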
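import polars as pl

df = pl.DataFrame({
    "animal": ['elephant', 'dog', 'cat', 'mouse'],
    "country": ['Mexico', 'Denmark', 'Mexico', 'France'],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})

# Dense rank gives each distinct value one integer (1-based, in sorted order).
# alias() names the new column; Expr.suffix is deprecated in newer polars releases.
df.with_columns([
    pl.col(c).rank('dense').cast(pl.Int64).alias(f'{c}_encoded')
    for c in ['animal', 'country']
])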
shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal ┆ country ┆ cost ┆ animal_encoded ┆ country_encoded │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ i64 │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico ┆ 1000.0 ┆ 3 ┆ 3 │
│ dog ┆ Denmark ┆ 20.0 ┆ 2 ┆ 1 │
│ cat ┆ Mexico ┆ 10.0 ┆ 1 ┆ 3 │
│ mouse ┆ France ┆ 120.0 ┆ 4 ┆ 2 │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘
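The identifiers above are 1-based and follow sorted order, while the output requested in the question is 0-based and numbered by first appearance. If that exact numbering matters, one possible sketch is to build the mapping in plain Python. The helper name encode_first_seen is hypothetical (it is not part of polars), and this loops in Python, so it is only meant for columns of modest size.

```python
def encode_first_seen(values):
    """Map each distinct value to an integer, numbered by first appearance."""
    codes = {}  # value -> integer identifier
    # len(codes) is evaluated before the lookup, so an unseen value
    # receives the next free identifier; a seen value keeps its old one.
    return [codes.setdefault(value, len(codes)) for value in values]


# Column contents copied from the question's DataFrame.
animal = ["elephant", "dog", "cat", "mouse"]
country = ["Mexico", "Denmark", "Mexico", "France"]

animal_encoded = encode_first_seen(animal)    # [0, 1, 2, 3]
country_encoded = encode_first_seen(country)  # [0, 1, 0, 2]

# Assuming polars is imported as pl and df is the frame from the question,
# the lists could then be attached as new columns:
# df.with_columns([
#     pl.Series("animal_encoded", animal_encoded),
#     pl.Series("country_encoded", country_encoded),
# ])
```

With a real frame, the input lists would come from df["animal"].to_list() and df["country"].to_list() rather than being typed out.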