Question

Is there a simple way to encode the values of a given column (strings, for instance) as arbitrary integer identifiers?

import polars as pl

df = pl.DataFrame({
    "animal": ["elephant", "dog", "cat", "mouse"],
    "country": ["Mexico", "Denmark", "Mexico", "France"],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})
print(df)
shape: (4, 3)
┌──────────┬─────────┬────────┐
│ animal   ┆ country ┆ cost   │
│ ---      ┆ ---     ┆ ---    │
│ str      ┆ str     ┆ f64    │
╞══════════╪═════════╪════════╡
│ elephant ┆ Mexico  ┆ 1000.0 │
│ dog      ┆ Denmark ┆ 20.0   │
│ cat      ┆ Mexico  ┆ 10.0   │
│ mouse    ┆ France  ┆ 120.0  │
└──────────┴─────────┴────────┘

I would like to encode the animal and country columns to get something like this:

shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal   ┆ country ┆ cost   ┆ animal_encoded ┆ country_encoded │
│ ---      ┆ ---     ┆ ---    ┆ ---            ┆ ---             │
│ str      ┆ str     ┆ f64    ┆ i64            ┆ i64             │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico  ┆ 1000.0 ┆ 0              ┆ 0               │
│ dog      ┆ Denmark ┆ 20.0   ┆ 1              ┆ 1               │
│ cat      ┆ Mexico  ┆ 10.0   ┆ 2              ┆ 0               │
│ mouse    ┆ France  ┆ 120.0  ┆ 3              ┆ 2               │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘

I thought that building a row index over the unique values and then using over to expand it back to the original number of rows might work, but I can't manage to implement it.

Answer 1

Score: 1

I think I've found a solution, based on the answer from this question.

import polars as pl

df = pl.DataFrame({
    "animal": ["elephant", "dog", "cat", "mouse"],
    "country": ["Mexico", "Denmark", "Mexico", "France"],
    "cost": [1000.0, 20.0, 10.0, 120.0],
})

# Dense rank gives each distinct value a 1-based integer in sorted order.
# Older Polars releases (before 0.19.12) use .suffix(...) instead of .name.suffix(...).
df.with_columns([
    pl.col(c).rank("dense").cast(pl.Int64).name.suffix("_encoded")
    for c in ["animal", "country"]
])

shape: (4, 5)
┌──────────┬─────────┬────────┬────────────────┬─────────────────┐
│ animal   ┆ country ┆ cost   ┆ animal_encoded ┆ country_encoded │
│ ---      ┆ ---     ┆ ---    ┆ ---            ┆ ---             │
│ str      ┆ str     ┆ f64    ┆ i64            ┆ i64             │
╞══════════╪═════════╪════════╪════════════════╪═════════════════╡
│ elephant ┆ Mexico  ┆ 1000.0 ┆ 3              ┆ 3               │
│ dog      ┆ Denmark ┆ 20.0   ┆ 2              ┆ 1               │
│ cat      ┆ Mexico  ┆ 10.0   ┆ 1              ┆ 3               │
│ mouse    ┆ France  ┆ 120.0  ┆ 4              ┆ 2               │
└──────────┴─────────┴────────┴────────────────┴─────────────────┘
