Selecting columns based on a condition in Polars

Question

I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have fewer than 100 unique values. Naively, I tried:

df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))

This raised an error, probably because of the second part of the expression, so I tried that part on its own:

df.select(pl.all().n_unique() < 100)

This doesn't select columns; instead, it returns a single-row DataFrame of boolean values. I'm new to Polars and still can't quite wrap my head around the expression API. What am I doing wrong?

Answer 1

Score: 6

It's helpful if you include an example to save others from having to create one.

import polars as pl

# Three string columns with 4, 2 and 2 distinct values, plus one integer column.
df = pl.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["A", "A", "C", "A"],
    "col3": ["A", "B", "A", "B"],
    "col4": [1, 2, 3, 4],
})

You are selecting the string columns with pl.col(pl.Utf8)

>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ A    ┆ A    ┆ A    │
│ B    ┆ A    ┆ B    │
│ C    ┆ C    ┆ A    │
│ D    ┆ A    ┆ B    │
└──────┴──────┴──────┘

You can chain .n_unique() to the pl.col() to run it just on those columns.

>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1  ┆ col2 ┆ col3 │
│ ---   ┆ ---  ┆ ---  │
│ bool  ┆ bool ┆ bool │
╞═══════╪══════╪══════╡
│ false ┆ true ┆ true │
└───────┴──────┴──────┘

You can loop over this result and extract the .name for each true column.

There is no .is_true(), but because each column of the result holds a single boolean, .all() is equivalent.

>>> [col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all()]
['col2', 'col3']

You can then select just those columns:

>>> df.select(
...     col.name for col in
...     df.select(pl.col(pl.Utf8).n_unique() < 3)
...     if col.all()
... )
shape: (4, 2)
┌──────┬──────┐
│ col2 ┆ col3 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ A    ┆ A    │
│ A    ┆ B    │
│ C    ┆ A    │
│ A    ┆ B    │
└──────┴──────┘

Answer 2

Score: 1

You could get the column names by doing a melt followed by a groupby, but I'm not sure how to turn this into a single expression.

import polars as pl

df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))
    .melt()  # long format: one row per (variable, value) pair
    .groupby("variable")
    .agg(pl.col("value").n_unique())
    # Keeps columns with at least 3 distinct values; use `<` for a "fewer than" condition.
    .filter(pl.col("value") >= 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)

Posted on 2023-02-14 19:01:50
Link to this post: https://go.coder-hub.com/75446886.html