2023-02-14 19:01:50

Selecting columns based on a condition in Polars

Question

I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have fewer than 100 unique values. Naively, I tried:

df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))

which gave me an error, probably because of the second part of the expression.

df.select(pl.all().n_unique() < 100)

This doesn't select columns but instead returns a single-row DataFrame of bool values. I'm new to Polars and still can't quite wrap my head around the expression API. What am I doing wrong?

Answer 1

Score: 6

It's helpful if you include an example to save others from having to create one.

import polars as pl

df = pl.DataFrame({
   "col1": ["A", "B", "C", "D"],
   "col2": ["A", "A", "C", "A"],
   "col3": ["A", "B", "A", "B"],
   "col4": [1, 2, 3, 4],
})

You are selecting the string columns with `pl.col(pl.Utf8)`:

>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ A    ┆ A    ┆ A    │
│ B    ┆ A    ┆ B    │
│ C    ┆ C    ┆ A    │
│ D    ┆ A    ┆ B    │
└──────┴──────┴──────┘

You can chain `.n_unique()` to the `pl.col()` to run it just on those columns:

>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1  ┆ col2 ┆ col3 │
│ ---   ┆ ---  ┆ ---  │
│ bool  ┆ bool ┆ bool │
╞═══════╪══════╪══════╡
│ false ┆ true ┆ true │
└───────┴──────┴──────┘

Iterating over this one-row DataFrame yields one boolean Series per column, so you can loop over the result and extract the `.name` of each column whose value is `true`.

There is no `.is_true()`, but `.all()` is equivalent here because each Series holds a single value.

>>> [ col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all() ]
['col2', 'col3']

You can then select just those columns:

df.select([
   col.name for col in
   df.select(pl.col(pl.Utf8).n_unique() < 3)
   if col.all()
])

shape: (4, 2)
┌──────┬──────┐
│ col2 ┆ col3 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ A    ┆ A    │
│ A    ┆ B    │
│ C    ┆ A    │
│ A    ┆ B    │
└──────┴──────┘

Answer 2

Score: 1

You could get the names of the columns by doing a melt followed by a groupby, but I'm not too sure how to turn this into an expression.

import polars as pl

df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))        # keep only the string columns
    .melt()                           # long format: "variable" (column name), "value"
    .groupby("variable")
    .agg(pl.col("value").n_unique())  # unique count per original column
    # condition on the unique count; the question's case would be `< 100`
    .filter(pl.col("value") >= 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)
