Selecting columns based on a condition in Polars
Question
I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have fewer than 100 unique values. Naively, I tried:
df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))
which gave me an error, probably because of the second part of the expression.
df.select(pl.all().n_unique() < 100)
This doesn't select columns but instead returns a single-row DataFrame of bool values. I'm new to Polars and still can't quite wrap my head around the expression API. What am I doing wrong?
Answer 1
Score: 6
It's helpful if you include an example to save others from having to create one.
import polars as pl

df = pl.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["A", "A", "C", "A"],
    "col3": ["A", "B", "A", "B"],
    "col4": [1, 2, 3, 4],
})
You are selecting the string columns with pl.col(pl.Utf8):
>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ A    ┆ A    ┆ A    │
│ B    ┆ A    ┆ B    │
│ C    ┆ C    ┆ A    │
│ D    ┆ A    ┆ B    │
└──────┴──────┴──────┘
You can chain .n_unique() to the pl.col() to run it just on those columns.
>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1  ┆ col2 ┆ col3 │
│ ---   ┆ ---  ┆ ---  │
│ bool  ┆ bool ┆ bool │
╞═══════╪══════╪══════╡
│ false ┆ true ┆ true │
└───────┴──────┴──────┘
You can loop over this result and extract the .name for each true column.
There is no .is_true() method, but since each column holds a single value here, .all() is equivalent.
>>> [col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all()]
['col2', 'col3']
You can then select just those columns:
>>> df.select(
...     col.name for col in
...     df.select(pl.col(pl.Utf8).n_unique() < 3)
...     if col.all()
... )
shape: (4, 2)
┌──────┬──────┐
│ col2 ┆ col3 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ A    ┆ A    │
│ A    ┆ B    │
│ C    ┆ A    │
│ A    ┆ B    │
└──────┴──────┘
Answer 2
Score: 1
You could get the names of the columns by doing a melt followed by a groupby, but I'm not too sure how to turn this into an expression.
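import polars as pl

df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))
    .melt()  # long format: one row per (variable, value); newer Polars calls this unpivot
    .groupby("variable")  # newer Polars spells this group_by
    .agg(pl.col("value").n_unique())  # distinct values per original column
    # keeps columns with at least 3 unique values; use < for the "fewer than" case in the question
    .filter(pl.col("value") >= 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)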