March 31, 2023, 22:00:01

Multi filter by 2 columns and display some best results with Polars

Question

df = df.filter((pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])) |
               (pl.col("cid3") == pl.col("cid3").nlargest(2).over(["cid1", "cid2"])))

I have a DataFrame df with three main columns, cid1, cid2, and cid3, plus more columns cid4, cid5, etc. cid1 and cid2 are int; the other columns are float.

┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6     │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0      │
└──────┴──────┴──────┴──────┴──────┴──────────┘

Each combination of cid1 and cid2 is a workset for analysis, and each workset has several cid3 values.

I can get a df with only the maximal value of cid3 per workset:

df = df.filter(pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"]))

And this displays:

┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6     │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0      │
└──────┴──────┴──────┴──────┴──────┴──────────┘

But I can't figure out how to take the two largest values of cid3 for each workset to get this result:

┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6     │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0      │
└──────┴──────┴──────┴──────┴──────┴──────────┘

Two maximal values of cid3 is just an example; for my task I might need the 10 largest values, the 5 smallest, and so on. Help me please!

Answer 1

Score: 1

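You can use [`.top_k()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.top_k.html#polars.Expr.top_k) to get the `k` largest values (`.bottom_k()`, covered further down, gives the smallest).

`.unique().top_k()` can be used if you need distinct values.

# Note: `groupby` was renamed to `group_by` in later Polars releases.
df.groupby("cid1", "cid2").agg(pl.col("cid3").top_k(2))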

shape: (2, 3)
┌──────┬──────┬────────────┐
│ cid1 ┆ cid2 ┆ cid3       │
│ ---  ┆ ---  ┆ ---        │
│ i64  ┆ i64  ┆ list[f64]  │
╞══════╪══════╪════════════╡
│ 1    ┆ 5    ┆ [9.0, 2.0] │
│ 3    ┆ 7    ┆ [8.0, 3.0] │
└──────┴──────┴────────────┘

`.top_k()` can also be used inside `.filter`, combined with `.is_in`:

df.filter(
   pl.col("cid3").is_in(pl.col("cid3").top_k(2))
     .over("cid1", "cid2")
)

shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

Update: `.bottom_k` has been added and will be in the next release. Use it to find the minimal values; it replaces the earlier `descending=True` suggestion.

df.filter(
   pl.col("cid3").is_in(pl.col("cid3").bottom_k(2))
     .over("cid1", "cid2")
)

shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0  │
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

DataFrame used:

import polars as pl

df = pl.read_csv(b"""
cid1,cid2,cid3,cid4,cid5,cid6
1,5,1.0,4.0,4.0,1.0
1,5,2.0,5.0,5.0,9.0
1,5,9.0,6.0,4.0,9.0
3,7,1.0,7.0,9.0,1.0
3,7,3.0,7.0,9.0,1.0
3,7,8.0,8.0,3.0,1.0
""")

Answer 2

Score: 1

Here is one more possibility in case you want to get maximum or minimum values.

Getting the 2 largest values:

df.filter(
    pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=True).head(2))
    .over(["cid1", "cid2"])
)
# Result
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 1    ┆ 5    ┆ 9.0  ┆ 6.0  ┆ 4.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 8.0  ┆ 8.0  ┆ 3.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘

Getting the 2 smallest values:

df.filter(
    pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=False).head(2))
    .over(["cid1", "cid2"])
)
# Result
shape: (4, 6)
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ cid6 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 5    ┆ 1.0  ┆ 4.0  ┆ 4.0  ┆ 1.0  │
│ 1    ┆ 5    ┆ 2.0  ┆ 5.0  ┆ 5.0  ┆ 9.0  │
│ 3    ┆ 7    ┆ 1.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
│ 3    ┆ 7    ┆ 3.0  ┆ 7.0  ┆ 9.0  ┆ 1.0  │
└──────┴──────┴──────┴──────┴──────┴──────┘
