多列按2列筛选并显示一些最佳结果,使用 Polars。

huangapple go评论97阅读模式
英文:

Multi filter by 2 columns and display some best results with Polars

问题

  1. df = df.filter((pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])) |
  2. (pl.col("cid3") == pl.col("cid3").nlargest(2).over(["cid1", "cid2"])))
英文:

I have df for my work with 3 main columns: cid1, cid2, cid3, and more columns cid4, cid5, etc. cid1 and cid2 is int, another columns is float.

  1. ┌──────┬──────┬──────┬──────┬──────┬──────────┐
  2. cid1 cid2 cid3 cid4 cid5 cid6
  3. ╞══════╪══════╪══════╪══════╪══════╪══════════╡
  4. 1 5 1.0 4.0 4.0 1.0
  5. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  6. 1 5 2.0 5.0 5.0 9.0
  7. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  8. 1 5 9.0 6.0 4.0 9.0
  9. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  10. 3 7 1.0 7.0 9.0 1.0
  11. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  12. | 3 7 3.0 7.0 9.0 1.0
  13. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  14. 3 7 8.0 8.0 3.0 1.0
  15. └──────┴──────┴──────┴──────┴──────┴──────────┘

Each combitations of cid1 and cid2 is a workset for analysis and for each workset I have some values cid3.

I can take df with only maximal values of cid3:

  1. df = df.filter(pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"]))

And this display:

  1. ┌──────┬──────┬──────┬──────┬──────┬──────────┐
  2. cid1 cid2 cid3 cid4 cid5 cid6
  3. ╞══════╪══════╪══════╪══════╪══════╪══════════╡
  4. 1 5 9.0 6.0 4.0 9.0
  5. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  6. 3 7 8.0 8.0 3.0 1.0
  7. └──────┴──────┴──────┴──────┴──────┴──────────┘

But I can't catch how I can take two maximal values of cid3 for each workset for this result:

  1. ┌──────┬──────┬──────┬──────┬──────┬──────────┐
  2. cid1 cid2 cid3 cid4 cid5 cid6
  3. ╞══════╪══════╪══════╪══════╪══════╪══════════╡
  4. 1 5 2.0 5.0 5.0 9.0
  5. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  6. 1 5 9.0 6.0 4.0 9.0
  7. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  8. | 3 7 3.0 7.0 9.0 1.0
  9. ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
  10. 3 7 8.0 8.0 3.0 1.0
  11. └──────┴──────┴──────┴──────┴──────┴──────────┘

Two maximal values of cid3 is for example, for my task I can take 10 maximal values, 5 minimal and so that. Help me please!

答案1

得分: 1

  1. 可以使用 [`.top_k()`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.top_k.html#polars.Expr.top_k) 获取 `k` 个最大(或最小)值。
  2. 如果需要唯一值,可以使用 `.unique().top_k()`
  1. df.groupby("cid1", "cid2").agg(pl.col("cid3").top_k(2))
  1. 形状:(2, 3)
  2. ┌──────┬──────┬────────────┐
  3. cid1 cid2 cid3
  4. --- --- ---
  5. i64 i64 list[f64]
  6. ╞══════╪══════╪════════════╡
  7. 1 5 [9.0, 2.0]
  8. 3 7 [8.0, 3.0]
  9. └──────┴──────┴────────────┘
  1. 可以在 `.filter` 中与 `.is_in` 结合使用。
  1. df.filter(
  2. pl.col("cid3").is_in(pl.col("cid3").top_k(2))
  3. .over("cid1", "cid2")
  4. )
  1. 形状:(4, 6)
  2. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  3. cid1 cid2 cid3 cid4 cid5 cid6
  4. --- --- --- --- --- ---
  5. i64 i64 f64 f64 f64 f64
  6. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  7. 1 5 2.0 5.0 5.0 9.0
  8. 1 5 9.0 6.0 4.0 9.0
  9. 3 7 3.0 7.0 9.0 1.0
  10. 3 7 8.0 8.0 3.0 1.0
  11. └──────┴──────┴──────┴──────┴──────┴──────┘

<strike>使用 descending=True 可以找到最小值 (bottom_k)</strike>

更新: .bottom_k 已添加 ,将在下一版本中发布。

  1. df.filter(
  2. pl.col(&quot;cid3&quot;).is_in(pl.col(&quot;cid3&quot;).bottom_k(2)
  3. .over(&quot;cid1&quot;, &quot;cid2&quot;)
  4. )
  1. 形状:(4, 6)
  2. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  3. cid1 cid2 cid3 cid4 cid5 cid6
  4. --- --- --- --- --- ---
  5. i64 i64 f64 f64 f64 f64
  6. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  7. 1 5 1.0 4.0 4.0 1.0
  8. 1 5 2.0 5.0 5.0 9.0
  9. 3 7 1.0 7.0 9.0 1.0
  10. 3 7 3.0 7.0 9.0 1.0
  11. └──────┴──────┴──────┴──────┴──────┴──────┘

数据框使用:

  1. df = pl.read_csv(b&quot;&quot;&quot;
  2. cid1,cid2,cid3,cid4,cid5,cid6
  3. 1,5,1.0,4.0,4.0,1.0
  4. 1,5,2.0,5.0,5.0,9.0
  5. 1,5,9.0,6.0,4.0,9.0
  6. 3,7,1.0,7.0,9.0,1.0
  7. 3,7,3.0,7.0,9.0,1.0
  8. 3,7,8.0,8.0,3.0,1.0
  9. &quot;&quot;&quot;)
英文:

You can use .top_k() to get the k largest (or smallest) values.

.unique().top_k() can be used if you need distinct values.

  1. df.groupby(&quot;cid1&quot;, &quot;cid2&quot;).agg(pl.col(&quot;cid3&quot;).top_k(2))
  1. shape: (2, 3)
  2. ┌──────┬──────┬────────────┐
  3. cid1 cid2 cid3
  4. --- --- ---
  5. i64 i64 list[f64]
  6. ╞══════╪══════╪════════════╡
  7. 1 5 [9.0, 2.0]
  8. 3 7 [8.0, 3.0]
  9. └──────┴──────┴────────────┘

This can be used inside .filter combined with .is_in

  1. df.filter(
  2. pl.col(&quot;cid3&quot;).is_in(pl.col(&quot;cid3&quot;).top_k(2))
  3. .over(&quot;cid1&quot;, &quot;cid2&quot;)
  4. )
  1. shape: (4, 6)
  2. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  3. cid1 cid2 cid3 cid4 cid5 cid6
  4. --- --- --- --- --- ---
  5. i64 i64 f64 f64 f64 f64
  6. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  7. 1 5 2.0 5.0 5.0 9.0
  8. 1 5 9.0 6.0 4.0 9.0
  9. 3 7 3.0 7.0 9.0 1.0
  10. 3 7 8.0 8.0 3.0 1.0
  11. └──────┴──────┴──────┴──────┴──────┴──────┘

<strike>descending=True to find the minimal values (bottom_k)</strike>

Update: .bottom_k has been added and will be in the next release.

  1. df.filter(
  2. pl.col(&quot;cid3&quot;).is_in(pl.col(&quot;cid3&quot;).bottom_k(2)
  3. .over(&quot;cid1&quot;, &quot;cid2&quot;)
  4. )
  1. shape: (4, 6)
  2. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  3. cid1 cid2 cid3 cid4 cid5 cid6
  4. --- --- --- --- --- ---
  5. i64 i64 f64 f64 f64 f64
  6. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  7. 1 5 1.0 4.0 4.0 1.0
  8. 1 5 2.0 5.0 5.0 9.0
  9. 3 7 1.0 7.0 9.0 1.0
  10. 3 7 3.0 7.0 9.0 1.0
  11. └──────┴──────┴──────┴──────┴──────┴──────┘

Dataframe used:

  1. df = pl.read_csv(b&quot;&quot;&quot;
  2. cid1,cid2,cid3,cid4,cid5,cid6
  3. 1,5,1.0,4.0,4.0,1.0
  4. 1,5,2.0,5.0,5.0,9.0
  5. 1,5,9.0,6.0,4.0,9.0
  6. 3,7,1.0,7.0,9.0,1.0
  7. 3,7,3.0,7.0,9.0,1.0
  8. 3,7,8.0,8.0,3.0,1.0
  9. &quot;&quot;&quot;)

答案2

得分: 1

获取两个最大值

  1. df.filter(
  2. pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=True).head(2))
  3. .over(["cid1", "cid2"])
  4. )
  5. # 结果
  6. shape: (4, 6)
  7. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  8. cid1 cid2 cid3 cid4 cid5 cid6
  9. --- --- --- --- --- ---
  10. i64 i64 f64 f64 f64 f64
  11. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  12. 1 5 2.0 5.0 5.0 9.0
  13. 1 5 9.0 6.0 4.0 9.0
  14. 3 7 3.0 7.0 9.0 1.0
  15. 3 7 8.0 8.0 3.0 1.0
  16. └──────┴──────┴──────┴──────┴──────┴──────┘

获取两个最小值

  1. (df.filter(
  2. pl.col("cid3").is_in(pl.col("cid3").unique().sort(descending=False).head(2))
  3. .over(["cid1", "cid2"])
  4. )
  5. # 结果
  6. shape: (4, 6)
  7. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  8. cid1 cid2 cid3 cid4 cid5 cid6
  9. --- --- --- --- --- ---
  10. i64 i64 f64 f64 f64 f64
  11. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  12. 1 5 1.0 4.0 4.0 1.0
  13. 1 5 2.0 5.0 5.0 9.0
  14. 3 7 1.0 7.0 9.0 1.0
  15. 3 7 3.0 7.0 9.0 1.0
  16. └──────┴──────┴──────┴──────┴──────┴──────┘
英文:

Here is one more possibility in case you want to get maximum or minimum values

Getting 2 largest values

  1. df.filter(
  2. pl.col(&quot;cid3&quot;).is_in(pl.col(&quot;cid3&quot;).unique().sort(descending=True).head(2))
  3. .over([&quot;cid1&quot;, &quot;cid2&quot;])
  4. )
  5. # Result
  6. shape: (4, 6)
  7. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  8. cid1 cid2 cid3 cid4 cid5 cid6
  9. --- --- --- --- --- ---
  10. i64 i64 f64 f64 f64 f64
  11. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  12. 1 5 2.0 5.0 5.0 9.0
  13. 1 5 9.0 6.0 4.0 9.0
  14. 3 7 3.0 7.0 9.0 1.0
  15. 3 7 8.0 8.0 3.0 1.0
  16. └──────┴──────┴──────┴──────┴──────┴──────┘

Getting 2 smallest values

  1. (df.filter(
  2. pl.col(&quot;cid3&quot;).is_in(pl.col(&quot;cid3&quot;).unique().sort(descending=False).head(2))
  3. .over([&quot;cid1&quot;, &quot;cid2&quot;])
  4. )
  5. # Result
  6. shape: (4, 6)
  7. ┌──────┬──────┬──────┬──────┬──────┬──────┐
  8. cid1 cid2 cid3 cid4 cid5 cid6
  9. --- --- --- --- --- ---
  10. i64 i64 f64 f64 f64 f64
  11. ╞══════╪══════╪══════╪══════╪══════╪══════╡
  12. 1 5 1.0 4.0 4.0 1.0
  13. 1 5 2.0 5.0 5.0 9.0
  14. 3 7 1.0 7.0 9.0 1.0
  15. 3 7 3.0 7.0 9.0 1.0
  16. └──────┴──────┴──────┴──────┴──────┴──────┘

huangapple
  • 本文由 发表于 2023年3月31日 22:00:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/75899402.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定