Detect rows where uniqueness is not given in polars

Question

Currently I have the following problem: I have to report an infringement whenever the combination of the column values ID, table and value_a is not unique.

    import polars as pl

    df = pl.DataFrame(
        {
            "ID": ["1", "1", "1", "1", "1"],
            "column": ["foo", "foo", "bar", "ham", "egg"],
            "table": ["A", "A", "C", "D", "E"],
            "value_a": ["tree", "tree", None, "bean", None],
            "value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish"],
            "mandatory": ["M", "M", "M", "CM", "M"],
        }
    )
    print(df)

    shape: (5, 6)
    ┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
    │ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
    │ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       │
    │ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       │
    ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         │
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         │
    │ 1   ┆ bar    ┆ C     ┆ null    ┆ Dal     ┆ M         │
    │ 1   ┆ ham    ┆ D     ┆ bean    ┆ Curry   ┆ CM        │
    │ 1   ┆ egg    ┆ E     ┆ null    ┆ Dish    ┆ M         │
    └─────┴────────┴───────┴─────────┴─────────┴───────────┘

For this df, an infringement report should be created with the following output:

    shape: (2, 8)
    ┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
    │ index ┆ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning                 │
    │ ---   ┆ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       ┆ ---                     │
    │ i64   ┆ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       ┆ str                     │
    ╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
    │ 0     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         ┆ Row value is not unique │
    │ 1     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         ┆ Row value is not unique │
    └───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘

The report should contain an index and a warning column. I used this line of code to identify if there are any null values in a row:

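    # Note: newer Polars releases rename these calls
    # (with_row_count -> with_row_index, pl.any -> pl.any_horizontal).
    report = (
        df.with_row_count("index")
        .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
        .with_columns(pl.lit("Missing value detected").alias("warning"))
    )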

How do I need to adapt this code so that it detects missing values on the one hand and identifies non-unique rows on the other? Maybe I could create two reports and use .vstack() to combine them into a final one. How would you solve it?

Answer 1

Score: 3

You can create a struct from the three columns and call .is_duplicated() on it:

    df.with_columns(
        warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
    )

    shape: (5, 7)
    ┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
    │ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
    │ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       ┆ ---     │
    │ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       ┆ bool    │
    ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         ┆ true    │
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         ┆ true    │
    │ 1   ┆ bar    ┆ C     ┆ null    ┆ Dal     ┆ M         ┆ false   │
    │ 1   ┆ ham    ┆ D     ┆ bean    ┆ Curry   ┆ CM        ┆ false   │
    │ 1   ┆ egg    ┆ E     ┆ null    ┆ Dish    ┆ M         ┆ false   │
    └─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘

huangapple
  • Posted on March 9, 2023, 21:01:30
  • Please keep this link when reposting: https://go.coder-hub.com/75684994.html