在 Polars 中基于条件检测缺失值。

huangapple go评论92阅读模式
英文:

Detect missing values in polars based on conditions

问题

我现在正在识别具有缺失值的行并将它们过滤到报告中。我正在添加一个索引和一个"warning"列。

  1. report = (df.with_row_count("index")
  2. .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
  3. .with_columns(pl.lit("Missing value detected").alias("warning"))
  4. )

当前的要求是,如果在"mandatory"列中出现值"M"并且在同一行中出现缺失值,那么这一行将被添加到报告中。现在有一个新的附加要求。我将根据示例数据集来解释它。在"column"列中可能出现相同的值,比如"foo"。根据第一条规则,第2行应该被写入报告。然而,现在应该有一个异常条件来阻止这种情况。如果已经有一行在"column"列中没有零值,那么其他相同的值在进一步考虑中将不被考虑。在这种情况下,第1行满足这个条件,所以第2行不应该被考虑。

  1. df = pl.DataFrame(
  2. {
  3. "ID": ["1", "1", "1", "1", "1"],
  4. "column": ["foo", "foo", "bar", "ham", "egg"],
  5. "table": ["A", "B", "C", "D", "E"],
  6. "value_a": ["tree", None, None, "bean", None,],
  7. "value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
  8. "mandatory": ["M", "M", "M", "CM", "M"],
  9. }
  10. )

期望的报告应该如下所示:

  1. shape: (2, 8)
  2. ┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬────────────────────────┐
  3. Index ID column table value_a value_b mandatory warning
  4. --- --- --- --- --- --- --- ---
  5. i64 str str str f64 str str str
  6. ╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪════════════════════════╡
  7. 1 1 bar C null Dal M Missing value detected
  8. 2 1 egg E null Dish M Missing value detected
  9. └───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴────────────────────────┘

如何实现附加条件来正确识别具有缺失值的行呢?

要实现附加条件,您可以使用以下代码:

  1. report = (df.with_row_count("index")
  2. .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M") & ~pl.col("column").is_not_null())
  3. .with_columns(pl.lit("Missing value detected").alias("warning"))
  4. )

这里的关键是在过滤条件中添加了& ~pl.col("column").is_not_null(),它表示如果在同一组中存在至少一个具有非空值的"column",则不包括该组。这将满足您的附加要求,确保在已经有一行没有零值的情况下,其他相同值的行不被考虑。

英文:

I am currently identifying rows with missing values and filtering them into a report. I am adding an index and a "warning" column.

  1. report = (df.with_row_count("index")
  2. .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
  3. .with_columns(pl.lit("Missing value detected").alias("warning"))
  4. )

The current requirement is that if the value "M" occurs in column "mandatory" and a missing value occurs in the same row, this row is added to the report. Now there is a new additional requirement. I will explain it on the basis of the example dataset. It can be that in the column "column" can be the same values. In this case "foo". According to the first rule, row 2 should be written into the report. However, an exception condition should now prevent this. If already one row has no zeros in the column "column", all other equal values are not considered in the further consideration. In this case, row 1 fulfills this condition, so row two should not be considered.

  1. df = pl.DataFrame(
  2. {
  3. "ID": ["1", "1", "1", "1", "1"],
  4. "column": ["foo", "foo", "bar", "ham", "egg"],
  5. "table": ["A", "B", "C", "D", "E"],
  6. "value_a": ["tree", None, None, "bean", None,],
  7. "value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
  8. "mandatory": ["M", "M", "M", "CM", "M"],
  9. }
  10. )
  11. print(df)
  12. shape: (5, 6)
  13. ┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
  14. ID column table value_a value_b mandatory
  15. --- --- --- --- --- ---
  16. str str str str str str
  17. ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
  18. 1 foo A tree Lorem M
  19. 1 foo B null Ipsum M
  20. 1 bar C null Dal M
  21. 1 ham D bean Curry CM
  22. 1 egg E null Dish M
  23. └─────┴────────┴───────┴─────────┴─────────┴───────────┘

The expected report should look like this

  1. shape: (2, 8)
  2. ┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬────────────────────────┐
  3. Index ID column table value_a value_b mandatory warning
  4. --- --- --- --- --- --- --- ---
  5. i64 str str str f64 str str str
  6. ╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪════════════════════════╡
  7. 1 1 bar C null Dal M Missing value detected
  8. 2 1 egg E null Dish M Missing value detected
  9. └───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴────────────────────────┘

How can I implement the additional condition to identfy the rows with missing values correctly?

答案1

得分: 4

这是一个定义附加规则的方式,可以与其他规则组合使用:

  1. df.with_columns(~pl.all(pl.col("column").is_not_null()).any().over("column"))
  1. shape: (5, 7)
  2. ┌─────┬────────┬───────┬─────────┬─────────┬───────────┬───────┐
  3. ID column table value_a value_b mandatory all
  4. --- --- --- --- --- --- bool
  5. ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═══════╡
  6. 1 foo A tree Lorem M false
  7. 1 foo B null Ipsum M false
  8. 1 bar C null Dal M true
  9. 1 ham D bean Curry CM false
  10. 1 egg E null Dish M true
  11. └─────┴────────┴───────┴─────────┴─────────┴───────────┴───────┘
英文:

Here's one way to define the additional rule which could be combined with the others:

  1. df.with_columns(~pl.all(pl.col("*").is_not_null()).any().over("column"))
  1. shape: (5, 7)
  2. ┌─────┬────────┬───────┬─────────┬─────────┬───────────┬───────┐
  3. ID column table value_a value_b mandatory all
  4. --- --- --- --- --- --- ---
  5. str str str str str str bool
  6. ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═══════╡
  7. 1 foo A tree Lorem M false
  8. 1 foo B null Ipsum M false
  9. 1 bar C null Dal M true
  10. 1 ham D bean Curry CM false
  11. 1 egg E null Dish M true
  12. └─────┴────────┴───────┴─────────┴─────────┴───────────┴───────┘

答案2

得分: 2

我建议您创建一个子表格,用于检查您的条件,然后您可以将其与主数据框连接。

  1. df.join(
  2. other = df.groupby("column").agg(
  3. null_cond = pl.any(pl.all().is_null()).all()
  4. ),
  5. on="column", how="left"
  6. ).filter((pl.col("mandatory") == "M") & pl.col("null_cond"))

我忘了窗口函数@jqurious 提出了更好且更优雅的解决方案:

  1. df.filter(
  2. (pl.col("mandatory") == "M") &
  3. pl.any(pl.all().is_null()).all().over("column")
  4. )
英文:

I suggest you to create sub-table that will check your condition and then you can join it with main df.

  1. df.join(
  2. other = df.groupby("column").agg(
  3. null_cond = pl.any(pl.all().is_null()).all()
  4. ),
  5. on="column", how="left"
  6. ).filter((pl.col("mandatory") == "M") & pl.col("null_cond"))

And I forgot about window functions! @jqurious has suggested much better & elegant solution:

  1. df.filter(
  2. (pl.col("mandatory") == "M") &
  3. pl.any(pl.all().is_null()).all().over("column")
  4. )
  1. ┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
  2. ID column table value_a value_b mandatory
  3. --- --- --- --- --- ---
  4. str str str str str str
  5. ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
  6. 1 bar C null Dal M
  7. 1 egg E null Dish M
  8. └─────┴────────┴───────┴─────────┴─────────┴───────────┘

huangapple
  • 本文由 发表于 2023年3月4日 01:20:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/75630101.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定