英文:
Detect missing values in polars based on conditions
问题
我现在正在识别具有缺失值的行并将它们过滤到报告中。我正在添加一个索引和一个"warning"列。
report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
.with_columns(pl.lit("Missing value detected").alias("warning"))
)
当前的要求是,如果在"mandatory"列中出现值"M"并且在同一行中出现缺失值,那么这一行将被添加到报告中。现在有一个新的附加要求。我将根据示例数据集来解释它。在"column"列中可能出现相同的值,比如"foo"。根据第一条规则,第2行应该被写入报告。然而,现在应该有一个异常条件来阻止这种情况。如果已经有一行在"column"列中没有零值,那么其他相同的值在进一步考虑中将不被考虑。在这种情况下,第1行满足这个条件,所以第2行不应该被考虑。
df = pl.DataFrame(
{
"ID": ["1", "1", "1", "1", "1"],
"column": ["foo", "foo", "bar", "ham", "egg"],
"table": ["A", "B", "C", "D", "E"],
"value_a": ["tree", None, None, "bean", None,],
"value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
"mandatory": ["M", "M", "M", "CM", "M"],
}
)
期望的报告应该如下所示:
shape: (2, 8)
┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬────────────────────────┐
│ Index ┆ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ f64 ┆ str ┆ str ┆ str │
╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪════════════════════════╡
│ 1 ┆ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M ┆ Missing value detected │
│ 2 ┆ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M ┆ Missing value detected │
└───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴────────────────────────┘
如何实现附加条件来正确识别具有缺失值的行呢?
要实现附加条件,您可以使用以下代码:
report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M") & ~pl.col("column").is_not_null())
.with_columns(pl.lit("Missing value detected").alias("warning"))
)
这里的关键是在过滤条件中添加了& ~pl.col("column").is_not_null()
,它表示如果在同一组中存在至少一个具有非空值的"column",则不包括该组。这将满足您的附加要求,确保在已经有一行没有零值的情况下,其他相同值的行不被考虑。
英文:
I am currently identifying rows with missing values and filtering them into a report. I am adding an index and a "warning" column.
report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
.with_columns(pl.lit("Missing value detected").alias("warning"))
)
The current requirement is that if the value "M" occurs in column "mandatory" and a missing value occurs in the same row, this row is added to the report. Now there is a new additional requirement. I will explain it on the basis of the example dataset. It can be that in the column "column" can be the same values. In this case "foo". According to the first rule, row 2 should be written into the report. However, an exception condition should now prevent this. If already one row has no zeros in the column "column", all other equal values are not considered in the further consideration. In this case, row 1 fulfills this condition, so row two should not be considered.
df = pl.DataFrame(
{
"ID": ["1", "1", "1", "1", "1"],
"column": ["foo", "foo", "bar", "ham", "egg"],
"table": ["A", "B", "C", "D", "E"],
"value_a": ["tree", None, None, "bean", None,],
"value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
"mandatory": ["M", "M", "M", "CM", "M"],
}
)
print(df)
shape: (5, 6)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
│ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M │
│ 1 ┆ foo ┆ B ┆ null ┆ Ipsum ┆ M │
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M │
│ 1 ┆ ham ┆ D ┆ bean ┆ Curry ┆ CM │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M │
└─────┴────────┴───────┴─────────┴─────────┴───────────┘
The expected report should look like this
shape: (2, 8)
┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬────────────────────────┐
│ Index ┆ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ f64 ┆ str ┆ str ┆ str │
╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪════════════════════════╡
│ 1 ┆ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M ┆ Missing value detected │
│ 2 ┆ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M ┆ Missing value detected │
└───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴────────────────────────┘
How can I implement the additional condition to identfy the rows with missing values correctly?
答案1
得分: 4
这是一个定义附加规则的方式,可以与其他规则组合使用:
df.with_columns(~pl.all(pl.col("column").is_not_null()).any().over("column"))
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬───────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ all │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ bool │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═══════╡
│ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M ┆ false │
│ 1 ┆ foo ┆ B ┆ null ┆ Ipsum ┆ M ┆ false │
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M ┆ true │
│ 1 ┆ ham ┆ D ┆ bean ┆ Curry ┆ CM ┆ false │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M ┆ true │
└─────┴────────┴───────┴─────────┴─────────┴───────────┴───────┘
英文:
Here's one way to define the additional rule which could be combined with the others:
df.with_columns(~pl.all(pl.col("*").is_not_null()).any().over("column"))
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬───────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ all │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ bool │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═══════╡
│ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M ┆ false │
│ 1 ┆ foo ┆ B ┆ null ┆ Ipsum ┆ M ┆ false │
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M ┆ true │
│ 1 ┆ ham ┆ D ┆ bean ┆ Curry ┆ CM ┆ false │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M ┆ true │
└─────┴────────┴───────┴─────────┴─────────┴───────────┴───────┘
答案2
得分: 2
我建议您创建一个子表格,用于检查您的条件,然后您可以将其与主数据框连接。
df.join(
other = df.groupby("column").agg(
null_cond = pl.any(pl.all().is_null()).all()
),
on="column", how="left"
).filter((pl.col("mandatory") == "M") & pl.col("null_cond"))
我忘了窗口函数!@jqurious 提出了更好且更优雅的解决方案:
df.filter(
(pl.col("mandatory") == "M") &
pl.any(pl.all().is_null()).all().over("column")
)
英文:
I suggest you to create sub-table that will check your condition and then you can join it with main df.
df.join(
other = df.groupby("column").agg(
null_cond = pl.any(pl.all().is_null()).all()
),
on="column", how="left"
).filter((pl.col("mandatory") == "M") & pl.col("null_cond"))
And I forgot about window functions! @jqurious has suggested much better & elegant solution:
df.filter(
(pl.col("mandatory") == "M") &
pl.any(pl.all().is_null()).all().over("column")
)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M │
└─────┴────────┴───────┴─────────┴─────────┴───────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论