英文:
Detect rows where uniqueness is not given in polars
问题
I understand that you want to translate the provided code and text. Here is the translated content:
目前我有以下问题。我需要检查列值`ID`,`table`和`value_a`是否不唯一时是否存在侵权。
df = pl.DataFrame(
{
"ID": ["1", "1", "1", "1", "1"],
"column": ["foo", "foo", "bar", "ham", "egg"],
"table": ["A", "A", "C", "D", "E"],
"value_a": ["tree", tree, None, "bean", None,],
"value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
"mandatory": ["M", "M", "M", "CM", "M"]
}
)
print(df)
shape: (5, 6)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
│ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M │
│ 1 ┆ foo ┆ B ┆ tree ┆ Ipsum ┆ M │
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M │
│ 1 ┆ ham ┆ D ┆ bean ┆ Curry ┆ M │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M │
└─────┴────────┴───────┴─────────┴─────────┴───────────┘
对于df,应创建侵权报告,其中包含以下专用输出:
shape: (2, 8)
┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
│ index ┆ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
│ 0 ┆ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M ┆ 行值不唯一 │
│ 1 ┆ 1 ┆ foo ┆ A ┆ tree ┆ Ipsum ┆ M ┆ 行值不唯一 │
└───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘
报告应包含一个`index`和一个`warning`列。我使用以下代码行来识别行中是否有任何空值:
report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
.with_columns(pl.lit("检测到缺失值").alias("warning"))
)
如何调整此代码以便在一方面检测缺失值,另一方面识别不唯一的行?也许我可以创建两个报告,然后使用`.vstack()`将两个报告组合成最终报告。您会如何解决这个问题?
英文:
Currently I have the following problem. I have to check if there is an infringement if the column values ID
, table
and value_a
are not unique.
df = pl.DataFrame(
{
"ID": ["1", "1", "1", "1", "1"],
"column": ["foo", "foo", "bar", "ham", "egg"],
"table": ["A", "A", "C", "D", "E"],
"value_a": ["tree", tree, None, "bean", None,],
"value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
"mandatory": ["M", "M", "M", "CM", "M"]
}
)
print(df)
shape: (5, 6)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
│ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
│ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M │
│ 1 ┆ foo ┆ B ┆ tree ┆ Ipsum ┆ M │
│ 1 ┆ bar ┆ C ┆ null ┆ Dal ┆ M │
│ 1 ┆ ham ┆ D ┆ bean ┆ Curry ┆ M │
│ 1 ┆ egg ┆ E ┆ null ┆ Dish ┆ M │
└─────┴────────┴───────┴─────────┴─────────┴───────────┘
In the case of df a infringement report should be created with the following dedicated output:
shape: (2, 8)
┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
│ index ┆ ID ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
│ 0 ┆ 1 ┆ foo ┆ A ┆ tree ┆ Lorem ┆ M ┆ Row value is not unique │
│ 1 ┆ 1 ┆ foo ┆ A ┆ tree ┆ Ipsum ┆ M ┆ Row value is not unique │
└───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘
The report should contain an index
and a warning
column. I used this line of code to identify if there are any null values in a row:
report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
.with_columns(pl.lit("Missing value detected").alias("warning"))
)
How do I need to adapt this code so on the one hand I detect missing values and on the other hand I identify ununique rows. Maybe I create two reports and use .vstack() to combine both reports to a final one. How would you solve it?
答案1
得分: 3
你可以创建一个 struct
并使用 .is_duplicated
df.with_columns(
warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
)
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
│ ID | column | table | value_a | value_b | mandatory | warning │
│ --- | --- | --- | --- | --- | --- | --- │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
│ 1 | foo | A | tree | Lorem | M | true │
│ 1 | foo | A | tree | Ipsum | M | true │
│ 1 | bar | C | null | Dal | M | false │
│ 1 | ham | D | bean | Curry | CM | false │
│ 1 | egg | E | null | Dish | M | false │
└─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘
英文:
You can create a struct
and use .is_duplicated
df.with_columns(
warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
)
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
│ ID | column | table | value_a | value_b | mandatory | warning │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | bool │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
│ 1 | foo | A | tree | Lorem | M | true │
│ 1 | foo | A | tree | Ipsum | M | true │
│ 1 | bar | C | null | Dal | M | false │
│ 1 | ham | D | bean | Curry | CM | false │
│ 1 | egg | E | null | Dish | M | false │
└─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论