检测在Polars中未给定唯一性的行。

huangapple go评论107阅读模式
英文:

Detect rows where uniqueness is not given in polars

问题

I understand that you want to translate the provided code and text. Here is the translated content:

目前我有以下问题我需要检查列值`ID`,`table``value_a`是否不唯一时是否存在侵权

        df = pl.DataFrame(
            {
            "ID": ["1", "1", "1", "1", "1"],
            "column": ["foo", "foo", "bar", "ham", "egg"],
            "table": ["A", "A", "C", "D", "E"],
            "value_a": ["tree", tree, None, "bean", None,],
            "value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
            "mandatory": ["M", "M", "M", "CM", "M"]
            }
        )
        print(df)
        shape: (5, 6)
        ┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
         ID   column  table  value_a  value_b  mandatory 
         ---  ---     ---    ---      ---      ---       
         str  str     str    str      str      str       
        ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
         1    foo     A      tree     Lorem    M         
         1    foo     B      tree     Ipsum    M         
         1    bar     C      null     Dal      M         
         1    ham     D      bean     Curry    M         
         1    egg     E      null     Dish     M         
        └─────┴────────┴───────┴─────────┴─────────┴───────────┘

对于df应创建侵权报告其中包含以下专用输出

    shape: (2, 8)
    ┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
     index  ID   column  table  value_a  value_b  mandatory  warning                 
     ---    ---  ---     ---    ---      ---      ---        ---                     
     i64    str  str     str    str      str      str        str                     
    ╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
     0      1    foo     A      tree     Lorem    M          行值不唯一            
     1      1    foo     A      tree     Ipsum    M          行值不唯一            
    └───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘

报告应包含一个`index`和一个`warning`我使用以下代码行来识别行中是否有任何空值

    report = (df.with_row_count("index")
                    .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
                    .with_columns(pl.lit("检测到缺失值").alias("warning"))
             )

如何调整此代码以便在一方面检测缺失值另一方面识别不唯一的行也许我可以创建两个报告然后使用`.vstack()`将两个报告组合成最终报告您会如何解决这个问题
英文:

Currently I have the following problem. I have to check if there is an infringement if the column values ID, table and value_a are not unique.

    df = pl.DataFrame(
{
"ID": ["1", "1", "1", "1", "1"],
"column": ["foo", "foo", "bar", "ham", "egg"],
"table": ["A", "A", "C", "D", "E"],
"value_a": ["tree", tree, None, "bean", None,],
"value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish",],
"mandatory": ["M", "M", "M", "CM", "M"]
}
)
print(df)
shape: (5, 6)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
│ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
│ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       │
│ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
│ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         │
│ 1   ┆ foo    ┆ B     ┆ tree    ┆ Ipsum   ┆ M         │
│ 1   ┆ bar    ┆ C     ┆ null    ┆ Dal     ┆ M         │
│ 1   ┆ ham    ┆ D     ┆ bean    ┆ Curry   ┆ M         │
│ 1   ┆ egg    ┆ E     ┆ null    ┆ Dish    ┆ M         │
└─────┴────────┴───────┴─────────┴─────────┴───────────┘

In the case of df a infringement report should be created with the following dedicated output:

shape: (2, 8)
┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
│ index ┆ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning                 │
│ ---   ┆ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       ┆ ---                     │
│ i64   ┆ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       ┆ str                     │
╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
│ 0     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         ┆ Row value is not unique │
│ 1     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         ┆ Row value is not unique │
└───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘

The report should contain an index and a warning column. I used this line of code to identify if there are any null values in a row:

report = (df.with_row_count("index")
.filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
.with_columns(pl.lit("Missing value detected").alias("warning"))
)

How do I need to adapt this code so on the one hand I detect missing values and on the other hand I identify ununique rows. Maybe I create two reports and use .vstack() to combine both reports to a final one. How would you solve it?

答案1

得分: 3

你可以创建一个 struct 并使用 .is_duplicated

df.with_columns(
   warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
)
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
 ID  | column | table | value_a | value_b | mandatory | warning 
 --- | ---    | ---   | ---     | ---     | ---       | ---     
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
 1   | foo    | A     | tree    | Lorem   | M         | true    
 1   | foo    | A     | tree    | Ipsum   | M         | true    
 1   | bar    | C     | null    | Dal     | M         | false   
 1   | ham    | D     | bean    | Curry   | CM        | false   
 1   | egg    | E     | null    | Dish    | M         | false   
└─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘
英文:

You can create a struct and use .is_duplicated

df.with_columns(
warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
)
shape: (5, 7)
┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
│ ID  | column | table | value_a | value_b | mandatory | warning │
│ --- | ---    | ---   | ---     | ---     | ---       | ---     │
│ str | str    | str   | str     | str     | str       | bool    │
╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
│ 1   | foo    | A     | tree    | Lorem   | M         | true    │
│ 1   | foo    | A     | tree    | Ipsum   | M         | true    │
│ 1   | bar    | C     | null    | Dal     | M         | false   │
│ 1   | ham    | D     | bean    | Curry   | CM        | false   │
│ 1   | egg    | E     | null    | Dish    | M         | false   │
└─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘

huangapple
  • 本文由 发表于 2023年3月9日 21:01:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/75684994.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定