2023-03-09 21:01:30

Detect rows where uniqueness is not given in Polars

Question

I currently have the following problem: I have to report an infringement whenever the combination of the column values `ID`, `table` and `value_a` is not unique.

    import polars as pl

    df = pl.DataFrame(
        {
            "ID": ["1", "1", "1", "1", "1"],
            "column": ["foo", "foo", "bar", "ham", "egg"],
            "table": ["A", "A", "C", "D", "E"],
            "value_a": ["tree", "tree", None, "bean", None],
            "value_b": ["Lorem", "Ipsum", "Dal", "Curry", "Dish"],
            "mandatory": ["M", "M", "M", "CM", "M"],
        }
    )
    print(df)

    shape: (5, 6)
    ┌─────┬────────┬───────┬─────────┬─────────┬───────────┐
    │ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory │
    │ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       │
    │ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       │
    ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╡
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         │
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         │
    │ 1   ┆ bar    ┆ C     ┆ null    ┆ Dal     ┆ M         │
    │ 1   ┆ ham    ┆ D     ┆ bean    ┆ Curry   ┆ CM        │
    │ 1   ┆ egg    ┆ E     ┆ null    ┆ Dish    ┆ M         │
    └─────┴────────┴───────┴─────────┴─────────┴───────────┘

For this `df`, an infringement report should be created with the following output:

    shape: (2, 8)
    ┌───────┬─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────────────────────┐
    │ index ┆ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning                 │
    │ ---   ┆ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       ┆ ---                     │
    │ i64   ┆ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       ┆ str                     │
    ╞═══════╪═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════════════════════╡
    │ 0     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         ┆ Row value is not unique │
    │ 1     ┆ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         ┆ Row value is not unique │
    └───────┴─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────────────────────┘

The report should contain an `index` column and a `warning` column. I used this code to identify whether a row contains any null values:

    report = (
        df.with_row_count("index")
        .filter(pl.any(pl.col("*").is_null()) & pl.col("mandatory").eq("M"))
        .with_columns(pl.lit("Missing value detected").alias("warning"))
    )

How do I need to adapt this code so that it detects missing values and also identifies non-unique rows? Maybe I could create two reports and use `.vstack()` to combine them into a final one. How would you solve it?

Answer 1

Score: 3

You can create a struct from the three columns and call `.is_duplicated()` on it:

    df.with_columns(
        warning = pl.struct(["ID", "table", "value_a"]).is_duplicated()
    )

    shape: (5, 7)
    ┌─────┬────────┬───────┬─────────┬─────────┬───────────┬─────────┐
    │ ID  ┆ column ┆ table ┆ value_a ┆ value_b ┆ mandatory ┆ warning │
    │ --- ┆ ---    ┆ ---   ┆ ---     ┆ ---     ┆ ---       ┆ ---     │
    │ str ┆ str    ┆ str   ┆ str     ┆ str     ┆ str       ┆ bool    │
    ╞═════╪════════╪═══════╪═════════╪═════════╪═══════════╪═════════╡
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Lorem   ┆ M         ┆ true    │
    │ 1   ┆ foo    ┆ A     ┆ tree    ┆ Ipsum   ┆ M         ┆ true    │
    │ 1   ┆ bar    ┆ C     ┆ null    ┆ Dal     ┆ M         ┆ false   │
    │ 1   ┆ ham    ┆ D     ┆ bean    ┆ Curry   ┆ CM        ┆ false   │
    │ 1   ┆ egg    ┆ E     ┆ null    ┆ Dish    ┆ M         ┆ false   │
    └─────┴────────┴───────┴─────────┴─────────┴───────────┴─────────┘
