2023年2月16日 15:46:49go评论94阅读模式

英文:

How to filter DataFrame by any string column matching any regex pattern in a list?

问题

以下是翻译好的部分：

String columns can be selected with:

df.select(pl.col(pl.Utf8))

and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:

df.filter(pl.col("feature").str.contains("dangerous"))

How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
    [
        pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
        pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
        pl.Series("c", faker.sentences(df_len), pl.Utf8),
        pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
        pl.Series("e", faker.words(df_len), pl.Utf8),
        pl.Series(
            "f",
            [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
            pl.Float32,
        ),
        pl.Series("g", faker.words(df_len), pl.Utf8),
    ]
)
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]

print(df) yields:

shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f          ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---        ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32        ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say.        ┆ 164 ┆ let       ┆ 980.17688  ┆ yard     │
│ [binary data] ┆ true  ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach     ┆ 458.672516 ┆ son      │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm       ┆ 50.265484  ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29  ┆ operation ┆ 929.911438 ┆ final    │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...        ┆ ...      │
│ [binary data] ┆ true  ┆ Building sign recently avoid upo... ┆ 132 ┆ practice  ┆ 282.743347 ┆ big      │
│ [binary data] ┆ false ┆ Paper will board.                   ┆ 72  ┆ similar   ┆ 376.991119 ┆ just     │
│ [binary data] ┆ true  ┆ Technology money worker spring m... ┆ 140 ┆ sign      ┆ 94.24778   ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136 ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true  ┆ During prevent accept seem show ... ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
│ [binary data] ┆ true  ┆ Ball always it focus economy bef... ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
│ [binary data] ┆ false ┆ Admit attack energy always.         ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
│ [binary data] ┆ false ┆ Beyond prevent entire staff.        ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...         ┆ ...      │
│ [binary data] ┆ true  ┆ Your sure piece simple always so... ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
│ [binary data] ┆ true  ┆ Morning carry event tell prevent... ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │
<details>
<summary>英文:</summary>
String columns can be selected with:
```lang-python
df.select(pl.col(pl.Utf8))

and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:

df.filter(pl.col(&quot;feature&quot;).str.contains(&quot;dangerous&quot;))

How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
    [
        pl.Series(&quot;a&quot;, [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
        pl.Series(&quot;b&quot;, [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
        pl.Series(&quot;c&quot;, faker.sentences(df_len), pl.Utf8),
        pl.Series(&quot;d&quot;, [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
        pl.Series(&quot;e&quot;, faker.words(df_len), pl.Utf8),
        pl.Series(
            &quot;f&quot;,
            [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
            pl.Float32,
        ),
        pl.Series(&quot;g&quot;, faker.words(df_len), pl.Utf8),
    ]
)
patterns = [r&quot;(?i)dangerous&quot;, r&quot;always&quot;, r&quot;(?i)prevent&quot;]

print(df) yields:

shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f          ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---        ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32        ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say.        ┆ 164 ┆ let       ┆ 980.17688  ┆ yard     │
│ [binary data] ┆ true  ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach     ┆ 458.672516 ┆ son      │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm       ┆ 50.265484  ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29  ┆ operation ┆ 929.911438 ┆ final    │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...        ┆ ...      │
│ [binary data] ┆ true  ┆ Building sign recently avoid upo... ┆ 132 ┆ practice  ┆ 282.743347 ┆ big      │
│ [binary data] ┆ false ┆ Paper will board.                   ┆ 72  ┆ similar   ┆ 376.991119 ┆ just     │
│ [binary data] ┆ true  ┆ Technology money worker spring m... ┆ 140 ┆ sign      ┆ 94.24778   ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136 ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true  ┆ During prevent accept seem show ... ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
│ [binary data] ┆ true  ┆ Ball always it focus economy bef... ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
│ [binary data] ┆ false ┆ Admit attack energy always.         ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
│ [binary data] ┆ false ┆ Beyond prevent entire staff.        ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...         ┆ ...      │
│ [binary data] ┆ true  ┆ Your sure piece simple always so... ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
│ [binary data] ┆ true  ┆ Morning carry event tell prevent... ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136  ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘

答案1

得分: 3

可以将列表转换为单个正则表达式。

df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))

英文:

You can turn the list into a single regex.

regex = &quot;|&quot;.join(
   f&quot;(?:{pattern})&quot; for pattern in 
   sorted(patterns, key=len, reverse=True)
)

df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))

答案2

得分: 1

为了检查每个字符串列的正则表达式模式，您可以使用.fold()方法：

df = pl.DataFrame({
    "a": ["foo", "fo", "foa"],
    "b": ["foa", "fo", "foo"]
})
df.filter(
    pl.fold(acc=pl.lit(False),
            f=lambda acc, col: acc | col.str.contains("foo"),
            exprs=pl.col(pl.Utf8))
)

另一种方法是将所有字符串列连接成一个单一列，然后应用正则表达式：

df.filter(
    pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
)

英文:

To check regex patterns for each string column, you can use .fold() method

df = pl.DataFrame({
    &quot;a&quot;: [&quot;foo&quot;, &quot;fo&quot;, &quot;foa&quot;],
    &quot;b&quot;: [&quot;foa&quot;, &quot;fo&quot;, &quot;foo&quot;]
})
df.filter(
    pl.fold(acc=pl.lit(False),
            f=lambda acc, col: acc | col.str.contains(&quot;foo&quot;),
            exprs=pl.col(pl.Utf8))
)

┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ foo ┆ foa │
│ foa ┆ foo │
└─────┴─────┘

Another way to do it - concat all string columns into a single one and then apply regex expressions:

df.filter(
    pl.concat_str(pl.col(pl.Utf8)).str.contains(&quot;foo&quot;)
)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何按列表中的任何正则表达式模式匹配任何字符串列来过滤DataFrame？

问题

答案1

答案2

应用程序在PyCharm中运行，但在命令行中不运行（ModuleNotFoundError）。

在Linux上安装tensorflow-decision-forests的问题

奇怪的行为：在Python中将整数乘以负幂。

如何在Python的tkinter中从链接菜单项传递值？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。