如何按列表中的任何正则表达式模式匹配任何字符串列来过滤DataFrame?

huangapple go评论64阅读模式
英文:

How to filter DataFrame by any string column matching any regex pattern in a list?

问题

以下是翻译好的部分:

  1. String columns can be selected with:
df.select(pl.col(pl.Utf8))
  1. and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:
df.filter(pl.col("feature").str.contains("dangerous"))
  1. How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

import random
from faker import Faker
import polars as pl

random.seed(42)
Faker.seed(42)

faker = Faker()

df_len = 10000

df = pl.DataFrame(
    [
        pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
        pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
        pl.Series("c", faker.sentences(df_len), pl.Utf8),
        pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
        pl.Series("e", faker.words(df_len), pl.Utf8),
        pl.Series(
            "f",
            [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
            pl.Float32,
        ),
        pl.Series("g", faker.words(df_len), pl.Utf8),
    ]
)

patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]

print(df) yields:

shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f          ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---        ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32        ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say.        ┆ 164 ┆ let       ┆ 980.17688  ┆ yard     │
│ [binary data] ┆ true  ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach     ┆ 458.672516 ┆ son      │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm       ┆ 50.265484  ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29  ┆ operation ┆ 929.911438 ┆ final    │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...        ┆ ...      │
│ [binary data] ┆ true  ┆ Building sign recently avoid upo... ┆ 132 ┆ practice  ┆ 282.743347 ┆ big      │
│ [binary data] ┆ false ┆ Paper will board.                   ┆ 72  ┆ similar   ┆ 376.991119 ┆ just     │
│ [binary data] ┆ true  ┆ Technology money worker spring m... ┆ 140 ┆ sign      ┆ 94.24778   ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136 ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true  ┆ During prevent accept seem show ... ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
│ [binary data] ┆ true  ┆ Ball always it focus economy bef... ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
│ [binary data] ┆ false ┆ Admit attack energy always.         ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
│ [binary data] ┆ false ┆ Beyond prevent entire staff.        ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...         ┆ ...      │
│ [binary data] ┆ true  ┆ Your sure piece simple always so... ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
│ [binary data] ┆ true  ┆ Morning carry event tell prevent... ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │


<details>
<summary>英文:</summary>

String columns can be selected with:
```lang-python
df.select(pl.col(pl.Utf8))

and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:

df.filter(pl.col(&quot;feature&quot;).str.contains(&quot;dangerous&quot;))

How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

import random
from faker import Faker
import polars as pl

random.seed(42)
Faker.seed(42)

faker = Faker()

df_len = 10000

df = pl.DataFrame(
    [
        pl.Series(&quot;a&quot;, [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
        pl.Series(&quot;b&quot;, [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
        pl.Series(&quot;c&quot;, faker.sentences(df_len), pl.Utf8),
        pl.Series(&quot;d&quot;, [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
        pl.Series(&quot;e&quot;, faker.words(df_len), pl.Utf8),
        pl.Series(
            &quot;f&quot;,
            [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
            pl.Float32,
        ),
        pl.Series(&quot;g&quot;, faker.words(df_len), pl.Utf8),
    ]
)

patterns = [r&quot;(?i)dangerous&quot;, r&quot;always&quot;, r&quot;(?i)prevent&quot;]

print(df) yields:

shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f          ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---        ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32        ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say.        ┆ 164 ┆ let       ┆ 980.17688  ┆ yard     │
│ [binary data] ┆ true  ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach     ┆ 458.672516 ┆ son      │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm       ┆ 50.265484  ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29  ┆ operation ┆ 929.911438 ┆ final    │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...        ┆ ...      │
│ [binary data] ┆ true  ┆ Building sign recently avoid upo... ┆ 132 ┆ practice  ┆ 282.743347 ┆ big      │
│ [binary data] ┆ false ┆ Paper will board.                   ┆ 72  ┆ similar   ┆ 376.991119 ┆ just     │
│ [binary data] ┆ true  ┆ Technology money worker spring m... ┆ 140 ┆ sign      ┆ 94.24778   ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136 ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a             ┆ b     ┆ c                                   ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---           ┆ ---   ┆ ---                                 ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary        ┆ bool  ┆ str                                 ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true  ┆ During prevent accept seem show ... ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
│ [binary data] ┆ true  ┆ Ball always it focus economy bef... ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
│ [binary data] ┆ false ┆ Admit attack energy always.         ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
│ [binary data] ┆ false ┆ Beyond prevent entire staff.        ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
│ ...           ┆ ...   ┆ ...                                 ┆ ... ┆ ...       ┆ ...         ┆ ...      │
│ [binary data] ┆ true  ┆ Your sure piece simple always so... ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
│ [binary data] ┆ true  ┆ Morning carry event tell prevent... ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │
│ [binary data] ┆ false ┆ A third traditional ago.            ┆ 40  ┆ available ┆ 615.752136  ┆ always   │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘

答案1

得分: 3

可以将列表转换为单个正则表达式。
df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
英文:

You can turn the list into a single regex.

regex = &quot;|&quot;.join(
   f&quot;(?:{pattern})&quot; for pattern in 
   sorted(patterns, key=len, reverse=True)
)
df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))

答案2

得分: 1

为了检查每个字符串列的正则表达式模式,您可以使用.fold()方法:

df = pl.DataFrame({
    "a": ["foo", "fo", "foa"],
    "b": ["foa", "fo", "foo"]
})

df.filter(
    pl.fold(acc=pl.lit(False),
            f=lambda acc, col: acc | col.str.contains("foo"),
            exprs=pl.col(pl.Utf8))
)

另一种方法是将所有字符串列连接成一个单一列,然后应用正则表达式:

df.filter(
    pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
)
英文:

To check regex patterns for each string column, you can use .fold() method

df = pl.DataFrame({
    &quot;a&quot;: [&quot;foo&quot;, &quot;fo&quot;, &quot;foa&quot;],
    &quot;b&quot;: [&quot;foa&quot;, &quot;fo&quot;, &quot;foo&quot;]
})

df.filter(
    pl.fold(acc=pl.lit(False),
            f=lambda acc, col: acc | col.str.contains(&quot;foo&quot;),
            exprs=pl.col(pl.Utf8))
)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ foo ┆ foa │
│ foa ┆ foo │
└─────┴─────┘

Another way to do it - concat all string columns into a single one and then apply regex expressions:

df.filter(
    pl.concat_str(pl.col(pl.Utf8)).str.contains(&quot;foo&quot;)
)

huangapple
  • 本文由 发表于 2023年2月16日 15:46:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75469194.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定