英文:
How to filter DataFrame by any string column matching any regex pattern in a list?
问题
以下是翻译好的部分:
- String columns can be selected with:
df.select(pl.col(pl.Utf8))
- and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:
df.filter(pl.col("feature").str.contains("dangerous"))
- How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.
EDIT 1
Here's a generated df
and patterns
to test functionality and performance.
import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
[
pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
pl.Series("c", faker.sentences(df_len), pl.Utf8),
pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
pl.Series("e", faker.words(df_len), pl.Utf8),
pl.Series(
"f",
[random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
pl.Float32,
),
pl.Series("g", faker.words(df_len), pl.Utf8),
]
)
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]
print(df)
yields:
shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say. ┆ 164 ┆ let ┆ 980.17688 ┆ yard │
│ [binary data] ┆ true ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach ┆ 458.672516 ┆ son │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm ┆ 50.265484 ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29 ┆ operation ┆ 929.911438 ┆ final │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Building sign recently avoid upo... ┆ 132 ┆ practice ┆ 282.743347 ┆ big │
│ [binary data] ┆ false ┆ Paper will board. ┆ 72 ┆ similar ┆ 376.991119 ┆ just │
│ [binary data] ┆ true ┆ Technology money worker spring m... ┆ 140 ┆ sign ┆ 94.24778 ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘
EDIT 2
Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
is:
shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true ┆ During prevent accept seem show ... ┆ 137 ┆ various ┆ 471.238892 ┆ customer │
│ [binary data] ┆ true ┆ Ball always it focus economy bef... ┆ 179 ┆ key ┆ 471.238892 ┆ guy │
│ [binary data] ┆ false ┆ Admit attack energy always. ┆ 175 ┆ purpose ┆ 1281.769775 ┆ wonder │
│ [binary data] ┆ false ┆ Beyond prevent entire staff. ┆ 242 ┆ hair ┆ 904.778687 ┆ around │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Your sure piece simple always so... ┆ 247 ┆ recently ┆ 1055.575073 ┆ laugh │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former ┆ 1061.858276 ┆ always │
│ [binary data] ┆ true ┆ Morning carry event tell prevent... ┆ 3 ┆ entire ┆ 1432.566284 ┆ hit │
<details>
<summary>英文:</summary>
String columns can be selected with:
```lang-python
df.select(pl.col(pl.Utf8))
and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:
df.filter(pl.col("feature").str.contains("dangerous"))
How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.
EDIT 1
Here's a generated df
and patterns
to test functionality and performance.
import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
[
pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
pl.Series("c", faker.sentences(df_len), pl.Utf8),
pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
pl.Series("e", faker.words(df_len), pl.Utf8),
pl.Series(
"f",
[random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
pl.Float32,
),
pl.Series("g", faker.words(df_len), pl.Utf8),
]
)
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]
print(df)
yields:
shape: (10000, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
│ [binary data] ┆ false ┆ Agent every development say. ┆ 164 ┆ let ┆ 980.17688 ┆ yard │
│ [binary data] ┆ true ┆ Beautiful instead ahead despite ... ┆ 210 ┆ reach ┆ 458.672516 ┆ son │
│ [binary data] ┆ false ┆ Information last everything than... ┆ 230 ┆ arm ┆ 50.265484 ┆ standard │
│ [binary data] ┆ false ┆ Choice whatever from behavior be... ┆ 29 ┆ operation ┆ 929.911438 ┆ final │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Building sign recently avoid upo... ┆ 132 ┆ practice ┆ 282.743347 ┆ big │
│ [binary data] ┆ false ┆ Paper will board. ┆ 72 ┆ similar ┆ 376.991119 ┆ just │
│ [binary data] ┆ true ┆ Technology money worker spring m... ┆ 140 ┆ sign ┆ 94.24778 ┆ audience │
│ [binary data] ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘
EDIT 2
Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
is:
shape: (146, 7)
┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ [binary data] ┆ true ┆ During prevent accept seem show ... ┆ 137 ┆ various ┆ 471.238892 ┆ customer │
│ [binary data] ┆ true ┆ Ball always it focus economy bef... ┆ 179 ┆ key ┆ 471.238892 ┆ guy │
│ [binary data] ┆ false ┆ Admit attack energy always. ┆ 175 ┆ purpose ┆ 1281.769775 ┆ wonder │
│ [binary data] ┆ false ┆ Beyond prevent entire staff. ┆ 242 ┆ hair ┆ 904.778687 ┆ around │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ [binary data] ┆ true ┆ Your sure piece simple always so... ┆ 247 ┆ recently ┆ 1055.575073 ┆ laugh │
│ [binary data] ┆ false ┆ Difference all machine let charg... ┆ 178 ┆ former ┆ 1061.858276 ┆ always │
│ [binary data] ┆ true ┆ Morning carry event tell prevent... ┆ 3 ┆ entire ┆ 1432.566284 ┆ hit │
│ [binary data] ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘
答案1
得分: 3
可以将列表转换为单个正则表达式。
df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
英文:
You can turn the list into a single regex.
regex = "|".join(
f"(?:{pattern})" for pattern in
sorted(patterns, key=len, reverse=True)
)
df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
答案2
得分: 1
为了检查每个字符串列的正则表达式模式,您可以使用.fold()
方法:
df = pl.DataFrame({
"a": ["foo", "fo", "foa"],
"b": ["foa", "fo", "foo"]
})
df.filter(
pl.fold(acc=pl.lit(False),
f=lambda acc, col: acc | col.str.contains("foo"),
exprs=pl.col(pl.Utf8))
)
另一种方法是将所有字符串列连接成一个单一列,然后应用正则表达式:
df.filter(
pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
)
英文:
To check regex patterns for each string column, you can use .fold()
method
df = pl.DataFrame({
"a": ["foo", "fo", "foa"],
"b": ["foa", "fo", "foo"]
})
df.filter(
pl.fold(acc=pl.lit(False),
f=lambda acc, col: acc | col.str.contains("foo"),
exprs=pl.col(pl.Utf8))
)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ foo ┆ foa │
│ foa ┆ foo │
└─────┴─────┘
Another way to do it - concat all string columns into a single one and then apply regex expressions:
df.filter(
pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论