如何按列表中的任何正则表达式模式匹配任何字符串列来过滤DataFrame?

huangapple go评论94阅读模式
英文:

How to filter DataFrame by any string column matching any regex pattern in a list?

问题

以下是翻译好的部分:

  1. String columns can be selected with:
  1. df.select(pl.col(pl.Utf8))
  1. and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:
  1. df.filter(pl.col("feature").str.contains("dangerous"))
  1. How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

  1. import random
  2. from faker import Faker
  3. import polars as pl
  4. random.seed(42)
  5. Faker.seed(42)
  6. faker = Faker()
  7. df_len = 10000
  8. df = pl.DataFrame(
  9. [
  10. pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
  11. pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
  12. pl.Series("c", faker.sentences(df_len), pl.Utf8),
  13. pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
  14. pl.Series("e", faker.words(df_len), pl.Utf8),
  15. pl.Series(
  16. "f",
  17. [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
  18. pl.Float32,
  19. ),
  20. pl.Series("g", faker.words(df_len), pl.Utf8),
  21. ]
  22. )
  23. patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]

print(df) yields:

  1. shape: (10000, 7)
  2. ┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
  3. a b c d e f g
  4. --- --- --- --- --- --- ---
  5. binary bool str u8 str f32 str
  6. ╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
  7. [binary data] false Agent every development say. 164 let 980.17688 yard
  8. [binary data] true Beautiful instead ahead despite ... 210 reach 458.672516 son
  9. [binary data] false Information last everything than... 230 arm 50.265484 standard
  10. [binary data] false Choice whatever from behavior be... 29 operation 929.911438 final
  11. ... ... ... ... ... ... ...
  12. [binary data] true Building sign recently avoid upo... 132 practice 282.743347 big
  13. [binary data] false Paper will board. 72 similar 376.991119 just
  14. [binary data] true Technology money worker spring m... 140 sign 94.24778 audience
  15. [binary data] false A third traditional ago. 40 available 615.752136 always
  16. └───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

  1. shape: (146, 7)
  2. ┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
  3. a b c d e f g
  4. --- --- --- --- --- --- ---
  5. binary bool str u8 str f32 str
  6. ╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
  7. [binary data] true During prevent accept seem show ... 137 various 471.238892 customer
  8. [binary data] true Ball always it focus economy bef... 179 key 471.238892 guy
  9. [binary data] false Admit attack energy always. 175 purpose 1281.769775 wonder
  10. [binary data] false Beyond prevent entire staff. 242 hair 904.778687 around
  11. ... ... ... ... ... ... ...
  12. [binary data] true Your sure piece simple always so... 247 recently 1055.575073 laugh
  13. [binary data] false Difference all machine let charg... 178 former 1061.858276 always
  14. [binary data] true Morning carry event tell prevent... 3 entire 1432.566284 hit
  15. <details>
  16. <summary>英文:</summary>
  17. String columns can be selected with:
  18. ```lang-python
  19. df.select(pl.col(pl.Utf8))

and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:

  1. df.filter(pl.col(&quot;feature&quot;).str.contains(&quot;dangerous&quot;))

How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

  1. import random
  2. from faker import Faker
  3. import polars as pl
  4. random.seed(42)
  5. Faker.seed(42)
  6. faker = Faker()
  7. df_len = 10000
  8. df = pl.DataFrame(
  9. [
  10. pl.Series(&quot;a&quot;, [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
  11. pl.Series(&quot;b&quot;, [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
  12. pl.Series(&quot;c&quot;, faker.sentences(df_len), pl.Utf8),
  13. pl.Series(&quot;d&quot;, [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
  14. pl.Series(&quot;e&quot;, faker.words(df_len), pl.Utf8),
  15. pl.Series(
  16. &quot;f&quot;,
  17. [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
  18. pl.Float32,
  19. ),
  20. pl.Series(&quot;g&quot;, faker.words(df_len), pl.Utf8),
  21. ]
  22. )
  23. patterns = [r&quot;(?i)dangerous&quot;, r&quot;always&quot;, r&quot;(?i)prevent&quot;]

print(df) yields:

  1. shape: (10000, 7)
  2. ┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬────────────┬──────────┐
  3. a b c d e f g
  4. --- --- --- --- --- --- ---
  5. binary bool str u8 str f32 str
  6. ╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪════════════╪══════════╡
  7. [binary data] false Agent every development say. 164 let 980.17688 yard
  8. [binary data] true Beautiful instead ahead despite ... 210 reach 458.672516 son
  9. [binary data] false Information last everything than... 230 arm 50.265484 standard
  10. [binary data] false Choice whatever from behavior be... 29 operation 929.911438 final
  11. ... ... ... ... ... ... ...
  12. [binary data] true Building sign recently avoid upo... 132 practice 282.743347 big
  13. [binary data] false Paper will board. 72 similar 376.991119 just
  14. [binary data] true Technology money worker spring m... 140 sign 94.24778 audience
  15. [binary data] false A third traditional ago. 40 available 615.752136 always
  16. └───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex))) is:

  1. shape: (146, 7)
  2. ┌───────────────┬───────┬─────────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
  3. a b c d e f g
  4. --- --- --- --- --- --- ---
  5. binary bool str u8 str f32 str
  6. ╞═══════════════╪═══════╪═════════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
  7. [binary data] true During prevent accept seem show ... 137 various 471.238892 customer
  8. [binary data] true Ball always it focus economy bef... 179 key 471.238892 guy
  9. [binary data] false Admit attack energy always. 175 purpose 1281.769775 wonder
  10. [binary data] false Beyond prevent entire staff. 242 hair 904.778687 around
  11. ... ... ... ... ... ... ...
  12. [binary data] true Your sure piece simple always so... 247 recently 1055.575073 laugh
  13. [binary data] false Difference all machine let charg... 178 former 1061.858276 always
  14. [binary data] true Morning carry event tell prevent... 3 entire 1432.566284 hit
  15. [binary data] false A third traditional ago. 40 available 615.752136 always
  16. └───────────────┴───────┴─────────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘

答案1

得分: 3

  1. 可以将列表转换为单个正则表达式。
  1. df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))
英文:

You can turn the list into a single regex.

  1. regex = &quot;|&quot;.join(
  2. f&quot;(?:{pattern})&quot; for pattern in
  3. sorted(patterns, key=len, reverse=True)
  4. )
  1. df.filter(pl.any(pl.col(pl.Utf8).str.contains(regex)))

答案2

得分: 1

为了检查每个字符串列的正则表达式模式,您可以使用.fold()方法:

  1. df = pl.DataFrame({
  2. "a": ["foo", "fo", "foa"],
  3. "b": ["foa", "fo", "foo"]
  4. })
  5. df.filter(
  6. pl.fold(acc=pl.lit(False),
  7. f=lambda acc, col: acc | col.str.contains("foo"),
  8. exprs=pl.col(pl.Utf8))
  9. )

另一种方法是将所有字符串列连接成一个单一列,然后应用正则表达式:

  1. df.filter(
  2. pl.concat_str(pl.col(pl.Utf8)).str.contains("foo")
  3. )
英文:

To check regex patterns for each string column, you can use .fold() method

  1. df = pl.DataFrame({
  2. &quot;a&quot;: [&quot;foo&quot;, &quot;fo&quot;, &quot;foa&quot;],
  3. &quot;b&quot;: [&quot;foa&quot;, &quot;fo&quot;, &quot;foo&quot;]
  4. })
  5. df.filter(
  6. pl.fold(acc=pl.lit(False),
  7. f=lambda acc, col: acc | col.str.contains(&quot;foo&quot;),
  8. exprs=pl.col(pl.Utf8))
  9. )
  1. ┌─────┬─────┐
  2. a b
  3. --- ---
  4. str str
  5. ╞═════╪═════╡
  6. foo foa
  7. foa foo
  8. └─────┴─────┘

Another way to do it - concat all string columns into a single one and then apply regex expressions:

  1. df.filter(
  2. pl.concat_str(pl.col(pl.Utf8)).str.contains(&quot;foo&quot;)
  3. )

huangapple
  • 本文由 发表于 2023年2月16日 15:46:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/75469194.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定