Pyspark regexp_extract does not recognize '=' as a character?


Question


I am trying to write a regular expression that matches URLs containing text.csv, followed by a single-letter parameter, followed by a value that is between 10 and 25 characters long.

I have this expression, which works with pandas, but does not work in Pyspark:

text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$
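As a quick sanity check (editorial sketch, not part of the original post), the pattern itself is well-formed and behaves as described when tried with Python's re module, which is the engine pandas' string matching is built on:

```python
import re

# The pattern from the question, tried with Python's re module
# (pandas' string matching uses this same module).
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'

urls = [
    "http://daasd.com/text.csv?c=uss1zhv1imikb4w",  # 15 chars after '='
    "http://daosj.com/text.csv?c=h7hg",             # only 4 chars after '='
]
matches = [bool(re.search(my_regex, u)) for u in urls]
print(matches)  # [True, False]
```

So the regex is fine as a regex; the failure is specific to how it reaches Spark.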

I have found that the problem is with the = sign, but I am not sure how to fix it. Here is a reproducible example.

from pyspark.sql import functions, types

data2 = [
    ("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
    ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
    ("http://dunfdas.com/bread.csv?c=4968698835", 8),
    ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
    ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
    ("http://daosj.com/text.csv?c=h7hg", 1),
]
schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', functions.expr(f"regexp_extract(url, '{my_regex}', 0) != ''"))
df.show(truncate=False)
# --- Result ---
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |false   |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |false   |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+

# --- Desired Result ---
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+

Any ideas?

Spark version: 3.3.1

Answer 1

Score: 2

Try the .rlike function.

Example:

from pyspark.sql import types
from pyspark.sql.functions import col

data2 = [
    ("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
    ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
    ("http://dunfdas.com/bread.csv?c=4968698835", 8),
    ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
    ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
    ("http://daosj.com/text.csv?c=h7hg", 1),
]
schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', col("url").rlike(my_regex))
df.show(truncate=False)
# +--------------------------------------------+---+--------+
# |url                                         |val|is_match|
# +--------------------------------------------+---+--------+
# |http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
# |http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
# |http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
# |http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
# |http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
# |http://daosj.com/text.csv?c=h7hg            |1  |false   |
# +--------------------------------------------+---+--------+
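A likely explanation for why the original expr version returned all false (a sketch of the escaping, not a definitive account of Spark internals): when the pattern is embedded in a SQL string handed to functions.expr, Spark's SQL parser consumes one level of backslash escapes, so the regex engine receives ? instead of \?, and `csv?` then means "cs followed by an optional v". This can be illustrated in plain Python:

```python
import re

pattern_written = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
# Spark's SQL parser strips one level of backslashes from string
# literals, so the regex engine is handed the pattern without the '\':
pattern_after_sql = pattern_written.replace('\\?', '?')

url = "http://daasd.com/text.csv?c=uss1zhv1imikb4w"
print(bool(re.search(pattern_written, url)))    # True  -- what pandas matches
print(bool(re.search(pattern_after_sql, url)))  # False -- 'csv?' now means optional 'v'
```

If expr is preferred, doubling the backslash in the Python string (r'text.csv\\?[a-z]+=[a-zA-Z0-9]{10,25}$') should let \? survive the SQL parsing; rlike sidesteps the issue entirely because the DataFrame API passes the pattern to the regex engine untouched.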

huangapple
  • Published on 2023-05-11 02:20:45
  • Please retain this link when reposting: https://go.coder-hub.com/76221533.html