Pyspark regexp_extract does not recognize '=' as a character?


Question



I am trying to write a regular expression to match URLs that contain text.csv, followed by a single-letter parameter, followed by a value of 10 to 25 characters.

I have this expression, which works with pandas but does not work in Pyspark.

text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$

I have found that the problem is with the = sign, but I am not sure how to fix it. Here is a reproducible example.

from pyspark.sql import functions, types

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]

schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])
 
df = spark.createDataFrame(data=data2, schema=schema)

my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', functions.expr(f"regexp_extract(url, '{my_regex}', 0) != ''"))

df.show(truncate=False)
# --- Result ---
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |false   |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |false   |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+

# -- Desired Result --
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+

Any ideas?

Spark version: 3.3.1
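As a sanity check outside Spark, the same pattern matches exactly the rows I want with Python's re module, so the pattern itself seems fine and the failure looks Spark-specific:

```python
import re

pattern = re.compile(r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$')

rows = [
    ("http://daasd.com/text.csv?c=uss1zhv1imikb4w", True),
    ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", False),
    ("http://dunfdas.com/bread.csv?c=4968698835", False),
    ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", False),
    ("http://daosj.com/text.csv?c=h7hgk1r3o3", True),
    ("http://daosj.com/text.csv?c=h7hg", False),
]
for url, expected in rows:
    # re agrees with the desired result on every row
    assert bool(pattern.search(url)) == expected
```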

Answer 1

Score: 2

Try the .rlike function.

Example:

from pyspark.sql import functions, types
from pyspark.sql.functions import col

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]

schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)

my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', col("url").rlike(my_regex))

df.show(truncate=False)
# +--------------------------------------------+---+--------+
# |url                                         |val|is_match|
# +--------------------------------------------+---+--------+
# |http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
# |http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
# |http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
# |http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
# |http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
# |http://daosj.com/text.csv?c=h7hg            |1  |true    |
# +--------------------------------------------+---+--------+
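A note on why the original expr() version likely fails (my inference; the answer above does not explain it): expr() embeds the pattern in a SQL string literal, and Spark's SQL parser, like Hive and MySQL, drops the backslash from unrecognized escapes such as \?, so the regex engine never sees the intended pattern. rlike receives the pattern as a plain string with no SQL parsing in between. The effect can be simulated with plain Python's re:

```python
import re

url = "http://daasd.com/text.csv?c=uss1zhv1imikb4w"
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'

# What the SQL string-literal parser plausibly hands to the regex
# engine: the backslash of the unrecognized escape '\?' is dropped.
mangled = my_regex.replace(r'\?', '?')

assert re.search(my_regex, url) is not None  # intended pattern matches
assert re.search(mangled, url) is None       # mangled pattern does not -> 'false'
```

If that is the cause, doubling the backslash inside the expr() string (writing \\? so the SQL parser emits \?) should also make the original regexp_extract call work; rlike simply sidesteps the issue.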

huangapple
  • Posted on 2023-05-11 02:20:45
  • Please keep this link when reposting: https://go.coder-hub.com/76221533.html