Pyspark regexp_extract does not recognize '=' as a character?

Question
I am trying to write a regular expression to match the urls that have text.csv, followed by a single letter parameter, followed by a set of characters that is between 10-25 characters.
I have this expression, which works with pandas, but does not work in PySpark.
text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$
I have found that the problem is with the = sign, but I am not sure how to fix it. Here is a reproducible example.
from pyspark.sql import functions, types

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]

schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)

my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', functions.expr(f"regexp_extract(url, '{my_regex}', 0) != ''"))
df.show(truncate=False)
# --- Result ---
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |false   |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |false   |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+
# -- Desired Result --
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+
Any ideas?
Spark version: 3.3.1
Answer 1

Score: 2

Try using the .rlike function.

Example:
from pyspark.sql import functions, types
from pyspark.sql.functions import col

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]

schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)

my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', col("url").rlike(my_regex))
df.show(truncate=False)

# +--------------------------------------------+---+--------+
# |url                                         |val|is_match|
# +--------------------------------------------+---+--------+
# |http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
# |http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
# |http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
# |http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
# |http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
# |http://daosj.com/text.csv?c=h7hg            |1  |false   |
# +--------------------------------------------+---+--------+
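Editor's note, not part of the original answer: the likely reason the expr() version fails is not the = sign at all, but backslash handling. With Spark's default settings, the SQL parser consumes one level of backslashes in string literals, so the \? interpolated into the regexp_extract call arrives at the regex engine as a bare ?, changing the pattern's meaning. .rlike receives the pattern as an ordinary Python string, so no extra escaping is needed. A minimal sketch of this (pure Python, no Spark session required; the doubled-escaping step assumes Spark's default spark.sql.parser.escapedStringLiterals=false, and the escaped dot in text\.csv is a small hardening over the original pattern):

```python
import re

# The pattern itself is valid: Python's re engine (what pandas uses
# under the hood) matches it directly, = included.
my_regex = r'text\.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
assert re.search(my_regex, "http://daasd.com/text.csv?c=uss1zhv1imikb4w") is not None
assert re.search(my_regex, "http://daosj.com/text.csv?c=h7hg") is None  # only 4 chars after =

# If you still want the expr()/regexp_extract route, double the
# backslashes before interpolating into the SQL string, so that one
# level survives the SQL parser:
sql_regex = my_regex.replace("\\", "\\\\")
print(sql_regex)
```

The string printed at the end is what you would splice into the f-string passed to functions.expr.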