May 11, 2023, 02:20:45

Pyspark regexp_extract does not recognize '=' as a character?

Question

I am trying to write a regular expression that matches URLs containing text.csv, followed by a single-letter query parameter, followed by a value of 10 to 25 alphanumeric characters.

I have this expression, which works with pandas but does not work in PySpark:

text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$

I have found that the problem is with the = sign, but I am not sure how to fix it. Here is a reproducible example.

from pyspark.sql import functions, types

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]
schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
df = df.withColumn('is_match', functions.expr(f"regexp_extract(url, '{my_regex}', 0) != ''"))
df.show(truncate=False)

# --- Result ---
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |false   |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |false   |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+
# -- Desired Result --
+--------------------------------------------+---+--------+
|url                                         |val|is_match|
+--------------------------------------------+---+--------+
|http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |True    |
|http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
|http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
|http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
|http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |True    |
|http://daosj.com/text.csv?c=h7hg            |1  |false   |
+--------------------------------------------+---+--------+

Any ideas?

Spark version: 3.3.1

Answer 1

Score: 2

Try using the .rlike function.

Example:

from pyspark.sql import types
from pyspark.sql.functions import col

data2 = [("http://daasd.com/text.csv?c=uss1zhv1imikb4w", 2),
         ("http://oasnd.com/car.csv?c=913fh7n83n19ms98", 4),
         ("http://dunfdas.com/bread.csv?c=4968698835", 8),
         ("http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs", 7),
         ("http://daosj.com/text.csv?c=h7hgk1r3o3", 1),
         ("http://daosj.com/text.csv?c=h7hg", 1),
         ]
schema = types.StructType([
    types.StructField("url", types.StringType(), True),
    types.StructField("val", types.IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)
my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'
# rlike receives the pattern through the Column API, so it is not
# re-parsed as a SQL string literal the way it is inside functions.expr
df = df.withColumn('is_match', col("url").rlike(my_regex))
df.show(truncate=False)
# +--------------------------------------------+---+--------+
# |url                                         |val|is_match|
# +--------------------------------------------+---+--------+
# |http://daasd.com/text.csv?c=uss1zhv1imikb4w |2  |true    |
# |http://oasnd.com/car.csv?c=913fh7n83n19ms98 |4  |false   |
# |http://dunfdas.com/bread.csv?c=4968698835   |8  |false   |
# |http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs|7  |false   |
# |http://daosj.com/text.csv?c=h7hgk1r3o3      |1  |true    |
# |http://daosj.com/text.csv?c=h7hg            |1  |false   |
# +--------------------------------------------+---+--------+
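As a note on why this probably helps: the = sign is most likely not the culprit. Inside functions.expr the pattern is first read as a Spark SQL string literal, and with default parser settings that step appears to consume the backslash in \?, so the regex engine receives an unescaped ? instead. The sketch below reproduces the effect with Python's re module as a stand-in for Spark's Java regex engine (an assumption; the two agree on this simple pattern). The names matches and escape_for_sql_literal are hypothetical helpers, not part of PySpark.

```python
import re

urls = [
    "http://daasd.com/text.csv?c=uss1zhv1imikb4w",
    "http://oasnd.com/car.csv?c=913fh7n83n19ms98",
    "http://dunfdas.com/bread.csv?c=4968698835",
    "http://dasuugfb.com/meat.csv?c=0uhkmr9dvs3hs",
    "http://daosj.com/text.csv?c=h7hgk1r3o3",
    "http://daosj.com/text.csv?c=h7hg",
]

my_regex = r'text.csv\?[a-z]+=[a-zA-Z0-9]{10,25}$'

# Assumption: the SQL string-literal parser turns "\?" into "?",
# so "csv\?" (a literal question mark) becomes "csv?" (an optional "v").
pattern_after_sql_parsing = my_regex.replace("\\?", "?")


def matches(pattern, values):
    """Return one boolean per value: does the pattern occur anywhere in it?"""
    return [re.search(pattern, v) is not None for v in values]


def escape_for_sql_literal(pattern):
    """Hypothetical helper: double each backslash so it survives SQL parsing."""
    return pattern.replace("\\", "\\\\")


intended = matches(my_regex, urls)
seen_inside_expr = matches(pattern_after_sql_parsing, urls)

# Expression text that keeps the backslash after the SQL literal is parsed
sql_text = f"regexp_extract(url, '{escape_for_sql_literal(my_regex)}', 0) != ''"

print(intended)          # the matches the question asks for
print(seen_inside_expr)  # what the stripped pattern yields
print(sql_text)
```

If that assumption holds, passing sql_text to functions.expr in the question's code should give the desired result. Calling functions.regexp_extract(col("url"), my_regex, 0) through the DataFrame API should sidestep the SQL parser in the same way rlike does.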
