Replace column value substring with hash of substring in PySpark

Question

I have a dataframe with a column containing a description including customer ids which I need to replace with their sha2 hashed version.

Example: the column value "X customer 0013120109 in country AU." should be turned into "X customer d8e824e6a2d5b32830c93ee0ca690ac6cb976cc51706b1a856cd1a95826bebd in country AU."

MRE:

```python
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col, sha2, regexp_replace, lit, concat
from pyspark.sql.types import LongType, StringType, StructField, StructType

data = [
    [1, "Sold device 11312."],
    [2, "X customer 0013120109 in country AU."],
    [3, "Y customer 0013140033 in country BR."],
]
schema = StructType(
    [
        StructField(name="Id", dataType=LongType()),
        StructField(name="Description", dataType=StringType()),
    ]
)
df = spark.createDataFrame(data=data, schema=schema)
```

My attempted solution was to use regexp_replace in combination with regexp_extract, but regexp_replace expects a concrete string as the replacement value, while my replacement value would be dynamic (a Column):

```python
# Fails: here regexp_replace expects a literal string replacement,
# not a Column expression.
df = (
    df
    .withColumn(
        "Description",
        regexp_replace(
            "Description",
            r"customer \d+",
            concat(
                lit("customer "),
                sha2(regexp_extract("Description", r".* customer (\d+) .*", 1), 256),
            ),
        ),
    )
)
```

PS: I really want to avoid UDFs, since the serialization back and forth between the JVM and Python is a huge performance degradation.
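For reference, Spark's sha2(col, 256) produces the same lowercase hex digest as Python's standard hashlib.sha256, so the target value for a given customer id can be checked locally (a minimal sketch outside Spark):

```python
import hashlib

def sha256_hex(s: str) -> str:
    # Same digest Spark's sha2(col, 256) computes, as a lowercase hex string.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

digest = sha256_hex("0013120109")
print(digest)  # 64 hex characters
```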


Answer 1

Score: 0


Managed to find a solution using a combination of concat, substr and expr:

```python
from pyspark.sql.functions import col, concat, expr, lit, regexp_extract, sha2, when

df = (
    df
    .withColumn(
        "regexp_extract",
        regexp_extract("Description", r".* customer (\d+) .*", 1),
    )
    .withColumn(
        "NewColumn",
        when(
            expr("length(regexp_extract)") > lit(1),
            # Concatenate the substrings: [before_id, hashed_id, after_id].
            concat(
                # From the start of the column up to the index of the extract
                # (substr and instr are 1-based).
                col("Description").substr(lit(1), expr("instr(Description, regexp_extract)") - lit(1)),
                # Hashed extract.
                sha2(regexp_extract("Description", r".* customer (\d+) .*", 1), 256),
                # Substr from (index of the extract + length of the extract)
                # to the end of the column.
                col("Description").substr(
                    expr("instr(Description, regexp_extract)") + expr("length(regexp_extract)"),
                    expr("length(Description)"),
                ),
            ),
        ).otherwise(col("Description")),
    )
    .drop("regexp_extract")
)
```

Answer 2

Score: 0


Here are my 2 cents:

  • The approach is quite simple: split the string into 3 parts:

    1. Everything before the customer id
    2. The customer id
    3. Everything after the customer id
  • Then hash (mask) the customer id and concatenate all 3 parts.

Code:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, regexp_extract, sha2, when

pattern = r"^(.* customer) (\d+)(.*)$"

df_final = (
    df
    .withColumn(
        "regexp_extract1",
        when(regexp_extract("Description", pattern, 1) != "",
             regexp_extract("Description", pattern, 1))
        .otherwise(col("Description")),
    )
    .withColumn("regexp_extract2", regexp_extract("Description", pattern, 2))
    .withColumn("regexp_extract3", regexp_extract("Description", pattern, 3))
    .withColumn(
        "extract2_sha2",
        when(col("regexp_extract2") != "", sha2("regexp_extract2", 256)).otherwise(""),
    )
    .withColumn(
        "Masked Description",
        F.concat(col("regexp_extract1"), lit(" "), col("extract2_sha2"), col("regexp_extract3")),
    )
    .drop("regexp_extract1", "regexp_extract2", "regexp_extract3", "extract2_sha2")
)
df_final.show(truncate=False)
```

Output:

```
+---+------------------------------------+------------------------------------------------------------------------------------------+
|Id |Description                         |Masked Description                                                                        |
+---+------------------------------------+------------------------------------------------------------------------------------------+
|1  |Sold device 11312.                  |Sold device 11312.                                                                        |
|2  |X customer 0013120109 in country AU.|X customer d8e824e6a2d5b32830c93ee0ca690ac6cb976cc51706b1a856cd1a95826bebdb in country AU.|
|3  |Y customer 0013140033 in country BR.|Y customer 2f4ab0aeb1f3332b8b9ccdd3a9fca759f267074e6621e5362acdd6f22211f167 in country BR.|
+---+------------------------------------+------------------------------------------------------------------------------------------+
```
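The three capture groups that the pattern produces can be sanity-checked outside Spark with Python's re module (same regex as above):

```python
import re

pattern = r"^(.* customer) (\d+)(.*)$"
m = re.match(pattern, "X customer 0013120109 in country AU.")

# Group 1: everything up to and including "customer".
# Group 2: the customer id.
# Group 3: the rest of the line.
print(m.group(1), "|", m.group(2), "|", m.group(3))
```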

huangapple
  • Posted on 2023-03-07 18:23:00
  • When reposting, please keep this link: https://go.coder-hub.com/75660706.html