Replace column value substring with hash of substring in PySpark

Question

I have a dataframe with a column containing a description including customer ids which I need to replace with their sha2 hashed version.

Example: the column value "X customer 0013120109 in country AU." should be turned into "X customer d8e824e6a2d5b32830c93ee0ca690ac6cb976cc51706b1a856cd1a95826bebd in country AU."

MRE:

```python
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import col, sha2, regexp_replace, lit, concat
from pyspark.sql.types import LongType, StringType, StructField, StructType

data = [
    [1, "Sold device 11312."],
    [2, "X customer 0013120109 in country AU."],
    [3, "Y customer 0013140033 in country BR."],
]
schema = StructType(
    [
        StructField(name="Id", dataType=LongType()),
        StructField(name="Description", dataType=StringType()),
    ]
)
df = spark.createDataFrame(data=data, schema=schema)
```

My attempted solution was to use regexp_replace in combination with regexp_extract, but regexp_replace expects a concrete string as the replacement value, while my replacement value would be dynamic (a Column):

```python
# Fails: here regexp_replace expects a literal string replacement,
# not a Column expression.
df = (
    df
    .withColumn(
        "Description",
        regexp_replace(
            "Description",
            r"customer \d+",
            concat(
                lit("customer "),
                sha2(regexp_extract("Description", r".* customer (\d+) .*", 1), 256),
            ),
        ),
    )
)
```

PS: I really want to avoid UDFs, since the serialization back and forth between the JVM and Python is a huge performance degradation.
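For reference, Spark's sha2(col, 256) produces the same lowercase hex digest as Python's standard hashlib.sha256, so the target value for a given customer id can be checked locally (a minimal sketch outside Spark):

```python
import hashlib

def sha256_hex(s: str) -> str:
    # Same digest Spark's sha2(col, 256) computes, as a lowercase hex string.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

digest = sha256_hex("0013120109")
print(digest)  # 64 hex characters
```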


Answer 1

Score: 0


Managed to find a solution using a combination of concat, substr and expr:

```python
from pyspark.sql.functions import col, concat, expr, lit, regexp_extract, sha2, when

df = (
    df
    .withColumn(
        "regexp_extract",
        regexp_extract("Description", r".* customer (\d+) .*", 1),
    )
    .withColumn(
        "NewColumn",
        when(
            expr("length(regexp_extract)") > lit(1),
            # Concatenate the substrings: [before_id, hashed_id, after_id].
            concat(
                # From the start of the column up to the index of the extract
                # (substr and instr are 1-based).
                col("Description").substr(lit(1), expr("instr(Description, regexp_extract)") - lit(1)),
                # Hashed extract.
                sha2(regexp_extract("Description", r".* customer (\d+) .*", 1), 256),
                # Substr from (index of the extract + length of the extract)
                # to the end of the column.
                col("Description").substr(
                    expr("instr(Description, regexp_extract)") + expr("length(regexp_extract)"),
                    expr("length(Description)"),
                ),
            ),
        ).otherwise(col("Description")),
    )
    .drop("regexp_extract")
)
```

Answer 2

Score: 0


Here are my 2 cents:

  • The approach is quite simple: split the string into 3 parts:

    1. Everything before the customer id
    2. The customer id
    3. Everything after the customer id
  • Then hash (mask) the customer id and concatenate all 3 parts.

Code:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, regexp_extract, sha2, when

pattern = r"^(.* customer) (\d+)(.*)$"

df_final = (
    df
    .withColumn(
        "regexp_extract1",
        when(regexp_extract("Description", pattern, 1) != "",
             regexp_extract("Description", pattern, 1))
        .otherwise(col("Description")),
    )
    .withColumn("regexp_extract2", regexp_extract("Description", pattern, 2))
    .withColumn("regexp_extract3", regexp_extract("Description", pattern, 3))
    .withColumn(
        "extract2_sha2",
        when(col("regexp_extract2") != "", sha2("regexp_extract2", 256)).otherwise(""),
    )
    .withColumn(
        "Masked Description",
        F.concat(col("regexp_extract1"), lit(" "), col("extract2_sha2"), col("regexp_extract3")),
    )
    .drop("regexp_extract1", "regexp_extract2", "regexp_extract3", "extract2_sha2")
)
df_final.show(truncate=False)
```

Output:

```
+---+------------------------------------+------------------------------------------------------------------------------------------+
|Id |Description                         |Masked Description                                                                        |
+---+------------------------------------+------------------------------------------------------------------------------------------+
|1  |Sold device 11312.                  |Sold device 11312.                                                                        |
|2  |X customer 0013120109 in country AU.|X customer d8e824e6a2d5b32830c93ee0ca690ac6cb976cc51706b1a856cd1a95826bebdb in country AU.|
|3  |Y customer 0013140033 in country BR.|Y customer 2f4ab0aeb1f3332b8b9ccdd3a9fca759f267074e6621e5362acdd6f22211f167 in country BR.|
+---+------------------------------------+------------------------------------------------------------------------------------------+
```
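The three capture groups that the pattern produces can be sanity-checked outside Spark with Python's re module (same regex as above):

```python
import re

pattern = r"^(.* customer) (\d+)(.*)$"
m = re.match(pattern, "X customer 0013120109 in country AU.")

# Group 1: everything up to and including "customer".
# Group 2: the customer id.
# Group 3: the rest of the line.
print(m.group(1), "|", m.group(2), "|", m.group(3))
```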

huangapple
  • Posted on 2023-03-07 18:23:00
  • When reposting, please keep this link: https://go.coder-hub.com/75660706.html