Apply the Java function URLDecoder.decode to a whole column in Spark 3


Question


I have a dataframe column containing URL-encoded strings, such as the ones in the mystring column below.

I would like to do something like this:

someDF.withColumn('newcol', URLDecoder.decode( col("mystring"), "utf-8" ))
someDF.show()
|         mystring         |         newcol      |
--------------------------------------------------
| ThisIs%201rstString      | ThisIs 1rstString   |        
| This%20is%3Ethisone      | This is>thisone     |
| and%20so%20one           | and so one          |

How should I do such a thing? I guess the map function is around the corner, but I can't figure out how to use it.

Note: this is just a sample; creating multiple replace statements is not an option, as there are many other encoded characters and the list may vary. I'd like a simple, reliable method to do this.

Answer 1

Score: 8


You can try the Spark SQL built-in function reflect:

> reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

df = spark.createDataFrame([(e,) for e in ["ThisIs%201rstString", "This%20is%3Ethisone", "and%20so%20one"]], ["mystring"])

df.selectExpr("*", "reflect('java.net.URLDecoder','decode', mystring, 'utf-8') as newcol").show()

+-------------------+-----------------+
|           mystring|           newcol|
+-------------------+-----------------+
|ThisIs%201rstString|ThisIs 1rstString|
|This%20is%3Ethisone|  This is>thisone|
|     and%20so%20one|       and so one|
+-------------------+-----------------+

Note: the above is Python code; you should be able to do the same in Scala.
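
For reference, here is a minimal Scala sketch of the same reflect() approach (assumptions: a SparkSession available as spark, and the sample data recreated locally):

import spark.implicits._

// Recreate the sample column, then call the built-in reflect() via selectExpr
val df = Seq("ThisIs%201rstString", "This%20is%3Ethisone", "and%20so%20one").toDF("mystring")
df.selectExpr("*", "reflect('java.net.URLDecoder', 'decode', mystring, 'utf-8') as newcol").show()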

Answer 2

Score: 1


Create a UDF that performs the work

import java.net.URLDecoder
import org.apache.spark.sql.functions.{col, udf}

def decode(in: String) = URLDecoder.decode(in, "utf-8")   // plain Scala decoder
val decode_udf = udf(decode(_))                           // wrap it as a Spark UDF
df.withColumn("newcol", decode_udf(col("mystring"))).show()

This prints the expected result.
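
If the column may contain nulls, one possible refinement is to wrap the decode in Option so that null inputs come back as null rather than being passed to URLDecoder.decode (a sketch against the same df and column name as above; decode_udf_safe is just an illustrative name):

import java.net.URLDecoder
import org.apache.spark.sql.functions.{col, udf}

// Option(null) is None, so null values map back to null instead of hitting the decoder
val decode_udf_safe = udf((s: String) => Option(s).map(URLDecoder.decode(_, "utf-8")))
df.withColumn("newcol", decode_udf_safe(col("mystring"))).show()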
