How to trim the column values using Spark Dataframe

Question

**I have a dataframe like the one below, and I need to trim the values in the SCHDULE column using a Spark dataframe.**
*I tried with UDF functions, but I didn't get the expected output (a sketch of such an attempt is shown after the tables).*

| SCHDULE   |  ID   | VALUE |
| --------  | ------|-------|
|100H/10AR1 | KL01  | 30    |
|100H/10TR2 | KL01  | 40    |
|100H/22TR1 | KL01  | 20    |
|100H/22TR2 | KL01  | 20    |
|105JK/12PK1| AA05  | 10    |
|105JK/12PK2| AA05  | 20    |
|105JH/33PK3| AA05  | 50    |
|105JH/33PK4| AA05  | 30    |
|110P/1     | BR03  | 20    |
|110P/2     | BR03  | 10    |


**I need output like the dataframe below; can anyone please help me with this?**

| SCHDULE   |  ID   | VALUE |
| ----------| ------|-------|
|100H/10AR1 | KL01  | 30    |
|100H/10TR2 | KL01  | 40    |
|100H/22TR1 | KL01  | 20    |
|100H/22TR2 | KL01  | 20    |
|105JK/12PK1| AA05  | 10    |
|105JK/12PK2| AA05  | 20    |
|105JH/33PK3| AA05  | 50    |
|105JH/33PK4| AA05  | 30    |
|110P/1     | BR03  | 20    |
|110P/2     | BR03  | 10    |
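
The UDF attempt is not shown in the original question. Purely for reference, a minimal sketch of what such an attempt might look like follows; the trimming rule (keep the prefix up to and including the digits after the slash) is inferred from the accepted answer below, and the names `trimSchedule` and `df` are assumptions:

    import org.apache.spark.sql.functions.{col, udf}

    // Assumed trimming rule: keep the prefix up to and including the digits
    // that follow the slash, e.g. "100H/10AR1" -> "100H/10", "110P/1" -> "110P/1".
    val trimSchedule = udf { (s: String) =>
      if (s == null) null
      else """^\d+[A-Z]*/\d+""".r.findFirstIn(s).getOrElse(s)
    }

    // df is the dataframe shown above
    val trimmedDf = df.withColumn("SCHDULE", trimSchedule(col("SCHDULE")))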

Answer 1

Score: 0


You probably don't need a UDF here; use a function from the Spark API instead. In this case, regexp_extract may be useful. Below you can find sample code and the regexp.

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for .toDF and the $ column syntax; predefined in spark-shell

    val inputData = Seq(
      "100H/10AR1",
      "105J/33PK4",
      "110P/1"
    )

    val inputDf = inputData.toDF("SCHDULE")

    // Capture the leading digits, optional letters, the slash, and the digits
    // after it (group 1); everything beyond that is dropped. [A-Z]* (instead
    // of [A-Z]?) also matches multi-letter values such as "105JK/12PK1"
    // from the question's data.
    inputDf.withColumn("Trimmed", regexp_extract($"SCHDULE", """^(\d+[A-Z]*/\d+).*""", 1)).show()

Output:

    +----------+-------+
    |   SCHDULE|Trimmed|
    +----------+-------+
    |100H/10AR1|100H/10|
    |105J/33PK4|105J/33|
    |    110P/1| 110P/1|
    +----------+-------+
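
To apply the same idea to the full dataframe from the question (keeping ID and VALUE untouched), the extracted value can overwrite SCHDULE in place. A minimal sketch, assuming the dataframe is named `df`:

    import org.apache.spark.sql.functions.{col, regexp_extract}

    // Overwrite SCHDULE with the extracted prefix; ID and VALUE pass through.
    // Note: regexp_extract returns an empty string for rows where the
    // pattern does not match.
    val trimmedDf = df.withColumn(
      "SCHDULE",
      regexp_extract(col("SCHDULE"), """^(\d+[A-Z]*/\d+).*""", 1)
    )
    trimmedDf.show()

Sticking with the built-in regexp_extract rather than a UDF also lets the Catalyst optimizer see into the expression; a UDF is a black box to it.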
