How to trim the column values using Spark Dataframe
Question

**I have a dataframe like the one below, and I need to trim the values in the SCHDULE column using a Spark DataFrame.**
*I tried with UDF functions but I did not get the expected output (a rough sketch of such an attempt is shown after the tables below).*
| SCHDULE | ID | VALUE |
| -------- | ------|-------|
|100H/10AR1 | KL01 | 30 |
|100H/10TR2 | KL01 | 40 |
|100H/22TR1 | KL01 | 20 |
|100H/22TR2 | KL01 | 20 |
|105JK/12PK1| AA05 | 10 |
|105JK/12PK2| AA05 | 20 |
|105JH/33PK3| AA05 | 50 |
|105JH/33PK4| AA05 | 30 |
|110P/1 | BR03 | 20 |
|110P/2 | BR03 | 10 |
**I need the output like the dataframe below. Can anyone please help me with this?**
| SCHDULE | ID | VALUE |
| ----------| ------|-------|
|100H/10AR1 | KL01 | 30 |
|100H/10TR2 | KL01 | 40 |
|100H/22TR1 | KL01 | 20 |
|100H/22TR2 | KL01 | 20 |
|105JK/12PK1| AA05 | 10 |
|105JK/12PK2| AA05 | 20 |
|105JH/33PK3| AA05 | 50 |
|105JH/33PK4| AA05 | 30 |
|110P/1 | BR03 | 20 |
|110P/2 | BR03 | 10 |
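The UDF attempt itself is not shown, and since the expected output above looks identical to the input it is not entirely clear what kind of trimming is intended. Purely as an illustration, here is a rough sketch of what a whitespace-trimming UDF attempt might look like; the column names come from the tables above, and a SparkSession named `spark` (e.g. in spark-shell) is assumed:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. spark-shell)

// A few rows from the table above
val df = Seq(
  ("100H/10AR1", "KL01", 30),
  ("105JK/12PK1", "AA05", 10),
  ("110P/2", "BR03", 10)
).toDF("SCHDULE", "ID", "VALUE")

// Hypothetical reconstruction of the UDF attempt: strip surrounding whitespace
// from each SCHDULE value (null-safe).
val trimUdf = udf((s: String) => Option(s).map(_.trim).orNull)

df.withColumn("SCHDULE", trimUdf($"SCHDULE")).show(false)
```

For plain whitespace removal, Spark also ships a built-in `trim` function in `org.apache.spark.sql.functions`, which avoids the UDF entirely.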
Answer 1
Score: 0
You probably don't need a UDF here; use a function from the Spark API. In this case `regexp_extract` may be useful. Below you can find sample code and the regexp:
```scala
import org.apache.spark.sql.functions._

val inputData = Seq(
  "100H/10AR1",
  "105J/33PK4",
  "110P/1"
)

val inputDf = inputData.toDF("SCHDULE")

// Capture group 1 keeps the leading digits, an optional letter, the slash and the
// digits after it; everything beyond that is dropped.
inputDf.withColumn("Trimmed", regexp_extract($"SCHDULE", """^(\d+[A-Z]?\/\d+).*""", 1)).show
```
Output:
```
+----------+-------+
|   SCHDULE|Trimmed|
+----------+-------+
|100H/10AR1|100H/10|
|105J/33PK4|105J/33|
|    110P/1| 110P/1|
+----------+-------+
```
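Applied to the full table from the question (SCHDULE, ID, VALUE), the same idea can overwrite SCHDULE in place. A minimal sketch, assuming a SparkSession named `spark` and using `[A-Z]*` instead of the answer's `[A-Z]?` so that prefixes with two letters before the slash (such as `105JK/12PK1` in the question's data) are also matched:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

// Rows taken from the question's table
val df = Seq(
  ("100H/10AR1",  "KL01", 30),
  ("105JK/12PK1", "AA05", 10),
  ("110P/1",      "BR03", 20)
).toDF("SCHDULE", "ID", "VALUE")

// Overwrite SCHDULE with the captured prefix; [A-Z]* also covers the
// two-letter prefixes (e.g. 105JK) that appear in the question's data.
val result = df.withColumn(
  "SCHDULE",
  regexp_extract($"SCHDULE", """^(\d+[A-Z]*/\d+).*""", 1)
)

result.show(false)
```

With this tweak, `100H/10AR1` becomes `100H/10`, `105JK/12PK1` becomes `105JK/12`, and `110P/1` stays unchanged.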