How to trim the column values using Spark Dataframe

Question

**I have a dataframe like the one below, and I need to trim the values in the SCHDULE column using a Spark dataframe.**
*I tried with UDF functions, but I didn't get the expected output (a sketch of such an attempt is shown after the tables).*

| SCHDULE   |  ID   | VALUE |
| --------  | ------|-------|
|100H/10AR1 | KL01  | 30    |
|100H/10TR2 | KL01  | 40    |
|100H/22TR1 | KL01  | 20    |
|100H/22TR2 | KL01  | 20    |
|105JK/12PK1| AA05  | 10    |
|105JK/12PK2| AA05  | 20    |
|105JH/33PK3| AA05  | 50    |
|105JH/33PK4| AA05  | 30    |
|110P/1     | BR03  | 20    |
|110P/2     | BR03  | 10    |


**I need output like the dataframe below; can anyone please help me with this?**

| SCHDULE   |  ID   | VALUE |
| ----------| ------|-------|
|100H/10AR1 | KL01  | 30    |
|100H/10TR2 | KL01  | 40    |
|100H/22TR1 | KL01  | 20    |
|100H/22TR2 | KL01  | 20    |
|105JK/12PK1| AA05  | 10    |
|105JK/12PK2| AA05  | 20    |
|105JH/33PK3| AA05  | 50    |
|105JH/33PK4| AA05  | 30    |
|110P/1     | BR03  | 20    |
|110P/2     | BR03  | 10    |
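
The UDF attempt is not shown in the original question. Purely for reference, a minimal sketch of what such an attempt might look like follows; the trimming rule (keep the prefix up to and including the digits after the slash) is inferred from the accepted answer below, and the names `trimSchedule` and `df` are assumptions:

    import org.apache.spark.sql.functions.{col, udf}

    // Assumed trimming rule: keep the prefix up to and including the digits
    // that follow the slash, e.g. "100H/10AR1" -> "100H/10", "110P/1" -> "110P/1".
    val trimSchedule = udf { (s: String) =>
      if (s == null) null
      else """^\d+[A-Z]*/\d+""".r.findFirstIn(s).getOrElse(s)
    }

    // df is the dataframe shown above
    val trimmedDf = df.withColumn("SCHDULE", trimSchedule(col("SCHDULE")))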

Answer 1

Score: 0


You probably don't need a UDF here; use a function from the Spark API instead. In this case, regexp_extract may be useful. Below you can find sample code and the regexp.

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for .toDF and the $ column syntax; predefined in spark-shell

    val inputData = Seq(
      "100H/10AR1",
      "105J/33PK4",
      "110P/1"
    )

    val inputDf = inputData.toDF("SCHDULE")

    // Capture the leading digits, optional letters, the slash, and the digits
    // after it (group 1); everything beyond that is dropped. [A-Z]* (instead
    // of [A-Z]?) also matches multi-letter values such as "105JK/12PK1"
    // from the question's data.
    inputDf.withColumn("Trimmed", regexp_extract($"SCHDULE", """^(\d+[A-Z]*/\d+).*""", 1)).show()

Output:

    +----------+-------+
    |   SCHDULE|Trimmed|
    +----------+-------+
    |100H/10AR1|100H/10|
    |105J/33PK4|105J/33|
    |    110P/1| 110P/1|
    +----------+-------+
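
To apply the same idea to the full dataframe from the question (keeping ID and VALUE untouched), the extracted value can overwrite SCHDULE in place. A minimal sketch, assuming the dataframe is named `df`:

    import org.apache.spark.sql.functions.{col, regexp_extract}

    // Overwrite SCHDULE with the extracted prefix; ID and VALUE pass through.
    // Note: regexp_extract returns an empty string for rows where the
    // pattern does not match.
    val trimmedDf = df.withColumn(
      "SCHDULE",
      regexp_extract(col("SCHDULE"), """^(\d+[A-Z]*/\d+).*""", 1)
    )
    trimmedDf.show()

Sticking with the built-in regexp_extract rather than a UDF also lets the Catalyst optimizer see into the expression; a UDF is a black box to it.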
