英文:
Tricky harmonizing of ID columns across rows in Spark Dataframe
问题
以下是翻译好的部分:
我有一组行数据,其中每个事件行都通过"EventId"唯一标识。一组事件属于一个由"GUID"和"WFID"标识的组。
问题是,大多数事件不会在同一个事件中同时获得这两个标识。
以下是一个示例。只有"WF3"同时具有"GUID"和"WFID"。因此,需要在其他候选事件(WF1到WF6)之间协调这些标识:
val df= Seq(
("GUID1", "", "WF1", "01-01-2023"),
("GUID1", "", "WF2", "01-02-2023"),
("GUID1", "WFID1", "WF3", "01-03-2023"),
("GUID1", "", "WF4", "01-04-2023"),
("", "WFID1", "WF5", "01-05-2023"),
("GUID1", "", "WF6", "01-06-2023"),
("GUID2", "", "WF7", "01-07-2023"),
("", "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")
df.show
需求是获取所有候选事件中的GUID和WFID,以便事件组具有相同的GUID和WFID。
在上述示例中,预期的输出应该是:
+-----+-----+-------+----------+
| GUID| WFID|EventId| Time|
+-----+-----+-------+----------+
|GUID1|WFID1| WF1|01-01-2023|
|GUID1|WFID1| WF2|01-02-2023|
|GUID1|WFID1| WF3|01-03-2023|
|GUID1|WFID1| WF4|01-04-2023|
|GUID1|WFID1| WF5|01-05-2023|
|GUID1|WFID1| WF6|01-06-2023|
|GUID2| | WF7|01-07-2023|
| |WFID2| WF8|01-08-2023|
+-----+-----+-------+----------+
有没有想过如何在Spark中实现这个需求而不使用UDF?
英文:
I have a set of rows, where each event row is uniquely identified by "EventId". A set of events belong to a group, identified by "GUID" and "WFID".
Problem is, most of the events do not get both the IDs together in the same event.
An example is below. Only "WF3" has both "GUID" and "WFID". From this, the IDs need to be harmonized across other candidate events (WF1 to WF6):
val df= Seq(
("GUID1", "", "WF1", "01-01-2023"),
("GUID1", "", "WF2", "01-02-2023"),
("GUID1", "WFID1", "WF3", "01-03-2023"),
("GUID1", "", "WF4", "01-04-2023"),
("" , "WFID1", "WF5", "01-05-2023"),
("GUID1", "", "WF6", "01-06-2023"),
("GUID2", "", "WF7", "01-07-2023"),
("", "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")
df.show
+-----+-----+-------+----------+
| GUID| WFID|EventId| Time|
+-----+-----+-------+----------+
|GUID1| | WF1|01-01-2023|
|GUID1| | WF2|01-02-2023|
|GUID1|WFID1| WF3|01-03-2023|
|GUID1| | WF4|01-04-2023|
| |WFID1| WF5|01-05-2023|
|GUID1| | WF6|01-06-2023|
|GUID2| | WF7|01-07-2023|
| |WFID2| WF8|01-08-2023|
+-----+-----+-------+----------+
The requirement is to get the GUID and WFID across all candidate events so that the group of events have the same GUID and WFID.
The expected output in the above example should be :
+-----+-----+-------+----------+
| GUID| WFID|EventId| Time|
+-----+-----+-------+----------+
|GUID1|WFID1| WF1|01-01-2023|
|GUID1|WFID1| WF2|01-02-2023|
|GUID1|WFID1| WF3|01-03-2023|
|GUID1|WFID1| WF4|01-04-2023|
|GUID1|WFID1| WF5|01-05-2023|
|GUID1|WFID1| WF6|01-06-2023|
|GUID2| | WF7|01-07-2023|
| |WFID2| WF8|01-08-2023|
+-----+-----+-------+----------+
Any idea how this can be implemented in Spark without using UDF?
答案1
得分: 0
以下是工作解决方案。如果有不需要执行任何连接操作的解决方案,请告诉我!
val dfDistinct = df.filter(col("GUID") =!= "" && col("WFID") =!= "").select(col("GUID").as("GUID1"), col("WFID").as("WFID1")).distinct()
df.join(dfDistinct, df("GUID") === dfDistinct("GUID1") || df("WFID") === dfDistinct("WFID1"), "left")
.withColumn("GUIDnew", when(col("GUID1").isNotNull, col("GUID1")).otherwise(col("GUID")))
.withColumn("WFIDnew", when(col("WFID1").isNotNull, col("WFID1")).otherwise(col("WFID")))
.select(col("GUIDnew").as("GUID"), col("WFIDnew").as("WFID"), col("EventId"), col("Time"))
.show
+-----+-----+-------+----------+
| GUID| WFID|EventId| Time|
+-----+-----+-------+----------+
|GUID1|WFID1| WF1|01-01-2023|
|GUID1|WFID1| WF2|01-02-2023|
|GUID1|WFID1| WF3|01-03-2023|
|GUID1|WFID1| WF4|01-04-2023|
|GUID1|WFID1| WF5|01-05-2023|
|GUID1|WFID1| WF6|01-06-2023|
|GUID2| | WF7|01-07-2023|
| |WFID2| WF8|01-08-2023|
+-----+-----+-------+----------+
英文:
Here is the working solution. Let me know if have a solution without doing any join!
val dfDistinct = df.filter(col("GUID") =!= "" && col("WFID") =!= "").select(col("GUID").as("GUID1"), col("WFID").as("WFID1")).distinct()
df.join(dfDistinct, df("GUID") === dfDistinct("GUID1") || df("WFID") === dfDistinct("WFID1"), "left")
.withColumn("GUIDnew", when(col("GUID1").isNotNull, col("GUID1")).otherwise(col("GUID")))
.withColumn("WFIDnew", when(col("WFID1").isNotNull, col("WFID1")).otherwise(col("WFID")))
.select(col("GUIDnew").as("GUID"), col("WFIDnew").as("WFID"), col("EventId"), col("Time"))
.show
+-----+-----+-------+----------+
| GUID| WFID|EventId| Time|
+-----+-----+-------+----------+
|GUID1|WFID1| WF1|01-01-2023|
|GUID1|WFID1| WF2|01-02-2023|
|GUID1|WFID1| WF3|01-03-2023|
|GUID1|WFID1| WF4|01-04-2023|
|GUID1|WFID1| WF5|01-05-2023|
|GUID1|WFID1| WF6|01-06-2023|
|GUID2| | WF7|01-07-2023|
| |WFID2| WF8|01-08-2023|
+-----+-----+-------+----------+
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论