Tricky harmonizing of ID columns across rows in a Spark DataFrame

Question

I have a set of rows, where each event row is uniquely identified by "EventId". A set of events belongs to a group, identified by "GUID" and "WFID". The problem is that most events do not carry both IDs in the same row.

An example is below. Only "WF3" has both a "GUID" and a "WFID". From this row, the IDs need to be harmonized across the other candidate events (WF1 to WF6):

val df = Seq(
  ("GUID1", "",      "WF1", "01-01-2023"),
  ("GUID1", "",      "WF2", "01-02-2023"),
  ("GUID1", "WFID1", "WF3", "01-03-2023"),
  ("GUID1", "",      "WF4", "01-04-2023"),
  ("",      "WFID1", "WF5", "01-05-2023"),
  ("GUID1", "",      "WF6", "01-06-2023"),
  ("GUID2", "",      "WF7", "01-07-2023"),
  ("",      "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")
df.show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|     |    WF1|01-01-2023|
|GUID1|     |    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|     |    WF4|01-04-2023|
|     |WFID1|    WF5|01-05-2023|
|GUID1|     |    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

The requirement is to propagate the GUID and WFID across all candidate events so that every event in a group carries the same GUID and WFID.
The expected output for the above example is:

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

Any idea how this can be implemented in Spark without using a UDF?

Answer 1

Score: 0

Here is a working solution. Let me know if you have a solution that avoids the join!

// Distinct (GUID, WFID) pairs from rows where both IDs are present
val dfDistinct = df
  .filter(col("GUID") =!= "" && col("WFID") =!= "")
  .select(col("GUID").as("GUID1"), col("WFID").as("WFID1"))
  .distinct()

// Left-join every event to its fully identified pair on either ID,
// then prefer the joined-in IDs over the (possibly empty) originals
df.join(dfDistinct, df("GUID") === dfDistinct("GUID1") || df("WFID") === dfDistinct("WFID1"), "left")
  .withColumn("GUIDnew", when(col("GUID1").isNotNull, col("GUID1")).otherwise(col("GUID")))
  .withColumn("WFIDnew", when(col("WFID1").isNotNull, col("WFID1")).otherwise(col("WFID")))
  .select(col("GUIDnew").as("GUID"), col("WFIDnew").as("WFID"), col("EventId"), col("Time"))
  .show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+
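
As a side note, here is a possible join-free variant. This is only a sketch, not part of the original answer, and it assumes that missing IDs are empty strings and that every group is linked in at most one hop, i.e. a row missing one ID always shares its other ID with a row where both are filled, as in the example. Two window passes then fill the gaps; max serves as a "pick any non-empty value" aggregate because "" sorts before every real ID:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Pass 1: fill WFID from any row sharing the same non-empty GUID.
// Pass 2: fill GUID from any row sharing the same (now filled) non-empty WFID.
val harmonized = df
  .withColumn("WFID",
    when(col("GUID") =!= "", max("WFID").over(Window.partitionBy("GUID")))
      .otherwise(col("WFID")))
  .withColumn("GUID",
    when(col("WFID") =!= "", max("GUID").over(Window.partitionBy("WFID")))
      .otherwise(col("GUID")))

harmonized.orderBy("EventId").show

If identities can chain across several rows (say GUID_A with WFID_1 in one row and GUID_B with WFID_1 in another), neither this sketch nor the single join above is enough; that becomes a connected-components problem, for which an iterative approach or GraphFrames' connectedComponents is the usual tool.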
