Tricky harmonizing of ID columns across rows in a Spark DataFrame

Question

I have a set of rows, where each event row is uniquely identified by "EventId". A set of events belongs to a group, identified by "GUID" and "WFID". The problem is that most events do not carry both IDs in the same row.

An example is below. Only "WF3" has both a "GUID" and a "WFID". From this row, the IDs need to be harmonized across the other candidate events (WF1 to WF6):

val df = Seq(
  ("GUID1", "",      "WF1", "01-01-2023"),
  ("GUID1", "",      "WF2", "01-02-2023"),
  ("GUID1", "WFID1", "WF3", "01-03-2023"),
  ("GUID1", "",      "WF4", "01-04-2023"),
  ("",      "WFID1", "WF5", "01-05-2023"),
  ("GUID1", "",      "WF6", "01-06-2023"),
  ("GUID2", "",      "WF7", "01-07-2023"),
  ("",      "WFID2", "WF8", "01-08-2023")
).toDF("GUID", "WFID", "EventId", "Time")
df.show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|     |    WF1|01-01-2023|
|GUID1|     |    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|     |    WF4|01-04-2023|
|     |WFID1|    WF5|01-05-2023|
|GUID1|     |    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

The requirement is to propagate the GUID and WFID across all candidate events so that every event in a group carries the same GUID and WFID.
The expected output for the above example is:

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+

Any idea how this can be implemented in Spark without using a UDF?

Answer 1

Score: 0

Here is a working solution. Let me know if you have a solution that avoids the join!

// Distinct (GUID, WFID) pairs from rows where both IDs are present
val dfDistinct = df
  .filter(col("GUID") =!= "" && col("WFID") =!= "")
  .select(col("GUID").as("GUID1"), col("WFID").as("WFID1"))
  .distinct()

// Left-join every event to its fully identified pair on either ID,
// then prefer the joined-in IDs over the (possibly empty) originals
df.join(dfDistinct, df("GUID") === dfDistinct("GUID1") || df("WFID") === dfDistinct("WFID1"), "left")
  .withColumn("GUIDnew", when(col("GUID1").isNotNull, col("GUID1")).otherwise(col("GUID")))
  .withColumn("WFIDnew", when(col("WFID1").isNotNull, col("WFID1")).otherwise(col("WFID")))
  .select(col("GUIDnew").as("GUID"), col("WFIDnew").as("WFID"), col("EventId"), col("Time"))
  .show

+-----+-----+-------+----------+
| GUID| WFID|EventId|      Time|
+-----+-----+-------+----------+
|GUID1|WFID1|    WF1|01-01-2023|
|GUID1|WFID1|    WF2|01-02-2023|
|GUID1|WFID1|    WF3|01-03-2023|
|GUID1|WFID1|    WF4|01-04-2023|
|GUID1|WFID1|    WF5|01-05-2023|
|GUID1|WFID1|    WF6|01-06-2023|
|GUID2|     |    WF7|01-07-2023|
|     |WFID2|    WF8|01-08-2023|
+-----+-----+-------+----------+
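
As a side note, here is a possible join-free variant. This is only a sketch, not part of the original answer, and it assumes that missing IDs are empty strings and that every group is linked in at most one hop, i.e. a row missing one ID always shares its other ID with a row where both are filled, as in the example. Two window passes then fill the gaps; max serves as a "pick any non-empty value" aggregate because "" sorts before every real ID:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Pass 1: fill WFID from any row sharing the same non-empty GUID.
// Pass 2: fill GUID from any row sharing the same (now filled) non-empty WFID.
val harmonized = df
  .withColumn("WFID",
    when(col("GUID") =!= "", max("WFID").over(Window.partitionBy("GUID")))
      .otherwise(col("WFID")))
  .withColumn("GUID",
    when(col("WFID") =!= "", max("GUID").over(Window.partitionBy("WFID")))
      .otherwise(col("GUID")))

harmonized.orderBy("EventId").show

If identities can chain across several rows (say GUID_A with WFID_1 in one row and GUID_B with WFID_1 in another), neither this sketch nor the single join above is enough; that becomes a connected-components problem, for which an iterative approach or GraphFrames' connectedComponents is the usual tool.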
