Spark: Combining Disparate-Rate DataFrames in Time

Question
Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processing serially, but seems daunting when processing in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time | Value1 |
---|---|
10 | 1 |
20 | 2 |
30 | 3 |
And I want to combine it with another set of values that is very irregular:
Relative Time | Value2 |
---|---|
1 | 100 |
22 | 200 |
And get this (driven by Value1):
Relative Time | Value1 | Value2 |
---|---|---|
10 | 1 | 100 |
20 | 2 | 100 |
30 | 3 | 200 |
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as being very slow, and it might be, but it could also be much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest one. Because of this, doing a union of them and windowing doesn't seem practical.
How would I accomplish this in Spark?
Answer 1
Score: 1
I think you can do:
- Do a full outer join between the two tables
- Use the `last` function (with `ignoreNulls`) to look back to the closest preceding value of `value2`
```scala
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df1 = spark.sparkContext.parallelize(Seq(
  (10, 1),
  (20, 2),
  (30, 3)
)).toDF("Relative Time", "value1")

val df2 = spark.sparkContext.parallelize(Seq(
  (1, 100),
  (22, 200)
)).toDF("Relative Time", "value2_temp")

// Keep every timestamp from both sides.
val df = df1.join(df2, Seq("Relative Time"), "outer")

// Note: a window without partitionBy moves all rows to a single partition.
val window = Window.orderBy("Relative Time")

// Carry the most recent non-null value2 forward, then keep only the value1 rows.
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")

result.show()
```
```
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
|           10|     1|   100|
|           20|     2|   100|
|           30|     3|   200|
+-------------+------+------+
```
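For the scenario where Value2 only has a few hundred values, a possible alternative is to skip the global window entirely: collect and broadcast df2, then resolve the latest value2 for each row of df1 with a UDF. This is a minimal sketch under that assumption, reusing the df1 and df2 defined above; `latestValue2` is a hypothetical helper name:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Sorted (time, value2) pairs collected to the driver -- assumes df2 is small.
val lookup = df2.orderBy("Relative Time")
  .collect()
  .map(r => (r.getInt(0), r.getInt(1)))
val lookupBc = spark.sparkContext.broadcast(lookup)

// Hypothetical helper: latest value2 at or before the given time, None if nothing has arrived yet.
val latestValue2 = udf { t: Int =>
  val earlier = lookupBc.value.takeWhile(_._1 <= t)
  if (earlier.isEmpty) None else Some(earlier.last._2)
}

val resultSmall = df1.withColumn("value2", latestValue2(col("Relative Time")))
resultSmall.show()
```

This keeps df1 fully distributed (no shuffle and no single-partition window), at the cost of holding all of df2 in memory on the driver and each executor.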