April 7, 2023 00:09:34

Java Spark SQL: Merging and overwriting Datasets with identical schema

Question

Java 11 and Spark SQL 3.3.2 (Scala 2.13 build) here. Please note: I'm using the Java API and would appreciate Java answers, but I can probably decipher Scala-based answers and do the Scala-to-Java conversion myself if needed.

I have 2 Dataset<Row> instances, both with the same exact schema (same columns/headers, which are the same types):

data set #1 (ds1)
===
fruit,quantity
--------------
apple,50
pear,12
orange,0
kiwi,104

data set #2 (ds2)
===
fruit,quantity
--------------
banana,50
pineapple,25
orange,5
blueberry,15

I would like to "merge" these 2 Dataset<Row>s so that they are appended or joined to one another, but in such a way that ds2 overwrites any values in ds1 when their fruit columns match. For example, orange is in both data sets, but in ds2 its quantity is 5, so that is the value that should appear in the resulting Dataset<Row>. In other words, this operation should result in a 3rd data set like so:

data set #3 (ds3)
===
fruit,quantity
--------------
apple,50
pear,12
orange,5
kiwi,104
banana,50
pineapple,25
blueberry,15

The order of the rows does not matter. All that matters to me is that every fruit from both data sets is listed in the 3rd, and that a ds1 row is updated (rather than a second row being inserted) when ds2 contains a matching fruit.

I took a look at the Dataset#join JavaDocs, but they seem useful only when you need the equivalent of a SQL inner join, and don't appear to help with the overwrite behavior I want.

Thanks in advance for any and all help!

Answer 1

Score: 1

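You can use a "full" join to get all rows from both datasets, then use coalesce to pick the preferred column values. coalesce returns its first non-null argument, so listing ds2's quantity first makes ds2 win whenever it has a matching row: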

// Requires: import org.apache.spark.sql.functions;
// Rename ds2's columns so both sides can be referenced unambiguously after the join.
ds2 = ds2.withColumnRenamed("fruit", "fruit2").withColumnRenamed("quantity", "quantity2");
ds1.join(ds2, ds1.col("fruit").equalTo(ds2.col("fruit2")), "full")
        // After a full join the unmatched side is null, so take whichever fruit name exists.
        .withColumn("fruit", functions.coalesce(functions.col("fruit"), functions.col("fruit2")))
        // Prefer ds2's quantity; fall back to ds1's when ds2 has no matching row.
        .withColumn("quantity", functions.coalesce(functions.col("quantity2"), functions.col("quantity")))
        .drop("fruit2", "quantity2")
        .show();

Result:

+---------+--------+
|    fruit|quantity|
+---------+--------+
|     kiwi|     104|
|   orange|       5|
|    apple|      50|
|     pear|      12|
|   banana|      50|
|pineapple|      25|
|blueberry|      15|
+---------+--------+
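
To make the role of the argument order in coalesce easier to see, here is a rough plain-Java model of the same full-join-plus-coalesce logic, with no Spark involved. It is only a sketch: the class name FullJoinCoalesceSketch and the helpers coalesce and merge are made up for illustration, each data set is modeled as a Map from fruit to quantity, and it assumes each fruit appears at most once per data set.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Spark or of the answer above.
class FullJoinCoalesceSketch {

    // Two-argument stand-in for Spark's coalesce: the first non-null value wins.
    static <T> T coalesce(T first, T second) {
        return first != null ? first : second;
    }

    // Models "ds1 FULL JOIN ds2 ON fruit" followed by coalesce(quantity2, quantity).
    static Map<String, Integer> merge(Map<String, Integer> ds1, Map<String, Integer> ds2) {
        // A full join keeps every key that appears on either side.
        Set<String> fruits = new LinkedHashSet<>(ds1.keySet());
        fruits.addAll(ds2.keySet());

        Map<String, Integer> merged = new LinkedHashMap<>();
        for (String fruit : fruits) {
            // Map.get returns null for a missing key, like the null-filled
            // side of an outer join; ds2 is listed first so it takes priority.
            merged.put(fruit, coalesce(ds2.get(fruit), ds1.get(fruit)));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> ds1 = new LinkedHashMap<>();
        ds1.put("apple", 50);
        ds1.put("pear", 12);
        ds1.put("orange", 0);
        ds1.put("kiwi", 104);

        Map<String, Integer> ds2 = new LinkedHashMap<>();
        ds2.put("banana", 50);
        ds2.put("pineapple", 25);
        ds2.put("orange", 5);
        ds2.put("blueberry", 15);

        // prints {apple=50, pear=12, orange=5, kiwi=104, banana=50, pineapple=25, blueberry=15}
        System.out.println(merge(ds1, ds2));
    }
}
```

Swapping the two arguments to coalesce would make ds1 win instead, which is the only thing that distinguishes "ds2 overwrites ds1" from the reverse. As in the Spark version, a null quantity in ds2 is indistinguishable from a missing row, so it falls back to ds1's value.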
