Java Spark SQL: Merging and overwriting Datasets with identical schema
Question
Java 11 and Spark SQL 3.3.2 (the Scala 2.13 build) here. Please note: I'm using and interested in the Java API and would appreciate Java answers, but I can probably decipher Scala-based answers and do the necessary Scala-to-Java conversions if need be. But Java would be appreciated!
I have 2 Dataset<Row> instances, both with the exact same schema (same columns/headers, with the same types):
data set #1 (ds1)
===
fruit,quantity
--------------
apple,50
pear,12
orange,0
kiwi,104
data set #2 (ds2)
===
fruit,quantity
--------------
banana,50
pineapple,25
orange,5
blueberry,15
I would like to "merge" these 2 Dataset<Row>s so that they are appended or joined to one another, but in such a way that ds2 overwrites any values in ds1 if their fruit columns match. So orange is in both data sets, but in ds2 its quantity is 5, so that should be the final listing in the resultant Dataset<Row> output. In other words, this operation should result in a 3rd data set like so:
data set #3 (ds3)
===
fruit,quantity
--------------
apple,50
pear,12
orange,5
kiwi,104
banana,50
pineapple,25
blueberry,15
The order of the rows does not matter; all that matters to me is that the fruits from both data sets are listed in the 3rd, and that ds1 rows are updated (not inserted) if there is a matching fruit in ds2.
I took a look at the Dataset#join JavaDocs, but they seem to be useful only when you need the SQL equivalent of an inner join, and won't help me with the desired overwrite functionality.
Thanks in advance for any and all help!
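Stripped of Spark, what is being asked for is a keyed upsert: append ds2's rows, but let them replace any ds1 row that shares a fruit key. A minimal plain-Java sketch of those semantics (a Map stands in for the datasets here; this is an illustration of the desired behavior, not Spark code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeSemantics {
    public static void main(String[] args) {
        // ds1: base rows, keyed by fruit
        Map<String, Integer> ds1 = new LinkedHashMap<>();
        ds1.put("apple", 50);
        ds1.put("pear", 12);
        ds1.put("orange", 0);
        ds1.put("kiwi", 104);

        // ds2: rows that should insert-or-overwrite
        Map<String, Integer> ds2 = new LinkedHashMap<>();
        ds2.put("banana", 50);
        ds2.put("pineapple", 25);
        ds2.put("orange", 5);
        ds2.put("blueberry", 15);

        // putAll implements "ds2 wins on matching keys" — the desired merge
        Map<String, Integer> ds3 = new LinkedHashMap<>(ds1);
        ds3.putAll(ds2);

        System.out.println(ds3.get("orange")); // 5: overwritten by ds2
        System.out.println(ds3.size());        // 7 distinct fruits
    }
}
```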
Answer 1
Score: 1
You can use a "full" join to get all values from both datasets, then use coalesce to pick the preferred column values:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.functions;

ds2 = ds2.withColumnRenamed("fruit", "fruit2").withColumnRenamed("quantity", "quantity2");
ds1.join(ds2, ds1.col("fruit").equalTo(ds2.col("fruit2")), "full")
        .withColumn("fruit", functions.coalesce(col("fruit"), col("fruit2")))
        .withColumn("quantity", functions.coalesce(col("quantity2"), col("quantity")))
        .drop("fruit2", "quantity2")
        .show();
Result:
+---------+--------+
| fruit|quantity|
+---------+--------+
| kiwi| 104|
| orange| 5|
| apple| 50|
| pear| 12|
| banana| 50|
|pineapple| 25|
|blueberry| 15|
+---------+--------+
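The trick in this answer is that after a full outer join, columns from the side that had no match come back null, and coalesce returns its first non-null argument. A Spark-free illustration of that rule in plain Java (the coalesce helper here is hand-rolled for demonstration, not the Spark API):

```java
public class CoalesceDemo {

    // Returns the first non-null argument, mirroring SQL/Spark coalesce.
    @SafeVarargs
    static <T> T coalesce(T... values) {
        for (T v : values) {
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // orange matched in both datasets: ds2's quantity2 is non-null and wins
        Integer orangeQty1 = 0, orangeQty2 = 5;
        System.out.println(coalesce(orangeQty2, orangeQty1)); // 5

        // pear exists only in ds1: quantity2 is null after the full join,
        // so coalesce falls back to ds1's value
        Integer pearQty1 = 12, pearQty2 = null;
        System.out.println(coalesce(pearQty2, pearQty1)); // 12
    }
}
```

This is why the answer writes coalesce(quantity2, quantity) for the quantity column (ds2 should win) but coalesce(fruit, fruit2) for the key column (either side identifies the row).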