Java Spark SQL: Merging and overwriting Datasets with identical schema



Java 11 and Spark SQL 2.13:3.3.2 here. Please note: I'm using, and interested in, the Java API and would appreciate Java answers, but I can probably decipher Scala-based answers and do the necessary Scala-to-Java conversion myself. Still, Java would be appreciated!


I have 2 Dataset&lt;Row&gt; instances, both with exactly the same schema (same columns/headers, with the same types):

data set #1 (ds1)
===
fruit,quantity
--------------
apple,50
pear,12
orange,0
kiwi,104

data set #2 (ds2)
===
fruit,quantity
--------------
banana,50
pineapple,25
orange,5
blueberry,15

I would like to "merge" these 2 Dataset&lt;Row&gt;s so that they are appended or joined to one another, but in such a way that ds2 overwrites any values in ds1 if their fruit columns match. So orange is in both data sets, but in ds2 its quantity is 5, so that is the value that should appear in the resultant Dataset&lt;Row&gt; output. In other words, this operation should result in a 3rd data set like so:

data set #3 (ds3)
===
fruit,quantity
--------------
apple,50
pear,12
orange,5
kiwi,104
banana,50
pineapple,25
blueberry,15

The order of the rows does not matter, all that matters to me is that the list of fruits in both data sets are listed in the 3rd, and that ds1 rows are updated (not inserted) if there is a matching fruit in ds2.
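
To illustrate the semantics I'm after in plain Java terms (a hypothetical sketch, not Spark code): it behaves like Map.putAll, where entries from the second map overwrite matching keys in the first and new keys are simply added:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertSketch {
    // ds2's rows win on a matching fruit key; non-matching rows are appended
    static Map<String, Integer> merge(Map<String, Integer> ds1,
                                      Map<String, Integer> ds2) {
        Map<String, Integer> merged = new LinkedHashMap<>(ds1);
        merged.putAll(ds2); // overwrites "orange" -> 0 with "orange" -> 5
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> ds1 = new LinkedHashMap<>();
        ds1.put("apple", 50);
        ds1.put("pear", 12);
        ds1.put("orange", 0);
        ds1.put("kiwi", 104);

        Map<String, Integer> ds2 = new LinkedHashMap<>();
        ds2.put("banana", 50);
        ds2.put("pineapple", 25);
        ds2.put("orange", 5);
        ds2.put("blueberry", 15);

        Map<String, Integer> ds3 = merge(ds1, ds2);
        System.out.println(ds3.get("orange")); // 5: ds2's quantity wins
        System.out.println(ds3.size());        // 7 distinct fruits
    }
}
```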

I took a look at the Dataset#join JavaDocs, but they seem useful only when you need the SQL equivalent of an inner join, and won't help me with the desired overwrite functionality.

Thanks in advance for any and all help!

Answer 1

Score: 1


You can use a "full" join to get all rows from both datasets, then use coalesce to pick the preferred column values:

// assumes: import org.apache.spark.sql.functions;
//          import static org.apache.spark.sql.functions.col;
ds2 = ds2.withColumnRenamed("fruit", "fruit2").withColumnRenamed("quantity", "quantity2");
ds1.join(ds2, ds1.col("fruit").equalTo(ds2.col("fruit2")), "full")
        .withColumn("fruit", functions.coalesce(col("fruit"), col("fruit2")))
        .withColumn("quantity", functions.coalesce(col("quantity2"), col("quantity")))
        .drop("fruit2", "quantity2")
        .show();

Result:

+---------+--------+
|    fruit|quantity|
+---------+--------+
|     kiwi|     104|
|   orange|       5|
|    apple|      50|
|     pear|      12|
|   banana|      50|
|pineapple|      25|
|blueberry|      15|
+---------+--------+
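
The overwrite comes from coalesce, which returns its first non-null argument: after the full outer join, a row present only in ds1 has null fruit2/quantity2 (and vice versa), so coalesce(quantity2, quantity) lets ds2's quantity shadow ds1's whenever both exist. A plain-Java sketch of that behavior (an illustrative helper, not Spark's implementation):

```java
public class CoalesceSketch {
    // Mirrors SQL COALESCE: return the first non-null argument, or null if all are null
    @SafeVarargs
    static <T> T coalesce(T... values) {
        for (T v : values) {
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "orange" exists on both sides of the full join: quantity2 = 5 wins
        System.out.println(coalesce(Integer.valueOf(5), Integer.valueOf(0)));  // 5
        // "apple" exists only in ds1: quantity2 is null, so quantity = 50 survives
        System.out.println(coalesce((Integer) null, Integer.valueOf(50)));     // 50
    }
}
```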

huangapple
  • Posted on 2023-04-07 00:09:34
  • When reposting, please retain this link: https://go.coder-hub.com/75951587.html