如何使用Spark SQL将表格按照固定参数连接成平面结构

huangapple go评论100阅读模式
英文:

How to Join tables flat with fixed params with Spark SQL

问题

我有一个在Spark中的数据集,其中有一些列。我想要通过保留一些列作为参考(id、date、name)来扁平地连接这些数据集。

数据集1:(d1)

id date name val1
1 15-06-2022 n1 10
2 15-06-2022 n1 12
3 15-06-2022 n1 24
4 16-09-2021 n2 27

数据集2:(d2)

id date name val2
1 15-06-2022 n1 15
1 12-06-2021 n1 123
2 15-06-2022 n2 56
15 21-03-2022 n1 43
4 16-09-2021 n2 65

输出

id date name val1 val2
1 15-06-2022 n1 10 15
1 12-06-2021 n1 0 123
2 15-06-2022 n1 12 0
2 15-06-2022 n2 0 56
3 15-06-2022 n1 24 0
4 16-09-2021 n2 27 65
15 21-03-2022 n1 0 43

我将使id、date、name列保持不变,如果这三列都相同,那么我将进行扁平连接,否则在空值的情况下创建新行,并将值设置为0。

请帮助我使用Spark中的适当SQL或数据集逻辑来获得所需的输出。

英文:

I have datesets in spark with certain columns. I want to join those dataset flat by keeping some columns as reference(id, date, name).

DATASET 1: (d1)

id date name val1
1 15-06-2022 n1 10
2 15-06-2022 n1 12
3 15-06-2022 n1 24
4 16-09-2021 n2 27

DATASET 1: (d2)

id date name val2
1 15-06-2022 n1 15
1 12-06-2021 n1 123
2 15-06-2022 n2 56
15 21-03-2022 n1 43
4 16-09-2021 n2 65

Output

id date name val1 val2
1 15-06-2022 n1 10 15
1 12-06-2021 n1 0 123
2 15-06-2022 n1 12 0
2 15-06-2022 n2 0 56
3 15-06-2022 n1 24 0
4 16-09-2021 n2 27 65
15 21-03-2022 n1 0 43

I will make the id, data, name columns fixed, If all three are same then I will do flat join, else create new row with the values being 0 incase of null

Please help me with appropriate sql or dataset logic in spark to get the desired output.

答案1

得分: 1

使用基于Id,date,namefull outer join来将两个数据集的所有数据合并,如果有匹配的记录,则根据Id、date和name进行连接。使用coalesce来避免空值。

declare @Datset1 table (
    Id int,
    date varchar(120),
    name varchar(10),
    val1 int
);

insert into @Datset1
values(1,'15-06-2022','n1',10),
(2,'15-06-2022','n1',12),
(3,'15-06-2022','n1',24),
(4,'16-09-2021','n2',27);


declare @Datset2 table (
    Id int,
    date varchar(120),
    name varchar(10),
    val2 int
);
insert into @Datset2
values(1 ,'15-06-2022','n1', 15),
(1 ,'12-06-2021','n1', 123),
(2 ,'15-06-2022','n2', 56),
(15 ,'21-03-2022','n1', 43),
(4 ,'16-09-2021','n2', 65);

select coalesce(dt1.Id,dt2.Id) as Id,
coalesce(dt1.date,dt2.date) as date,
coalesce(dt1.name,dt2.name) as name,
coalesce(dt1.val1,0) as val1,
coalesce(dt2.val2,0) as val2
from @Datset1 dt1 full outer join @Datset2 dt2 on dt1.Id=dt2.Id and dt1.date=dt2.date and dt1.name=dt2.name;

输出如下:

Id date name val1 val2
1 15-06-2022 n1 10 15
2 15-06-2022 n1 12 0
3 15-06-2022 n1 24 0
4 16-09-2021 n2 27 65
1 12-06-2021 n1 0 123
2 15-06-2022 n2 0 56
15 21-03-2022 n1 0 43
英文:

use full outer join based on Id,date,name to Result all data of two data sets if matching Records are there it will join based on Id,date,name. use coalesce to avoid null's

	declare @Datset1 table (
	Id int,
	date varchar(120),
	name varchar(10),
	val1 int
	);

	insert into @Datset1
	values(1,'15-06-2022','n1',10),
	(2,'15-06-2022','n1',12),
	(3,'15-06-2022','n1',24),
	(4,'16-09-2021','n2',27);


	declare @Datset2 table (
	Id int,
	date varchar(120),
	name varchar(10),
	val2 int
	);
	insert into @Datset2
	values(1	,'15-06-2022','n1',	15),
	(1	,'12-06-2021','n1',	123),
	(2	,'15-06-2022','n2',	56),
	(15	,'21-03-2022','n1',	43),
	(4	,'16-09-2021','n2',	65);

	select coalesce(dt1.Id,dt2.Id) as Id,
	coalesce(dt1.date,dt2.date) as date,
	coalesce(dt1.name,dt2.name) as name,
	coalesce(dt1.val1,0) as val1,
	coalesce(dt2.val2,0) as val2
	from @Datset1 dt1 full outer join @Datset2 dt2 on dt1.Id=dt2.Id and dt1.date=dt2.date and dt1.name=dt2.name;

Output will be as follows

Id date name val1 val2
1 15-06-2022 n1 10 15
2 15-06-2022 n1 12 0
3 15-06-2022 n1 24 0
4 16-09-2021 n2 27 65
1 12-06-2021 n1 0 123
2 15-06-2022 n2 0 56
15 21-03-2022 n1 0 43

huangapple
  • 本文由 发表于 2023年7月13日 16:51:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/76677547.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定