你如何在SparkSql中删除重复行?

huangapple go评论61阅读模式
英文:

How could I remove duplicate rows in SparkSql?

问题

以下是您提供的数据和代码的翻译:

我的数据如下:

代码 时间 总值 型号类型 第一状态 第二状态
11111 07/06/2022 06:45:42 23456 MXJ 打开 关闭
11111 07/06/2022 06:45:42 23456 MXJ 打开 关闭
11111 03/02/2022 08:01:11 78231 MXJ 打开 关闭
22222 04/03/2022 13:23:54 20134 MXJ 打开 关闭
22222 04/03/2022 13:23:54 20134 MXJ 打开 关闭

我想要的结果:

代码 时间 总值 型号类型 第一状态 第二状态
11111 07/06/2022 06:45:42 23456 MXJ 打开 关闭
11111 03/02/2022 08:01:11 78231 MXJ 打开 关闭
22222 04/03/2022 13:23:54 20134 MXJ 打开 关闭

我的代码如下:

从
(
  选择
     代码,
     时间,
     型号类型,
     总值,
     第一状态,
     lead(第一状态, 1, null) over(partition by 代码 order by 时间 asc) as 第二状态
  从文件中选择
  其中型号类型 = 'MXJ'
) t 
其中第一状态='打开' 和 第二状态='关闭'
限制 5
英文:

My data is like this:

Code Time Total Value Model Type First Status Second Status
11111 07/06/2022 06:45:42 23456 MXJ Turn On Turn Off
11111 07/06/2022 06:45:42 23456 MXJ Turn On Turn Off
11111 03/02/2022 08:01:11 78231 MXJ Turn On Turn Off
22222 04/03/2022 13:23:54 20134 MXJ Turn On Turn Off
22222 04/03/2022 13:23:54 20134 MXJ Turn On Turn Off

The result I Want:

Code Time Total Value Model Type First Status Second Status
11111 07/06/2022 06:45:42 23456 MXJ Turn On Turn Off
11111 03/02/2022 08:01:11 78231 MXJ Turn On Turn Off
22222 04/03/2022 13:23:54 20134 MXJ Turn On Turn Off

My code is like this:

select * from 
(
  select
     code,
     Time,
     Model Type,
     Total Value,
     First Status,
     lead(First Status, 1, null) over(partition by code order by Time asc) as Second Status
  from file
  where Model Type = 'MXJ'
) t 
where First Status='Turn On' and Second='Turn Off'
limit 5

答案1

得分: 0

以下是翻译好的部分:

"你的问题中的数据不太清晰。然而,有两种方法可以用来去除重复数据。

第一种方法是使用 DISTINCT。所以,如果你想要基于所有列去除重复项,你可以这样做:

SELECT DISTINCT *
FROM <your_table>

如果你只想基于某些列去重复,可以这样:

SELECT DISTINCT <column_1>, <column_2> ..
FROM <your_table>

另一种选择是使用 GROUP BYHAVING。你可以根据想要去重的列进行分组,然后过滤掉计数大于1的行:

SELECT <column_1>, <column_2> ..
FROM <your_table>
GROUP BY <column_1>, <column_2> ..
HAVING COUNT(*) > 1

所以,对于你的情况,我建议使用你已经有的查询创建一个临时视图,然后应用上述提供的方法之一:

CREATE OR REPLACE TEMP VIEW tmp
AS
select
  code,
  Time,
  Model Type,
  Total Value,
  First Status,
  lead(First Status, 1, null) over(partition by code order by Time asc) as Second Status
from file
where Model Type = 'MXJ'
) t 
where First Status='Turn On' and Second='Turn Off';

SELECT DISTINCT *
FROM tmp
英文:

The data in your questions is not very clear. However, there are two methods that come to mind in de-duplicating data.

The first is to use DISTINCT. So, if you want to remove duplicates based on all of your columns, you can do,

SELECT DISTINCT *
FROM &lt;your_table&gt;

If you want it to be based on a few columns,

SELECT DISTINCT &lt;column_1&gt;, &lt;column_2&gt; ..
FROM &lt;your_table&gt;

The other option is to use GROUP BY with HAVING. You can group by the columns that you want to de-duplicate based on and then filter out rows with a count greater than 1,

SELECT &lt;column_1&gt;, &lt;column_2&gt; ..
FROM &lt;your_table&gt;
GROUP BY &lt;column_1&gt;, &lt;column_2&gt; ..
HAVING COUNT(*) &gt; 1

So, for your situation, I would suggest creating a TEMP VIEW using the query you have already and then applying one of the methods given above,

CREATE OR REPLACE TEMP VIEW tmp
AS
select
  code,
  Time,
  Model Type,
  Total Value,
  First Status,
  lead(First Status, 1, null) over(partition by code order by Time asc) as Second Status
from file
where Model Type = &#39;MXJ&#39;
    ) t 
where First Status=&#39;Turn On&#39; and Second=&#39;Turn Off&#39;

SELECT DISTINCT *
FROM tmp

huangapple
  • 本文由 发表于 2023年2月6日 11:58:13
  • 转载请务必保留本文链接:https://go.coder-hub.com/75357223.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定