How can I remove duplicate rows in Spark SQL?
Question
My data is like this:
Code | Time | Total Value | Model Type | First Status | Second Status |
---|---|---|---|---|---|
11111 | 07/06/2022 06:45:42 | 23456 | MXJ | Turn On | Turn Off |
11111 | 07/06/2022 06:45:42 | 23456 | MXJ | Turn On | Turn Off |
11111 | 03/02/2022 08:01:11 | 78231 | MXJ | Turn On | Turn Off |
22222 | 04/03/2022 13:23:54 | 20134 | MXJ | Turn On | Turn Off |
22222 | 04/03/2022 13:23:54 | 20134 | MXJ | Turn On | Turn Off |
The result I want:
Code | Time | Total Value | Model Type | First Status | Second Status |
---|---|---|---|---|---|
11111 | 07/06/2022 06:45:42 | 23456 | MXJ | Turn On | Turn Off |
11111 | 03/02/2022 08:01:11 | 78231 | MXJ | Turn On | Turn Off |
22222 | 04/03/2022 13:23:54 | 20134 | MXJ | Turn On | Turn Off |
My code looks like this:
select * from
(
    select
        code,
        `Time`,
        `Model Type`,
        `Total Value`,
        `First Status`,
        -- status of the next row for the same code, ordered by time
        lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
    from file
    where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off'
limit 5
Answer 1
Score: 0
The data in your question is not very clear. However, two methods come to mind for de-duplicating data.
The first is to use DISTINCT. If you want to remove duplicates based on all of your columns, you can do:
SELECT DISTINCT *
FROM <your_table>
If you only want to de-duplicate on a few columns:
SELECT DISTINCT <column_1>, <column_2> ..
FROM <your_table>
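As a rough illustration of the two DISTINCT forms, here is a small sketch that runs them against the five rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in engine so the example is self-contained; it is not Spark, and the table name `sample` and the snake_case column names are assumptions made for the example. `SELECT DISTINCT` itself is written the same way in Spark SQL.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names; the rows are the ones shown in the question.
conn.execute(
    "CREATE TABLE sample ("
    "code TEXT, event_time TEXT, total_value INTEGER, "
    "model_type TEXT, first_status TEXT, second_status TEXT)"
)
rows = [
    ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
    ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
    ("11111", "03/02/2022 08:01:11", 78231, "MXJ", "Turn On", "Turn Off"),
    ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
    ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
]
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?)", rows)

# DISTINCT over every column: exact duplicate rows collapse to one.
all_columns = conn.execute("SELECT DISTINCT * FROM sample").fetchall()

# DISTINCT over a subset: one row per distinct (code, model_type) pair.
some_columns = conn.execute(
    "SELECT DISTINCT code, model_type FROM sample"
).fetchall()

print(len(all_columns))      # 3
print(sorted(some_columns))  # [('11111', 'MXJ'), ('22222', 'MXJ')]
```

Note that the subset form only returns the listed columns, so it fits the question's desired output only when every column that should survive is included in the list.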
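The other option is to use GROUP BY. Grouping by the columns you want to de-duplicate on returns one row per distinct combination:
SELECT <column_1>, <column_2> ..
FROM <your_table>
GROUP BY <column_1>, <column_2> ..
Adding HAVING COUNT(*) > 1 does not de-duplicate. It keeps only the combinations that occur more than once and drops the rows that were already unique, so it is useful for finding duplicates rather than for removing them.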
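To make the effect of GROUP BY and of the HAVING clause concrete, the sketch below runs both on a cut-down copy of the question's rows. As before, sqlite3 is only a self-contained stand-in for Spark SQL, and the table name `sample` and its column names are assumptions for the example.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table; three of the question's columns are enough to show the idea.
conn.execute(
    "CREATE TABLE sample (code TEXT, event_time TEXT, total_value INTEGER)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [
        ("11111", "07/06/2022 06:45:42", 23456),
        ("11111", "07/06/2022 06:45:42", 23456),
        ("11111", "03/02/2022 08:01:11", 78231),
        ("22222", "04/03/2022 13:23:54", 20134),
        ("22222", "04/03/2022 13:23:54", 20134),
    ],
)

# Plain GROUP BY: one row per distinct combination, i.e. de-duplicated data.
grouped = conn.execute(
    "SELECT code, event_time, total_value FROM sample "
    "GROUP BY code, event_time, total_value"
).fetchall()

# GROUP BY with HAVING COUNT(*) > 1: only combinations that appear more than once.
duplicated = conn.execute(
    "SELECT code, event_time, total_value, COUNT(*) AS n FROM sample "
    "GROUP BY code, event_time, total_value "
    "HAVING COUNT(*) > 1"
).fetchall()

print(len(grouped))     # 3
print(len(duplicated))  # 2
```

The row for 03/02/2022 appears in `grouped` but not in `duplicated`, because it occurs only once in the input.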
So, for your situation, I would suggest creating a TEMP VIEW from the query you already have and then applying one of the methods given above:
CREATE OR REPLACE TEMP VIEW tmp
AS
select * from
(
    select
        code,
        `Time`,
        `Model Type`,
        `Total Value`,
        `First Status`,
        lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
    from file
    where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off';

SELECT DISTINCT *
FROM tmp