2023-02-06 11:58:13

How could I remove duplicate rows in Spark SQL?

Question


My data is like this:

Code	Time	Total Value	Model Type	First Status	Second Status
11111	07/06/2022 06:45:42	23456	MXJ	Turn On	Turn Off
11111	07/06/2022 06:45:42	23456	MXJ	Turn On	Turn Off
11111	03/02/2022 08:01:11	78231	MXJ	Turn On	Turn Off
22222	04/03/2022 13:23:54	20134	MXJ	Turn On	Turn Off
22222	04/03/2022 13:23:54	20134	MXJ	Turn On	Turn Off

The result I want:

Code	Time	Total Value	Model Type	First Status	Second Status
11111	07/06/2022 06:45:42	23456	MXJ	Turn On	Turn Off
11111	03/02/2022 08:01:11	78231	MXJ	Turn On	Turn Off
22222	04/03/2022 13:23:54	20134	MXJ	Turn On	Turn Off

My code is like this:

select * from
(
  select
     code,
     `Time`,
     `Model Type`,
     `Total Value`,
     `First Status`,
     lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
  from file
  where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off'
limit 5

Answer 1

Score: 0


The data in your question is not very clear. However, two methods come to mind for de-duplicating data.

The first is to use DISTINCT. If you want to remove duplicates based on all of your columns, you can do:

SELECT DISTINCT *
FROM <your_table>

If you want it to be based on only a few columns (the result then contains only those columns):

SELECT DISTINCT <column_1>, <column_2> ..
FROM <your_table>
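
As a rough illustration of both DISTINCT forms, the sketch below loads the sample rows from the question into an in-memory SQLite database through Python's standard sqlite3 module. SQLite is only a stand-in so the example runs without a Spark installation; the table name readings and the snake_case column names are assumptions, not names from the question. The two SELECT DISTINCT statements use plain SQL that Spark SQL should accept as well.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Spark SQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE readings (
        code TEXT, time TEXT, total_value INTEGER,
        model_type TEXT, first_status TEXT, second_status TEXT
    )
    """
)

# The five sample rows from the question, including the two duplicated rows.
rows = [
    ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
    ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
    ("11111", "03/02/2022 08:01:11", 78231, "MXJ", "Turn On", "Turn Off"),
    ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
    ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?, ?)", rows)

# De-duplicate on every column: rows that are identical in all columns collapse to one.
all_columns = conn.execute("SELECT DISTINCT * FROM readings").fetchall()

# De-duplicate on a subset: only the listed columns appear in the result.
some_columns = conn.execute(
    "SELECT DISTINCT code, model_type FROM readings"
).fetchall()

print(len(rows), "input rows,", len(all_columns), "distinct rows")
for row in all_columns:
    print(row)
print(sorted(some_columns))
```

Note that the second query drops the time and value columns entirely, so it only fits cases where those columns are not needed in the output.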

The other option is to use GROUP BY. Grouping by the columns that you want to de-duplicate on returns one row per distinct combination of those columns:

SELECT <column_1>, <column_2> ..
FROM <your_table>
GROUP BY <column_1>, <column_2> ..

Appending HAVING COUNT(*) > 1 would instead return only the combinations that occur more than once, which is useful for inspecting the duplicates but does not remove them.
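
To make the difference between plain GROUP BY and GROUP BY with HAVING COUNT(*) > 1 concrete, here is a small sketch, again using Python's sqlite3 module as a stand-in for Spark SQL. The table name readings and its three columns are hypothetical and cover only part of the question's data. Plain GROUP BY yields one row per distinct combination, while the HAVING filter keeps only the combinations that appear more than once.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Spark SQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (code TEXT, time TEXT, total_value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("11111", "07/06/2022 06:45:42", 23456),
        ("11111", "07/06/2022 06:45:42", 23456),
        ("11111", "03/02/2022 08:01:11", 78231),
        ("22222", "04/03/2022 13:23:54", 20134),
        ("22222", "04/03/2022 13:23:54", 20134),
    ],
)

# GROUP BY alone: one output row per distinct (code, time, total_value).
deduplicated = conn.execute(
    """
    SELECT code, time, total_value
    FROM readings
    GROUP BY code, time, total_value
    """
).fetchall()

# GROUP BY with HAVING COUNT(*) > 1: only the combinations that are duplicated.
duplicated_only = conn.execute(
    """
    SELECT code, time, total_value, COUNT(*) AS occurrences
    FROM readings
    GROUP BY code, time, total_value
    HAVING COUNT(*) > 1
    """
).fetchall()

print("de-duplicated:", sorted(deduplicated))
print("duplicated groups:", sorted(duplicated_only))
```

The row for 03/02/2022 appears in the first result but not in the second, because it occurs only once in the input.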

So, for your situation, I would suggest creating a TEMP VIEW from the query you already have and then applying one of the methods given above:

CREATE OR REPLACE TEMP VIEW tmp
AS
select * from
(
  select
    code,
    `Time`,
    `Model Type`,
    `Total Value`,
    `First Status`,
    lead(`First Status`, 1, null) over (partition by code order by `Time` asc) as `Second Status`
  from file
  where `Model Type` = 'MXJ'
) t
where `First Status` = 'Turn On' and `Second Status` = 'Turn Off';

SELECT DISTINCT *
FROM tmp;
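
The view-then-DISTINCT pattern can be tried end to end with the sketch below. It makes several assumptions: SQLite (via Python's sqlite3 module) stands in for Spark SQL, so the view is created with CREATE TEMP VIEW rather than Spark's CREATE OR REPLACE TEMP VIEW; the column names are hypothetical snake_case versions of those in the question; and the view body is simplified to a plain filter over rows that already carry both status columns, instead of deriving the second status with lead(). The extra ABC row is invented only to show the filter at work.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file (
        code TEXT, time TEXT, total_value INTEGER,
        model_type TEXT, first_status TEXT, second_status TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
        ("11111", "07/06/2022 06:45:42", 23456, "MXJ", "Turn On", "Turn Off"),
        ("11111", "03/02/2022 08:01:11", 78231, "MXJ", "Turn On", "Turn Off"),
        ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
        ("22222", "04/03/2022 13:23:54", 20134, "MXJ", "Turn On", "Turn Off"),
        # Invented row with a different model type, removed by the view's filter.
        ("33333", "05/01/2022 10:00:00", 500, "ABC", "Turn On", "Turn Off"),
    ],
)

# Step 1: wrap the existing (here simplified) query in a temporary view.
conn.execute(
    """
    CREATE TEMP VIEW tmp AS
    SELECT *
    FROM file
    WHERE model_type = 'MXJ'
      AND first_status = 'Turn On'
      AND second_status = 'Turn Off'
    """
)

# Step 2: de-duplicate the view's output on all columns.
result = conn.execute("SELECT DISTINCT * FROM tmp").fetchall()

for row in result:
    print(row)
```

Keeping the filtering logic in the view and the de-duplication in the outer query means either part can be changed without touching the other.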
