SQL: Could you help me write a SQL query that rebuilds a table from an existing table without duplicated IDs?
Question
Sure, here's a SQL query that creates a new table from the existing one without duplicated IDs, keeping the latest row for each Id. Amazon Redshift does not support DISTINCT ON, so this version uses ROW_NUMBER() instead:
CREATE TABLE new_table AS
SELECT
Id,
name,
visited,
column1,
column2
FROM (
SELECT
Id,
name,
visited,
column1,
column2,
/* Replace SomeColumn with a date or timestamp column that identifies the latest row. */
/* Change DESC to ASC if you want the earliest row instead of the latest. */
/* Without such a column, drop the ORDER BY; the row kept for each Id is then arbitrary. */
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY SomeColumn DESC) AS rn
FROM your_existing_table
) AS ranked
WHERE rn = 1;
-- Optional: Drop the old table
-- DROP TABLE your_existing_table;
-- Optional: Rename the new table to match the old one
-- ALTER TABLE new_table RENAME TO your_existing_table;
This query creates a new table called new_table with one row per Id, keeping the latest row according to the chosen ordering. You can optionally drop the old table and rename the new table to match it, as in your original query.
Could you help me write a SQL query that rebuilds a table from an existing table without duplicated IDs?
I want to keep the most recent row among all rows with a duplicated Id.
The table looks like this:
Id | name | visited | column1 | column2 |
---|---|---|---|---|
xd01s | sam | 23 | Null | string |
sc01t | susan | 12 | string | string |
t01sc | tom | 22 | Null | Null |
xd01s | san | 12 | string | string |
My table (actually several tables) is in an Amazon Redshift database. While storing my data there, I found out that the same Id is getting duplicated despite the primary key.
So I've decided to recreate the table without the duplicated data (deleting rows in place seems costly).
From the example table, the new table I want would look like this:
Id | name | visited | column1 | column2 |
---|---|---|---|---|
sc01t | susan | 12 | string | string |
t01sc | tom | 22 | Null | Null |
xd01s | san | 12 | string | string |
This preserves the latest 'xd01s' row and gets rid of the older 'xd01s' rows.
None of the columns tells which row is the most recent (there is no time, date, or incrementing value).
I think the largest row number in the initial order is the only way to identify the most recent one, but the SQL queries I've tried keep failing (lack of experience).
So far I've used this query with the psycopg2 Python package:
alter table my_table rename to my_table_old
create table my_table
as
select distinct *
from my_table_old
drop table if exists my_table_old cascade
but this only removes rows in which every column value is identical.
Thanks a lot for reading my question and answering it ^^.
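To illustrate why the attempt above falls short, here is a minimal sketch that replays the sample data against an in-memory SQLite database. SQLite is only a stand-in for Redshift here (an assumption made so the snippet runs anywhere), and the fifth row is an exact copy added purely for illustration. SELECT DISTINCT * collapses the exact copy but leaves both differing 'xd01s' rows in place.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table_old "
    "(Id TEXT, name TEXT, visited INTEGER, column1 TEXT, column2 TEXT)"
)

rows = [
    ("xd01s", "sam", 23, None, "string"),
    ("sc01t", "susan", 12, "string", "string"),
    ("t01sc", "tom", 22, None, None),
    ("xd01s", "san", 12, "string", "string"),
    ("xd01s", "san", 12, "string", "string"),  # exact copy, added for illustration
]
conn.executemany("INSERT INTO my_table_old VALUES (?, ?, ?, ?, ?)", rows)

# DISTINCT compares whole rows, so only the exact copy is removed.
distinct_rows = conn.execute("SELECT DISTINCT * FROM my_table_old").fetchall()
ids = [row[0] for row in distinct_rows]

print(len(distinct_rows), "rows remain;", ids.count("xd01s"), "of them have Id xd01s")
```

Deduplicating on Id alone therefore needs a per-Id rule for choosing the surviving row, which DISTINCT * cannot express.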
Answer 1
Score: 0
This query will do:
SELECT
Id
,name
,visited
,column1
,column2
FROM (
SELECT
Id
,name
,visited
,column1
,column2
,ROW_NUMBER() OVER (PARTITION BY Id) AS rn
FROM
my_table_old
) AS my_table_aux
WHERE
rn = 1
;
The subquery creates a new column, rn, that assigns a sequential integer to each row sharing the same value in the Id field.
Id | name | visited | column1 | column2 | rn |
---|---|---|---|---|---|
xd01s | sam | 23 | Null | string | 1 |
sc01t | susan | 12 | string | string | 1 |
t01sc | tom | 22 | Null | Null | 1 |
xd01s | san | 12 | string | string | 2 |
The outer query filters the inner result to keep just the first row within each group of rows with the same Id (every Id is guaranteed to have exactly one record with an rn value of 1).
Id | name | visited | column1 | column2 |
---|---|---|---|---|
xd01s | sam | 23 | Null | string |
sc01t | susan | 12 | string | string |
t01sc | tom | 22 | Null | Null |
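The numbering and filtering steps can be tried out with a small sketch like the one below. It assumes an in-memory SQLite database (version 3.25 or later, for window function support) as a stand-in for Redshift, and it only checks properties that hold regardless of which duplicate happens to receive rn = 1.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table_old "
    "(Id TEXT, name TEXT, visited INTEGER, column1 TEXT, column2 TEXT)"
)
conn.executemany(
    "INSERT INTO my_table_old VALUES (?, ?, ?, ?, ?)",
    [
        ("xd01s", "sam", 23, None, "string"),
        ("sc01t", "susan", 12, "string", "string"),
        ("t01sc", "tom", 22, None, None),
        ("xd01s", "san", 12, "string", "string"),
    ],
)

# Inner query: number the rows within each Id (no ORDER BY, so the order is arbitrary).
numbered = conn.execute(
    "SELECT Id, name, ROW_NUMBER() OVER (PARTITION BY Id) AS rn FROM my_table_old"
).fetchall()

# Collect the rn values handed out per Id.
rn_by_id = {}
for row_id, _name, rn in numbered:
    rn_by_id.setdefault(row_id, []).append(rn)

# Outer filter: keep only the rows numbered 1.
kept = [(row_id, name) for row_id, name, rn in numbered if rn == 1]

print(rn_by_id)
print(kept)
```

Each Id ends up with exactly one surviving row, but for 'xd01s' the snippet makes no promise about whether sam or san is the one kept.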
Note that nothing guarantees that the numbering follows the original order of the records, so without an ORDER BY inside the OVER clause the row that receives rn = 1 is arbitrary (in the example above it happens to be sam rather than the newer san). We would need a field to provide an ORDER BY clause for the OVER clause.
Let's suppose you have a datetime field named updated_on that can be used to sort the records from newest to oldest. In that case, the query should be as follows:
SELECT
Id
,name
,visited
,column1
,column2
,updated_on
FROM (
SELECT
Id
,name
,visited
,column1
,column2
,updated_on
,ROW_NUMBER() OVER (PARTITION BY Id ORDER BY updated_on DESC) AS rn
FROM
my_table_old
) AS my_table_aux
WHERE
rn = 1
;
See the documentation of the ROW_NUMBER window function for details.
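Putting the pieces together, the rename / recreate / drop workflow from the question could be combined with the ROW_NUMBER query roughly as follows. This is a sketch under stated assumptions, not a tested Redshift script: it runs against an in-memory SQLite database (3.25 or later), the helper name dedupe_keep_latest is hypothetical, and the updated_on values are invented sample data. With psycopg2 the same statements would be sent through a cursor, and the CASCADE option from the question is omitted because SQLite does not accept it.

```python
import sqlite3


def dedupe_keep_latest(conn, table, key, order_col, columns):
    """Rebuild `table`, keeping for each `key` the row with the largest `order_col`.

    Hypothetical helper. Identifiers are interpolated directly into the SQL,
    so they must come from trusted code, never from user input.
    """
    old = f"{table}_old"
    col_list = ", ".join(columns)
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(
        f"""
        CREATE TABLE {table} AS
        SELECT {col_list}
        FROM (
            SELECT {col_list},
                   ROW_NUMBER() OVER (
                       PARTITION BY {key} ORDER BY {order_col} DESC
                   ) AS rn
            FROM {old}
        ) AS ranked
        WHERE rn = 1
        """
    )
    conn.execute(f"DROP TABLE {old}")
    conn.commit()


# Demo data; the updated_on values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (Id TEXT, name TEXT, visited INTEGER, updated_on TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("xd01s", "sam", 23, "2023-01-01 10:00:00"),
        ("sc01t", "susan", 12, "2023-01-02 10:00:00"),
        ("t01sc", "tom", 22, "2023-01-03 10:00:00"),
        ("xd01s", "san", 12, "2023-02-01 10:00:00"),
    ],
)

dedupe_keep_latest(conn, "my_table", "Id", "updated_on", ["Id", "name", "visited", "updated_on"])
result = conn.execute("SELECT Id, name FROM my_table ORDER BY Id").fetchall()
print(result)
```

Ordering by updated_on DESC is what makes the newer san row win for 'xd01s'. Without a column like that, there is nothing in the data the query can use to tell which duplicate is the latest.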