英文:
SQL Merge vs Check and Insert/Update in Java
问题
我有一个Java(Spring)REST API端点,在这里我获得3个数据输入,我需要根据一些唯一的“ID”使用“JDBCTemplate”插入Oracle数据库。但只是为了确保不出问题,我想首先进行检查,看看是否需要插入还是只需更新。
第一种方法
使用一个简单的查询进行数据库调用,类似于:
SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1
根据计数的值,根据需要进行单独的插入或更新数据库调用。(计数永远不会超过1)
第二种方法
使用jdbctemplate.update()
进行一次单一的MERGE
查询操作,类似于:
MERGE INTO TABLENAME
USING DUAL ON ID='ABC'
WHEN MATCHED THEN UPDATE
SET COL1='A', COL2='B'
WHERE ID='ABC'
WHEN NOT MATCHED THEN
INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')
根据我在不同网站上阅读的内容,使用MERGE
在CPU读取方面稍微昂贵,根据此网站上的实验进行了验证。但他们纯粹是为了数据库脚本使用而这么做的,在他们的上下文中,他们使用了2个表,而我的使用背景是通过API调用和使用DUAL
。
我还在这个问题上阅读过,MERGE
可能会导致ORA-0001: unique constraint
和一些并发问题。
我想在可以同时进行不同行的其他操作的表上执行此操作,对于相同行值,可能性非常小。因此,我想知道应该为这种用例选择哪种方法,我知道这可能是一个常见的问题,但我找不到我想要的答案。我想了解这两种方法的性能/可靠性。
英文:
I have an Java(Spring) REST API endpoint where I get 3 data inputs and I need to Insert in the oracle database based on some unique ID
using JDBCTemplate
. But just to be sure something doesn't break, I want have a check first if I need to insert or just update.
1st Approach
Make a database call with a simple query like
SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1
And based on the value of count, make a separate Database call for Insert or Update. (count would never exceed 1)
2nd Approach
Make one single MERGE
query hit using jdbctemplate.update()
that would look like
MERGE INTO TABLENAME
USING DUAL ON ID='ABC'
WHEN MATCHED THEN UPDATE
SET COL1='A', COL2='B'
WHERE ID='ABC'
WHEN NOT MATCHED THEN
INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')
Based on what I read on different sites, using MERGE
is a bit more costly in terms of CPU reads based on an experiment on this site. But they have done it for purely for DB script use where they do it with 2 tables and my context of use is via API call and using DUAL
.
I also read on this question that MERGE
could result in ORA-0001: unique constraint
and some concurrency issue.
I want to do this on a table on which some other operation is possible at the same time for a different row and a very very small chance for the same row value. So I want to know which approach to follow for such use case and I know this might be a common one but I could not find answer to what I'm looking for anywhere. I want to know the performance/reliability of both approach.
答案1
得分: 2
Looking at the code running in concurrent sessions environment, after each atomic statement we need to ask "what if another session have just broken our assumption?" and make adjustments according to that.
Option 1. Count and decide INSERT or UPDATE
declare
v_count int;
begin
SELECT count(1) INTO v_count FROM my_table WHERE ...;
IF v_count = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
ELSE
UPDATE my_table ...;
END IF;
end;
Option 2. UPDATE, if nothing is updated - INSERT
begin
UPDATE my_table WHERE ...;
IF SQL%COUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
END IF;
end;
Option 3. INSERT, if failed - UPDATE
begin
INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
-- what if another session updated the same row just before this point?
-- this statement will override previous changes
-- what if another session deleted this row?
-- this statement will do nothing silently - is it satisfactory?
-- what if another session locked this row for update?
-- this statement will fail
UPDATE my_table WHERE ...;
end;
Option 4. use MERGE
MERGE INTO my_table
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just a syntactic sugar for the option #1
Option 5. use interface for DML on my_table
-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.
-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.
-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;
PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
-- independent lock handle for each id allows
-- operating on different ids in parallel
begin
begin
request_lock(l_lock_handle);
-->> this code is exactly as in option #2
UPDATE my_table WHERE ...;
IF SQL%COUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
INSERT INTO my_table ...;
END IF;
--<<
exception when others then
release_lock(l_lock_handle);
raise;
end;
release_lock(l_lock_handle);
end;
Thus, we see that options 1,2,3,4 have potential problems that cannot be avoided in a general case. But they could be applied if the safety is guaranteed by domain rules or a particular design conventions.
Option 5 is bulletproof and fast as it relies on the DBMS contracts. Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table
is bare-naked and clients rely on straightforward DML on this table.
I believe that performance is less important than data integrity, but let's mention that for completeness. After proper consideration, it is easy to see that the options order according to the "theoretical" average performance is:
2 -> 5 -> (1,4) -> 3
Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.
I guess currently we see that there will be no magic. Somewhere in the application, it is required to ensure that every id inserted into my_table is unique.
If id values do not matter (95% of cases) - just go for using a SEQUENCE.
Otherwise, create a single point of manipulation on my_table (either in Java or in the DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply the option #2.
英文:
Looking at the code running in concurrent sessions environment, after each atomic statement we need to ask "what if another session have just broken our assumption?" and make adjustments according to that.
Option 1. Count and decide INSERT or UPDATE
declare
v_count int;
begin
SELECT count(1) INTO v_count FROM my_table WHERE ...;
IF v_count = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
ELSE
UPDATE my_table ...;
END IF;
end;
Option 2. UPDATE, if nothing is updated - INSERT
begin
UPDATE my_table WHERE ...;
IF SQL%COUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- this statement will fail
INSERT INTO my_table ...;
END IF;
end;
Option 3. INSERT, if failed - UPDATE
begin
INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
-- what if another session updated the same row just before this point?
-- this statement will override previous changes
-- what if another session deleted this row?
-- this statement will do nothing silently - is it satisfactory?
-- what if another session locked this row for update?
-- this statement will fail
UPDATE my_table WHERE ...;
end;
Option 4. use MERGE
MERGE INTO my_table
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...
-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just a syntactic sugar for the option #1
Option 5. use interface for DML on my_table
-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.
-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.
-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;
PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
-- independent lock handle for each id allows
-- operating on different ids in parallel
begin
begin
request_lock(l_lock_handle);
-->> this code is exactly as in option #2
UPDATE my_table WHERE ...;
IF SQL%COUNT = 0 THEN
-- what if another session inserted the same row just before this point?
-- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
INSERT INTO my_table ...;
END IF;
--<<
exception when others then
release_lock(l_lock_handle);
raise;
end;
release_lock(l_lock_handle);
end;
Not going too deep into low level details here, see this article to find out how to use locks in Oracle DBMS.
Thus, we see that options 1,2,3,4 have potential problems that cannot be avoided in a general case. But they could be applied if the safety is guaranteed by domain rules or a particular design conventions.
Option 5 is bulletproof and fast as it is relies on the DBMS contracts.
Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table
is barenaked and clients rely on straightforward DML on this table.
I believe that performance is less important than data integrity, but let's mention that for completeness.
After proper consideration it is easy to see that the options order according to the "theoretical" average performance is:
2 -> 5 -> (1,4) -> 3
Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.
I guess currently we see that there will be no magic. Somewhere in the application it is required to ensure that every id
inserted into my_table
is unique.
If id values do not matter (95% of cases) - just go for using a SEQUENCE.
Otherwise, create a single point of manipulation on my_table
(either in Java or in DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table
, then it is possible to just apply the option #2.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论