SQL的合并(Merge)与Java中的检查并插入/更新(Check and Insert/Update)相比较:

huangapple go评论96阅读模式
英文:

SQL Merge vs Check and Insert/Update in Java

问题

我有一个Java(Spring)REST API端点,在这里我获得3个数据输入,我需要根据一些唯一的“ID”使用“JDBCTemplate”插入Oracle数据库。但只是为了确保不出问题,我想首先进行检查,看看是否需要插入还是只需更新。

第一种方法

使用一个简单的查询进行数据库调用,类似于:

SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1

根据计数的值,根据需要进行单独的插入或更新数据库调用。(计数永远不会超过1)

第二种方法

使用jdbctemplate.update()进行一次单一的MERGE查询操作,类似于:

MERGE INTO TABLENAME 
USING DUAL ON ID='ABC'
WHEN MATCHED THEN UPDATE 
    SET COL1='A', COL2='B'
    WHERE ID='ABC'
WHEN NOT MATCHED THEN
    INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')

根据我在不同网站上阅读的内容,使用MERGE在CPU读取方面稍微昂贵,根据此网站上的实验进行了验证。但他们纯粹是为了数据库脚本使用而这么做的,在他们的上下文中,他们使用了2个表,而我的使用背景是通过API调用和使用DUAL

我还在这个问题上阅读过MERGE可能会导致ORA-0001: unique constraint和一些并发问题。

我想在可以同时进行不同行的其他操作的表上执行此操作,对于相同行值,可能性非常小。因此,我想知道应该为这种用例选择哪种方法,我知道这可能是一个常见的问题,但我找不到我想要的答案。我想了解这两种方法的性能/可靠性。

英文:

I have an Java(Spring) REST API endpoint where I get 3 data inputs and I need to Insert in the oracle database based on some unique ID using JDBCTemplate. But just to be sure something doesn't break, I want have a check first if I need to insert or just update.

1st Approach

Make a database call with a simple query like

 SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1

And based on the value of count, make a separate Database call for Insert or Update. (count would never exceed 1)

2nd Approach

Make one single MERGE query hit using jdbctemplate.update() that would look like

MERGE INTO TABLENAME 
USING DUAL ON ID='ABC'
WHEN MATCHED THEN UPDATE 
    SET COL1='A', COL2='B'
    WHERE ID='ABC'
WHEN NOT MATCHED THEN
    INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')

Based on what I read on different sites, using MERGE is a bit more costly in terms of CPU reads based on an experiment on this site. But they have done it for purely for DB script use where they do it with 2 tables and my context of use is via API call and using DUAL.

I also read on this question that MERGE could result in ORA-0001: unique constraint and some concurrency issue.

I want to do this on a table on which some other operation is possible at the same time for a different row and a very very small chance for the same row value. So I want to know which approach to follow for such use case and I know this might be a common one but I could not find answer to what I'm looking for anywhere. I want to know the performance/reliability of both approach.

答案1

得分: 2

Looking at the code running in concurrent sessions environment, after each atomic statement we need to ask "what if another session have just broken our assumption?" and make adjustments according to that.

Option 1. Count and decide INSERT or UPDATE

declare
  v_count int;
begin
  SELECT count(1) INTO v_count FROM my_table WHERE ...;

  IF v_count = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  ELSE
    UPDATE my_table ...;
  END IF;
end;

Option 2. UPDATE, if nothing is updated - INSERT

begin
  UPDATE my_table WHERE ...;

  IF SQL%COUNT = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  END IF;
end;

Option 3. INSERT, if failed - UPDATE

begin
  INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
  -- what if another session updated the same row just before this point?
  -- this statement will override previous changes

  -- what if another session deleted this row?
  -- this statement will do nothing silently - is it satisfactory?

  -- what if another session locked this row for update?
  -- this statement will fail

  UPDATE my_table WHERE ...;
end;

Option 4. use MERGE

MERGE INTO my_table
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...

-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just a syntactic sugar for the option #1

Option 5. use interface for DML on my_table

-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.

-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.

-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;

PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
-- independent lock handle for each id allows
-- operating on different ids in parallel
begin
begin
request_lock(l_lock_handle);

-->> this code is exactly as in option #2

UPDATE my_table WHERE ...;

IF SQL%COUNT = 0 THEN
  -- what if another session inserted the same row just before this point?
  -- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
  INSERT INTO my_table ...;
END IF;

--<<

exception when others then
release_lock(l_lock_handle);
raise;
end;

release_lock(l_lock_handle);
end;

Thus, we see that options 1,2,3,4 have potential problems that cannot be avoided in a general case. But they could be applied if the safety is guaranteed by domain rules or a particular design conventions.

Option 5 is bulletproof and fast as it relies on the DBMS contracts. Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table is bare-naked and clients rely on straightforward DML on this table.

I believe that performance is less important than data integrity, but let's mention that for completeness. After proper consideration, it is easy to see that the options order according to the "theoretical" average performance is:

2 -> 5 -> (1,4) -> 3

Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.

I guess currently we see that there will be no magic. Somewhere in the application, it is required to ensure that every id inserted into my_table is unique.

If id values do not matter (95% of cases) - just go for using a SEQUENCE.

Otherwise, create a single point of manipulation on my_table (either in Java or in the DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply the option #2.

英文:

Looking at the code running in concurrent sessions environment, after each atomic statement we need to ask "what if another session have just broken our assumption?" and make adjustments according to that.

Option 1. Count and decide INSERT or UPDATE

declare
  v_count int;
begin
  SELECT count(1) INTO v_count FROM my_table WHERE ...;

  IF v_count = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  ELSE
    UPDATE my_table ...;
  END IF;
end;  

Option 2. UPDATE, if nothing is updated - INSERT

begin
  UPDATE my_table WHERE ...;

  IF SQL%COUNT = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
	INSERT INTO my_table ...;
  END IF;
end;  

Option 3. INSERT, if failed - UPDATE

begin
  INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
  -- what if another session updated the same row just before this point?
  -- this statement will override previous changes
   
  -- what if another session deleted this row?
  -- this statement will do nothing silently - is it satisfactory?
   
  -- what if another session locked this row for update?
  -- this statement will fail  

  UPDATE my_table WHERE ...;  
end;

Option 4. use MERGE

 MERGE INTO my_table
	WHEN MATCHED THEN UPDATE ...
	WHEN NOT MATCHED THEN INSERT ...
	
 -- We have no place to put our &quot;what if&quot; question,
 -- but unfortunately MERGE is not atomic,
 -- it is just a syntactic sugar for the option #1

Option 5. use interface for DML on my_table

 -- Create single point of modifications for my_table and prevent direct DML.
 -- For instance, if client has no direct access to my_table,
 -- use locks to guarantee that only one session at a time
 -- can INSERT/UPDATE/DELETE a particular table row.

 -- This could be achieved with a stored procedure or a view &quot;INSTEAD OF&quot; trigger.

 -- Client has access to the interface only (view and procedures),
 -- but the table is hidden.
     my_table_v                -- VIEW AS SELECT * FROM my_table
     my_table_ins_or_upd_proc  -- PROCEDURE (...) BEGIN ...DML on my_table ... END; 

  

 PROCEDURE my_table_ins_or_upd_proc(pi_row  my_table%ROWTYPE) is  
   l_lock_handle CONSTANT VARCHAR2(100) := &#39;my_table_&#39; || pi_row.id;
   -- independent lock handle for each id allows
   -- operating on different ids in parallel
 begin
   begin
     request_lock(l_lock_handle);
     
     --&gt;&gt; this code is exactly as in option #2
	 
 	 UPDATE my_table WHERE ...;
   
     IF SQL%COUNT = 0 THEN
       -- what if another session inserted the same row just before this point?
       -- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
	   INSERT INTO my_table ...;
     END IF;
	 
	 --&lt;&lt;
	 
   exception when others then
     release_lock(l_lock_handle);
	 raise;
   end;
   
   release_lock(l_lock_handle); 
 end;

Not going too deep into low level details here, see this article to find out how to use locks in Oracle DBMS.


Thus, we see that options 1,2,3,4 have potential problems that cannot be avoided in a general case. But they could be applied if the safety is guaranteed by domain rules or a particular design conventions.

Option 5 is bulletproof and fast as it is relies on the DBMS contracts.
Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table is barenaked and clients rely on straightforward DML on this table.

I believe that performance is less important than data integrity, but let's mention that for completeness.
After proper consideration it is easy to see that the options order according to the "theoretical" average performance is:

2 -&gt; 5 -&gt; (1,4) -&gt; 3

Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.


I guess currently we see that there will be no magic. Somewhere in the application it is required to ensure that every id inserted into my_table is unique.

If id values do not matter (95% of cases) - just go for using a SEQUENCE.

Otherwise, create a single point of manipulation on my_table (either in Java or in DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply the option #2.

huangapple
  • 本文由 发表于 2020年8月14日 19:08:50
  • 转载请务必保留本文链接:https://go.coder-hub.com/63411611.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定