2020年8月14日 19:08:50go评论96阅读模式

英文:

SQL Merge vs Check and Insert/Update in Java

问题

我有一个Java（Spring）REST API端点，在这里我获得3个数据输入，我需要根据一些唯一的“ID”使用“JDBCTemplate”插入Oracle数据库。但只是为了确保不出问题，我想首先进行检查，看看是否需要插入还是只需更新。

第一种方法

使用一个简单的查询进行数据库调用，类似于：

SELECT COUNT(*) FROM TABLENAME WHERE ID='ABC' AND ROWNUM=1

根据计数的值，根据需要进行单独的插入或更新数据库调用。（计数永远不会超过1）

第二种方法

使用jdbctemplate.update()进行一次单一的MERGE查询操作，类似于：

MERGE INTO TABLENAME 
USING DUAL ON ID='ABC'
WHEN MATCHED THEN UPDATE 
    SET COL1='A', COL2='B'
    WHERE ID='ABC'
WHEN NOT MATCHED THEN
    INSERT (ID, COL1, COL2) VALUES ('ABC','A','B')

根据我在不同网站上阅读的内容，使用MERGE在CPU读取方面稍微昂贵，根据此网站上的实验进行了验证。但他们纯粹是为了数据库脚本使用而这么做的，在他们的上下文中，他们使用了2个表，而我的使用背景是通过API调用和使用DUAL。

我还在这个问题上阅读过，MERGE可能会导致ORA-0001: unique constraint和一些并发问题。

我想在可以同时进行不同行的其他操作的表上执行此操作，对于相同行值，可能性非常小。因此，我想知道应该为这种用例选择哪种方法，我知道这可能是一个常见的问题，但我找不到我想要的答案。我想了解这两种方法的性能/可靠性。

英文:

I have an Java(Spring) REST API endpoint where I get 3 data inputs and I need to Insert in the oracle database based on some unique ID using JDBCTemplate. But just to be sure something doesn't break, I want have a check first if I need to insert or just update.

1st Approach

Make a database call with a simple query like

 SELECT COUNT(*) FROM TABLENAME WHERE ID=&#39;ABC&#39; AND ROWNUM=1

And based on the value of count, make a separate Database call for Insert or Update. (count would never exceed 1)

2nd Approach

Make one single MERGE query hit using jdbctemplate.update() that would look like

MERGE INTO TABLENAME 
USING DUAL ON ID=&#39;ABC&#39;
WHEN MATCHED THEN UPDATE 
    SET COL1=&#39;A&#39;, COL2=&#39;B&#39;
    WHERE ID=&#39;ABC&#39;
WHEN NOT MATCHED THEN
    INSERT (ID, COL1, COL2) VALUES (&#39;ABC&#39;,&#39;A&#39;,&#39;B&#39;)

Based on what I read on different sites, using MERGE is a bit more costly in terms of CPU reads based on an experiment on this site. But they have done it for purely for DB script use where they do it with 2 tables and my context of use is via API call and using DUAL.

I also read on this question that MERGE could result in ORA-0001: unique constraint and some concurrency issue.

I want to do this on a table on which some other operation is possible at the same time for a different row and a very very small chance for the same row value. So I want to know which approach to follow for such use case and I know this might be a common one but I could not find answer to what I'm looking for anywhere. I want to know the performance/reliability of both approach.

答案1

得分: 2

Looking at the code running in concurrent sessions environment, after each atomic statement we need to ask "what if another session have just broken our assumption?" and make adjustments according to that.

Option 1. Count and decide INSERT or UPDATE

declare
  v_count int;
begin
  SELECT count(1) INTO v_count FROM my_table WHERE ...;

  IF v_count = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  ELSE
    UPDATE my_table ...;
  END IF;
end;

Option 2. UPDATE, if nothing is updated - INSERT

begin
  UPDATE my_table WHERE ...;

  IF SQL%COUNT = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  END IF;
end;

Option 3. INSERT, if failed - UPDATE

begin
  INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
  -- what if another session updated the same row just before this point?
  -- this statement will override previous changes

  -- what if another session deleted this row?
  -- this statement will do nothing silently - is it satisfactory?

  -- what if another session locked this row for update?
  -- this statement will fail

  UPDATE my_table WHERE ...;
end;

Option 4. use MERGE

MERGE INTO my_table
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...

-- We have no place to put our "what if" question,
-- but unfortunately MERGE is not atomic,
-- it is just a syntactic sugar for the option #1

Option 5. use interface for DML on my_table

-- Create single point of modifications for my_table and prevent direct DML.
-- For instance, if client has no direct access to my_table,
-- use locks to guarantee that only one session at a time
-- can INSERT/UPDATE/DELETE a particular table row.

-- This could be achieved with a stored procedure or a view "INSTEAD OF" trigger.

-- Client has access to the interface only (view and procedures),
-- but the table is hidden.
my_table_v -- VIEW AS SELECT * FROM my_table
my_table_ins_or_upd_proc -- PROCEDURE (...) BEGIN ...DML on my_table ... END;

PROCEDURE my_table_ins_or_upd_proc(pi_row my_table%ROWTYPE) is
l_lock_handle CONSTANT VARCHAR2(100) := 'my_table_' || pi_row.id;
-- independent lock handle for each id allows
-- operating on different ids in parallel
begin
begin
request_lock(l_lock_handle);

-->> this code is exactly as in option #2

UPDATE my_table WHERE ...;

IF SQL%COUNT = 0 THEN
  -- what if another session inserted the same row just before this point?
  -- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
  INSERT INTO my_table ...;
END IF;

--<<

exception when others then
release_lock(l_lock_handle);
raise;
end;

release_lock(l_lock_handle);
end;

Thus, we see that options 1,2,3,4 have potential problems that cannot be avoided in a general case. But they could be applied if the safety is guaranteed by domain rules or a particular design conventions.

Option 5 is bulletproof and fast as it relies on the DBMS contracts. Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table is bare-naked and clients rely on straightforward DML on this table.

I believe that performance is less important than data integrity, but let's mention that for completeness. After proper consideration, it is easy to see that the options order according to the "theoretical" average performance is:

2 -> 5 -> (1,4) -> 3

Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.

I guess currently we see that there will be no magic. Somewhere in the application, it is required to ensure that every id inserted into my_table is unique.

If id values do not matter (95% of cases) - just go for using a SEQUENCE.

Otherwise, create a single point of manipulation on my_table (either in Java or in the DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply the option #2.

英文:

Option 1. Count and decide INSERT or UPDATE

declare
  v_count int;
begin
  SELECT count(1) INTO v_count FROM my_table WHERE ...;

  IF v_count = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
    INSERT INTO my_table ...;
  ELSE
    UPDATE my_table ...;
  END IF;
end;

Option 2. UPDATE, if nothing is updated - INSERT

begin
  UPDATE my_table WHERE ...;

  IF SQL%COUNT = 0 THEN
    -- what if another session inserted the same row just before this point?
    -- this statement will fail
	INSERT INTO my_table ...;
  END IF;
end;

Option 3. INSERT, if failed - UPDATE

begin
  INSERT INTO my_table ...;
exception when DUP_VAL_ON_INDEX then
  -- what if another session updated the same row just before this point?
  -- this statement will override previous changes
   
  -- what if another session deleted this row?
  -- this statement will do nothing silently - is it satisfactory?
   
  -- what if another session locked this row for update?
  -- this statement will fail  

  UPDATE my_table WHERE ...;  
end;

Option 4. use MERGE

 MERGE INTO my_table
	WHEN MATCHED THEN UPDATE ...
	WHEN NOT MATCHED THEN INSERT ...
	
 -- We have no place to put our &quot;what if&quot; question,
 -- but unfortunately MERGE is not atomic,
 -- it is just a syntactic sugar for the option #1

Option 5. use interface for DML on my_table

 -- Create single point of modifications for my_table and prevent direct DML.
 -- For instance, if client has no direct access to my_table,
 -- use locks to guarantee that only one session at a time
 -- can INSERT/UPDATE/DELETE a particular table row.

 -- This could be achieved with a stored procedure or a view &quot;INSTEAD OF&quot; trigger.

 -- Client has access to the interface only (view and procedures),
 -- but the table is hidden.
     my_table_v                -- VIEW AS SELECT * FROM my_table
     my_table_ins_or_upd_proc  -- PROCEDURE (...) BEGIN ...DML on my_table ... END; 

  

 PROCEDURE my_table_ins_or_upd_proc(pi_row  my_table%ROWTYPE) is  
   l_lock_handle CONSTANT VARCHAR2(100) := &#39;my_table_&#39; || pi_row.id;
   -- independent lock handle for each id allows
   -- operating on different ids in parallel
 begin
   begin
     request_lock(l_lock_handle);
     
     --&gt;&gt; this code is exactly as in option #2
	 
 	 UPDATE my_table WHERE ...;
   
     IF SQL%COUNT = 0 THEN
       -- what if another session inserted the same row just before this point?
       -- NOPE it cannot happen: another session is waiting for a lock on the line # request_lock(...)
	   INSERT INTO my_table ...;
     END IF;
	 
	 --&lt;&lt;
	 
   exception when others then
     release_lock(l_lock_handle);
	 raise;
   end;
   
   release_lock(l_lock_handle); 
 end;

Not going too deep into low level details here, see this article to find out how to use locks in Oracle DBMS.

Option 5 is bulletproof and fast as it is relies on the DBMS contracts.
Nevertheless, this will be a prize for clean design, and it cannot be implemented if my_table is barenaked and clients rely on straightforward DML on this table.

I believe that performance is less important than data integrity, but let's mention that for completeness.
After proper consideration it is easy to see that the options order according to the "theoretical" average performance is:

2 -&gt; 5 -&gt; (1,4) -&gt; 3

Of course, the step of performance measuring goes after obtaining at least two properly working solutions, and should be done exclusively for a particular application under a given workload profile. And that is another story. At this moment no need to bother about theoretical nanoseconds in some synthetic benchmarks.

I guess currently we see that there will be no magic. Somewhere in the application it is required to ensure that every id inserted into my_table is unique.

If id values do not matter (95% of cases) - just go for using a SEQUENCE.

Otherwise, create a single point of manipulation on my_table (either in Java or in DBMS schema PL/SQL) and control the uniqueness there. If the application can guarantee that at most a single session at a time manipulates data in my_table, then it is possible to just apply the option #2.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

SQL的合并（Merge）与Java中的检查并插入/更新（Check and Insert/Update）相比较：

问题

答案1

将一个方法转换为泛型方法

使用对象一次只能由一个线程使用

Spring Webflux + Spring Security for X509 certificate validation issue: x509.X509CertImpl cannot be cast to class java.lang.String

如何在JVM启动时预加载已使用的类？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论