英文:
Doubts about Mysql query statements: EXISTS and IN
问题
我有三个表:jcxx_device_check_log,sys_depart 和 jcxx_device_info。
jcxx_device_check_log 表包含 640,000 条数据。
sys_depart 表包含 260 条数据。
jcxx_device_info 表包含 840 条数据。
exists SQL 查询的 EXPLAIN 格式为 JSON:
"cost_info": { "query_cost": "453972.80" },
in SQL 查询的 EXPLAIN 格式为 JSON:
"cost_info": { "query_cost": "2025375.58" },
这是我的猜测:
exists SQL 查询的结果集不保存在内存中,只返回 true 和 false。
这是我的问题:
我想知道我的猜测是否正确,以及在遇到大量数据时如何选择语句。
英文:
I have three tables jcxx_device_check_log,sys_depart and jcxx_device_info
the jcxx_device_check_log contains 640000 data volume,
the sys_depart contains 260 data volume
the jcxx_device_info contains 840 data volume
the exists sql
EXPLAIN
format = Json
SELECT state,check_time,note,device_id,COUNT(1) FROM jcxx_device_check_log jc
WHERE exists (
SELECT jdi.id FROM sys_depart sd
INNER JOIN jcxx_device_info jdi on sd.id=jdi.sys_depart_id
where jc.device_id = jdi.id
) and check_time BETWEEN '2023-4-18 10:00:00' AND ' 2023-4-18 18:00:00'
GROUP BY device_id,state
ORDER BY device_id,state
the EXPLAIN format = Json results
"cost_info": {
"query_cost": "453972.80"
},
the in sql
EXPLAIN
format = Json
SELECT state,check_time,device_id,COUNT(1) FROM jcxx_device_check_log jc
WHERE device_id in (
SELECT jdi.id FROM sys_depart sd
INNER JOIN jcxx_device_info jdi on jdi.sys_depart_id=sd.id
) and check_time BETWEEN '2023-4-18 10:00:00' AND ' 2023-4-18 18:00:00'
GROUP BY device_id,state
ORDER BY device_id,state;
the EXPLAIN format = Json results
"cost_info": {
"query_cost": "2025375.58"
},
this is my guess
the result set of exists sql does not exist in the memory and only returns true and false.
this is my question
I want to know whether my guess is correct and how to choose a statement next time I encounter a large amount of data
答案1
得分: 2
以下是您要翻译的内容:
There is no single simple answer to "how to choose a statement next time" as the conditions of different tables/indices/statistics etc. will all influence the outcome. Furthermore, there is a lot of myth and legend around the topics of IN
and EXISTS
which generally result in a rule of thumb like this: "[The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides better performance.]"
However, such rules of thumb are difficult (if not impossible) to prove and yet again may not be true if conditions are suited to use of IN() (e.g. if the in lists an indexed column and the subquery itself is inherently efficient). So be wary of such "rules" and as another prior answer states it:
[The allegedly performance and technical differences between EXISTS and IN may result from specific vendor's implementations/limitations/bugs, but many times they are nothing but myths created due to lack of understanding of the databases internals.]
You are correct about the YES/NO nature of using EXISTS
, it merely tests if a match can be found, and if found returns TRUE, otherwise FALSE. However, we cannot know if the subquery used is in memory or not.
For knowing when to choose IN
or EXISTS
, in truth there is no absolute guide, and all I can do is list out some guidance:
-
If the "selective predicates(s)" are self-contained in the subquery, use
IN
, e.g.,select t1.* from t1 where t1.x in (select t2.x from t2 where t2.y = 'foo')
-
If the "selective predicates(s)" are from the outer table, use
EXISTS
, e.g.,select t1.* from t1 where exists (select null from t2 where t2.x=t1.x)
-
Where the relationships between the outer and inner tables are complex, use
EXISTS
, e.g.,select t1.* from t1 where exists (select null from t2 where t2.x=t1.x and t2.y>t1.y and t2.z like '%' || t1.z || '%')
-
If seeking the negative of
IN
i.e.NOT IN
, be aware that if the subquery returns NULL, then the whole result will be NULL (and this can be unexpected), so take steps to exclude NULL from the subquery result, e.g.,select t1.* from t1 where t1.x NOT in (select t2.x from t2 where t2.x IS NOT NULL)
- NB: The term "selective predicate" is used to mean any predicate(s) that filter out a significant number of rows.
I recommend you also review the many prior answers to this: "Difference between EXISTS and IN in SQL."
英文:
There is no single simple answer to "how to choose a statement next time" as the conditions of different tables/indices/statistics etc. will all influence the outcome. Furthermore, there is a lot of myth and legend around the topics of IN
and EXISTS
which generally result in a rule of thumb like this: "*The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides a better performance."
However such rules of thumb are difficult (if not impossible) to prove and yet again may not be true if conditions are suited to use of IN() (e.g. if the in lists an indexed column and the subquery itself is inherently efficient). So be wary of such "rules" and as another prior answer states it:
You are correct about the YES/NO nature of using EXISTS
, it merely tests if a match can be found, and if found returns TRUE, otherwise FALSE. However we cannot know if the subquery used is in memory or not.
For knowing when to choose IN
or EXISTS
, in truth there is no absolute guide and all I can do is list out some guidance:
-
If the "selective predicates(s)" is/are self-contained to the subquery use
IN
e.g.
<pre>
select t1.*
from t1
where t1.x in (select t2.x from t2 where <b>t2.y = 'foo'</b>)
</pre> -
If the "selective predicates(s)" is/are from the outer table, use
EXISTS
(refer)e.g.
<pre>
select t1.*
from t1
where exists (select null
from t2
where <b>t2.x=t1.x</b>
)
</pre>
Also note here that anEXISTS
does not actually need to return any values for evaluation so you may useselect null
orselect 1
orselect *
as you prefer. In older variants of some dbs there was a penalty is using *, but this is not the case for most (or all?) dbs today. -
Where the relationships between the outer and inner tables are complex use
EXISTS
(nb: borrowed from this answer)e.g.
<pre>
select t1.*
from t1
where exists (select null
from t2
where <b>t2.x=t1.x
and t2.y>t1.y
and t2.z like '℅' || t1.z || '℅'</b>
)
</pre>
Note that to approach an equivalence of that usingIN
is so much more cumbersome (and might not be allowed as t1 is referenced 2 tiers down):
<pre>
select t1.*
from t1
where t1.x in (select t2.x
from t2
where <b>t2.x=t1.x
and t2.y > (select t1.y from t1 where t1.x = t2.x)
and t2.z like '%' || (select t1.z from t1 where t1.x = t2.x) || '%'</b>
)</pre> -
If seeking the negative of
IN
i.e.NOT IN
be aware that if the subquery returns NULL then the whole result will be NULL (and this can be unexpected) so take steps to exclude NULL from the subquery result> e.g.
<pre>
select t1.*
from t1
where t1.x <b>NOT</b> in (select t2.x from t2 where <b>t2.x IS NOT NULL</b>)
</pre>
- nb: the term "selective predicate" is used to mean any predicate(s) that filters out a significant number of rows.
I recommend you also review the many prior answers to this: "Difference between EXISTS and IN in SQL?"
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论