英文:
Stored procedure for table search without cycles or cursors
问题
I have translated the content you provided. Here is the translated part:
我在我的 SQL Server 2008 数据库中有一个名为 vwValues
的视图,它包含以下列:
ExamObjID (uniqueidentifier)
Locus (varchar(10))
ValOrder (tinyint)
Value (varchar(5))
IndexType (char(1))
PersCount (tinyint))
以及类似以下示例数据:
ExamObjID | Locus | ValOrder | Value | IndexType | PersCount |
---|---|---|---|---|---|
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 1 | 12 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 2 | 12 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L2T54332 | 1 | 14 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L2T54332 | 2 | 15 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L3R12 | 1 | 11 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L3R12 | 2 | 17 | C | 1 |
对于任何 ExamObjID
,视图中可以包含 6 到 27 个 Locus
,每个 Locus
最多包含 4 个 Value
(以及相应的 ValOrder
枚举了该 Value
)。此外,一个 ExamObjID
只能具有一个不同的 IndexType
和 PersCount
。
数据库还有一个存储过程,接受特定的 ExamObjID
,创建一个与 vwValues
视图相同结构的表变量 @Target
(省略 IndexType 和 PersCount 列),并将该表用该 ExamObjID
的所有条目填充。
现在,存储过程的功能是选择视图中的其他 ExamObjID
,启动一些相当繁琐的游标,调用标量函数,以比较 @target
表与视图中每个具有 PersCount = 1
的其他 ExamObjID
。
比较的条件是:如果两个 ExamObjID
具有至少 6 个相同的 Locus
,并且在每个(最多 2 个例外)相同的 Locus
中至少有 1 个相同的 Value
,则这两个 ExamObjID
被视为 "匹配"。
每个 "匹配" 的 ExamObject
被写入 SearchResults
表(SearchResultID uniqueidentifier, ExamObjID uniqueidentifier, MatchObjID uniqueidentifier, MisMatchCount (tinyint)
),然后游标移动到视图中的下一个 ExamObjID
。MisMatchCount 列存储在比较过程中不匹配的 Locus 数量,因此它的值介于 0 和 2 之间。
毫无疑问,这整个方案非常慢,在具有数十万个 ID 的数据库中,每次搜索可能需要大约 15 分钟。
所以问题是:是否有办法用一些无循环的代码块替换所有这些 "游标和标量函数" 部分?我不是专业程序员,但真的想改进这部分以减少搜索时间。非常感激任何提示和想法。同时,我真的不知道是否有可能实现这一点,请告诉我是否不可能,我将寻找其他改进情况的方法。
英文:
I have a view vwValues
in my SQL Server 2008 database with these columns:
ExamObjID (uniqueidentifier)
Locus (varchar(10))
ValOrder (tinyint)
Value (varchar(5))
IndexType (char(1))
PersCount (tinyint))
and sample data like this:
ExamObjID | Locus | ValOrder | Value | IndexType | PersCount |
---|---|---|---|---|---|
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 1 | 12 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 2 | 12 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L2T54332 | 1 | 14 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L2T54332 | 2 | 15 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L3R12 | 1 | 11 | C | 1 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L3R12 | 2 | 17 | C | 1 |
So for any ExamObjID
there is anywhere between 6 and 27 Locus
each with up to 4 Value
's (and respective ValOrder
enumerating that Value
's ). Also one ExamObjID
can have only one distinct IndexType
and PersCount
.
The database also has a stored procedure that accepts specific ExamObjID
, builds a table variable @Target
with the same structure as vwValues
view (omiting IndexType and PersCount columns) and populates it with all entries from this view that have such ExamObjID
.
Now what it makes is selects other ExamObjID
's from the view, starts some rather cumbersome cursor that calls for scalar functions - all to compare the @target
table to every other ExamObjID
in view that has PersCount = 1
.
The comparison terms are: the two ExamObjID
's are considered a "match" if they have at least 6 identical Locus
and in every (exept for maximum 2) identical Locus
they have at least 1 identical Value
.
Every "matching" ExamObject
is written into SearchResults
Table (SearchResultID uniqueidentifier, ExamObjID uniqueidentifier, MatchObjID uniqueidentifier, MisMatchCount (tinyint))
and then the cursor moves on to the next ExamObjID
in the view. MisMatchCount column stores the quantity of Locus that mismathed during comparison so it stores values between 0 and 2.
Needless to say all this scheme is very slow and for a DB that has hundreds of thousands ID's it can take anywhere around 15 minutes per one search.
So the question is: is there any way to replace all this "cursors and scalar functions" part with some neat cycleless code block? I am nowhere around being a professional coder but really want to improve this part to reduce search times. Will be very grateful for any tips and ideas. Also I do not really know if it is even possible to achieve so please tell me it is impossible and I will search for another ways to improve the situation.
UPD1: more about matching rules.
Per-Locus match:
this Locus of ExamObjID f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 :
ExamObjID | Locus | ValOrder | Value |
---|---|---|---|
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 1 | 12 |
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 | L1T1234 | 2 | 14 |
is considered a match with this Locus of ExamObjID e1185832-a6b7-4b08-913c-fb75be0f8588:
ExamObjID | Locus | ValOrder | Value |
---|---|---|---|
e1185832-a6b7-4b08-913c-fb75be0f8588 | L1T1234 | 1 | 14 |
e1185832-a6b7-4b08-913c-fb75be0f8588 | L1T1234 | 2 | 15 |
because they share at least one common Value = 14.
Per-ExamObjID match: all common Locus in two ExamObjID must match with exception of maximum 2 Locus'es.
UPD2:
tried to create a snippet that simulates new stored procedure. still doing something wrong, it just returns all other ExamObjects from the database like they have 0 mismatches.
DECLARE @ExamObjID UNIQUEIDENTIFIER = '62DA5C53-E70A-473A-923B-388232B79AFF'
DECLARE @Target TABLE (
ExamObjID UNIQUEIDENTIFIER,
Locus VARCHAR(15),
Value VARCHAR(5)
);
DECLARE @Result TABLE (ExamObjID uniqueidentifier, MismatchCount tinyint);
INSERT INTO @Target
SELECT ExamObjID, Locus, Value FROM vwValues WHERE ExamObjID = @ExamObjID;
BEGIN
SET NOCOUNT ON;
SELECT DISTINCT a.ExamObjID,
COUNT(CASE WHEN b.value = a.value THEN 1 END) AS Matches,
COUNT(CASE WHEN b.value <> a.value THEN 1 END) AS Mismatches
INTO #temp
FROM vwValues a
INNER JOIN @Target b
ON a.Locus = b.Locus AND a.value = b.value AND a.ExamObjID <> b.ExamObjID
GROUP BY a.ExamObjID
INSERT INTO @Result(ExamObjID, MismatchCount)
SELECT ExamObjID, Mismatches
FROM #temp
WHERE Matches >= 6 AND Mismatches <= 2;
DROP TABLE #temp;
SELECT * FROM @Result
END
UPD[3]:
More on per-ExamObjID
match and "common locus
'es" between 2 ExamObject
s.
Assume we have 2 ExamObjects in vwValues with 2 different ExamObjID's. Those ExamObjID's are uniqueidentifier
s, but lets assume they are stored in @Id1
and @Id2
variables.
Query SELECT DISTINCT Locus FROM vwValues WHERE ExamObjID = @Id1
returns a list:
L1T1234, L2T54332, L3R12, L4R4, L5RTdev
while query SELECT DISTINCT Locus FROM vwValue WHERE ExamObjID = @Id2
returns a list: L3R12,L4R4,L5RTdev,L6T10,L7indel,L8cascade
So there are three Locus
'es that are common for @Id1
and @Id2
: L3R12,L4R4,L5RTdev
. Only data for those 3 Locus
is relevant when comparing ExamObjID
's data. In real database every ExamObjID
has anywhere from 8 to 27 distinct Locus
and the search algorythm must assume that to compare data for 2 ExamObjID
's they must have at least 6 common distinct Locus
.
So the fact that 2 ExamObjID
's have at least 6 common Locus
is a requirement for their comparison to begin to establish if they can be a match.
Now about the matching rules. For every common Locus
for two ExamObjID
's we compare there must be at least 1 identical Value
. The example of 2 matching Locus
provided previously in the section "UPD1" of this post. Exeption can be for at most two common Locus
, meaning that if two ExamObject
s have 10 common Locus
at least 8 of that Locus
must match.
If any matching ExamObjID
's were found, the procedure stores this result in a table SearchResults
with columns
ExamObjID uniqueidentifier, -- target ExamObjID
MatchObjID uniqueidentifier, -- Matching ExamObjID,
MisMatchCnt tinyint -- number of mismatched common Locus
Also there is a script with schema and data for a tiny database I made for testing purpose. There are 10 distinct ExamObjects that have either identical or slightly different values in common Loci.
<https://drive.google.com/file/d/17omKglwPCbjvqmT1UYRkt-izi3ykQoOy/view?usp=sharing>
答案1
得分: 1
Yes you can do this all in one query.
This is a type of query called Relational Division With Remainder, with the added tweak that you want certain conditions per divided set.
I would advise, given that you appear to be querying from a view, to put the divisor (the Target) into a table variable that has a primary key. In some situations you can just combine the first SELECT
below directly into the second.
DECLARE @ExamObjID UNIQUEIDENTIFIER = '62DA5C53-E70A-473A-923B-388232B79AFF'
DECLARE @Target TABLE (
ExamObjID UNIQUEIDENTIFIER NOT NULL,
Locus VARCHAR(15) NOT NULL,
Value VARCHAR(5) NOT NULL,
PRIMARY KEY (Locus, ExamObjID, Value)
);
INSERT INTO @Target (ExamObjID, Locus, Value)
SELECT ExamObjID, Locus, Value
FROM vwValues
WHERE ExamObjID = @ExamObjID;
SELECT
other.ExamObjID,
Mismatches = COUNT(CASE WHEN other.Matches = 0 THEN 1 END)
FROM (
SELECT
other.ExamObjID,
other.Locus,
Matches = COUNT(CASE WHEN other.Value = t.Value THEN 1 END)
FROM vwValues other
JOIN @Target t ON t.Locus = other.Locus
WHERE t.ExamObjID <> other.ExamObjID
GROUP BY
other.ExamObjID,
other.Locus
) other
GROUP BY
other.ExamObjID
HAVING COUNT(*) >= 6
AND COUNT(CASE WHEN other.Matches = 0 THEN 1 END) <= 2;
The logic is as follows:
- Within a derived subquery, join the target set with the rest of the table (excluding itself) joined only by
Locus
. - Group by both
ExamObjID
andLocus
, taking a count of rows which match byValue
. - Group again on the outside just by
ExamObjID
- Exclude any groups which have less than 6 rows per
Locus
... - ... and any which have their number of "rows that have no matches" more than 2
- Finally output the
ExamObjID
along with that number of mismatches.
英文:
Yes you can do this all in one query.
This is a type of query called Relational Division With Remainder, with the added tweak that you want certain conditions per divided set.
I would advise, given that you appear to be querying from a view, to put the divisor (the Target) into a table variable that has a primary key. In some situations you can just combine the first SELECT
below directly into the second.
DECLARE @ExamObjID UNIQUEIDENTIFIER = '62DA5C53-E70A-473A-923B-388232B79AFF'
DECLARE @Target TABLE (
ExamObjID UNIQUEIDENTIFIER NOT NULL,
Locus VARCHAR(15) NOT NULL,
Value VARCHAR(5) NOT NULL,
PRIMARY KEY (Locus, ExamObjID, Value)
);
INSERT INTO @Target (ExamObjID, Locus, Value)
SELECT ExamObjID, Locus, Value
FROM vwValues
WHERE ExamObjID = @ExamObjID;
SELECT
other.ExamObjID,
Mismatches = COUNT(CASE WHEN other.Matches = 0 THEN 1 END)
FROM (
SELECT
other.ExamObjID,
other.Locus,
Matches = COUNT(CASE WHEN other.Value = t.Value THEN 1 END)
FROM vwValues other
JOIN @Target t ON t.Locus = other.Locus
WHERE t.ExamObjID <> other.ExamObjID
GROUP BY
other.ExamObjID,
other.Locus
) other
GROUP BY
other.ExamObjID
HAVING COUNT(*) >= 6
AND COUNT(CASE WHEN other.Matches = 0 THEN 1 END) <= 2;
The logic is as follows:
-
Within a derived subquery, join the target set with the rest of the table (excluding itself) joined only by
Locus
. -
Group by both
ExamObjID
andLocus
, taking a count of rows which match byValue
. -
Group again on the outside just by
ExamObjID
-
Exclude any groups which have less than 6 rows ie per
Locus
... -
... and any which have their number of "rows that have no matches" more than 2
-
Finally output the
ExamObjID
along with that number of mismatches.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论