存储过程用于无循环或游标的表搜索。

huangapple go评论62阅读模式
英文:

Stored procedure for table search without cycles or cursors

问题

I have translated the content you provided. Here is the translated part:

我在我的 SQL Server 2008 数据库中有一个名为 vwValues 的视图,它包含以下列:

ExamObjID (uniqueidentifier)
Locus (varchar(10))
ValOrder (tinyint)
Value (varchar(5))
IndexType (char(1))
PersCount (tinyint))

以及类似以下示例数据:

ExamObjID Locus ValOrder Value IndexType PersCount
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 1 12 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 2 12 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L2T54332 1 14 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L2T54332 2 15 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L3R12 1 11 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L3R12 2 17 C 1

对于任何 ExamObjID,视图中可以包含 6 到 27 个 Locus,每个 Locus 最多包含 4 个 Value(以及相应的 ValOrder 枚举了该 Value)。此外,一个 ExamObjID 只能具有一个不同的 IndexTypePersCount

数据库还有一个存储过程,接受特定的 ExamObjID,创建一个与 vwValues 视图相同结构的表变量 @Target(省略 IndexType 和 PersCount 列),并将该表用该 ExamObjID 的所有条目填充。

现在,存储过程的功能是选择视图中的其他 ExamObjID,启动一些相当繁琐的游标,调用标量函数,以比较 @target 表与视图中每个具有 PersCount = 1 的其他 ExamObjID

比较的条件是:如果两个 ExamObjID 具有至少 6 个相同的 Locus,并且在每个(最多 2 个例外)相同的 Locus 中至少有 1 个相同的 Value,则这两个 ExamObjID 被视为 "匹配"。

每个 "匹配" 的 ExamObject 被写入 SearchResults 表(SearchResultID uniqueidentifier, ExamObjID uniqueidentifier, MatchObjID uniqueidentifier, MisMatchCount (tinyint)),然后游标移动到视图中的下一个 ExamObjID。MisMatchCount 列存储在比较过程中不匹配的 Locus 数量,因此它的值介于 0 和 2 之间。

毫无疑问,这整个方案非常慢,在具有数十万个 ID 的数据库中,每次搜索可能需要大约 15 分钟。

所以问题是:是否有办法用一些无循环的代码块替换所有这些 "游标和标量函数" 部分?我不是专业程序员,但真的想改进这部分以减少搜索时间。非常感激任何提示和想法。同时,我真的不知道是否有可能实现这一点,请告诉我是否不可能,我将寻找其他改进情况的方法。

英文:

I have a view vwValues in my SQL Server 2008 database with these columns:

ExamObjID (uniqueidentifier)
Locus (varchar(10))
ValOrder (tinyint)
Value (varchar(5))
IndexType (char(1))
PersCount (tinyint))

and sample data like this:

ExamObjID Locus ValOrder Value IndexType PersCount
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 1 12 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 2 12 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L2T54332 1 14 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L2T54332 2 15 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L3R12 1 11 C 1
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L3R12 2 17 C 1

So for any ExamObjID there is anywhere between 6 and 27 Locus each with up to 4 Value's (and respective ValOrder enumerating that Value's ). Also one ExamObjID can have only one distinct IndexType and PersCount.

The database also has a stored procedure that accepts specific ExamObjID , builds a table variable @Target with the same structure as vwValues view (omiting IndexType and PersCount columns) and populates it with all entries from this view that have such ExamObjID .

Now what it makes is selects other ExamObjID 's from the view, starts some rather cumbersome cursor that calls for scalar functions - all to compare the @target table to every other ExamObjID in view that has PersCount = 1 .

The comparison terms are: the two ExamObjID 's are considered a "match" if they have at least 6 identical Locus and in every (exept for maximum 2) identical Locus they have at least 1 identical Value.

Every "matching" ExamObject is written into SearchResults Table (SearchResultID uniqueidentifier, ExamObjID uniqueidentifier, MatchObjID uniqueidentifier, MisMatchCount (tinyint)) and then the cursor moves on to the next ExamObjID in the view. MisMatchCount column stores the quantity of Locus that mismathed during comparison so it stores values between 0 and 2.

Needless to say all this scheme is very slow and for a DB that has hundreds of thousands ID's it can take anywhere around 15 minutes per one search.

So the question is: is there any way to replace all this "cursors and scalar functions" part with some neat cycleless code block? I am nowhere around being a professional coder but really want to improve this part to reduce search times. Will be very grateful for any tips and ideas. Also I do not really know if it is even possible to achieve so please tell me it is impossible and I will search for another ways to improve the situation.

UPD1: more about matching rules.

Per-Locus match:
this Locus of ExamObjID f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 :

ExamObjID Locus ValOrder Value
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 1 12
f9576b7b-20a1-4d63-8c80-b8ffd68aefc5 L1T1234 2 14

is considered a match with this Locus of ExamObjID e1185832-a6b7-4b08-913c-fb75be0f8588:

ExamObjID Locus ValOrder Value
e1185832-a6b7-4b08-913c-fb75be0f8588 L1T1234 1 14
e1185832-a6b7-4b08-913c-fb75be0f8588 L1T1234 2 15

because they share at least one common Value = 14.

Per-ExamObjID match: all common Locus in two ExamObjID must match with exception of maximum 2 Locus'es.

UPD2:
tried to create a snippet that simulates new stored procedure. still doing something wrong, it just returns all other ExamObjects from the database like they have 0 mismatches.

DECLARE @ExamObjID UNIQUEIDENTIFIER = '62DA5C53-E70A-473A-923B-388232B79AFF'
DECLARE @Target TABLE (
	ExamObjID UNIQUEIDENTIFIER,
	Locus VARCHAR(15),
	Value VARCHAR(5)
);
DECLARE @Result TABLE (ExamObjID uniqueidentifier, MismatchCount tinyint);

INSERT INTO @Target 
	SELECT ExamObjID, Locus, Value FROM vwValues WHERE ExamObjID = @ExamObjID;

BEGIN
    SET NOCOUNT ON;

    SELECT DISTINCT a.ExamObjID, 
           COUNT(CASE WHEN b.value = a.value THEN 1 END) AS Matches, 
           COUNT(CASE WHEN b.value <> a.value THEN 1 END) AS Mismatches
    INTO #temp
    FROM vwValues a
    INNER JOIN @Target b
    ON a.Locus = b.Locus AND a.value = b.value AND a.ExamObjID <> b.ExamObjID
    GROUP BY a.ExamObjID
    
    INSERT INTO @Result(ExamObjID, MismatchCount)
    SELECT ExamObjID, Mismatches
    FROM #temp
    WHERE Matches >= 6 AND Mismatches <= 2;
    
    DROP TABLE #temp;
	SELECT * FROM @Result
END

UPD[3]:
More on per-ExamObjID match and "common locus'es" between 2 ExamObjects.
Assume we have 2 ExamObjects in vwValues with 2 different ExamObjID's. Those ExamObjID's are uniqueidentifiers, but lets assume they are stored in @Id1 and @Id2 variables.
Query SELECT DISTINCT Locus FROM vwValues WHERE ExamObjID = @Id1 returns a list:
L1T1234, L2T54332, L3R12, L4R4, L5RTdev
while query SELECT DISTINCT Locus FROM vwValue WHERE ExamObjID = @Id2 returns a list: L3R12,L4R4,L5RTdev,L6T10,L7indel,L8cascade
So there are three Locus'es that are common for @Id1 and @Id2: L3R12,L4R4,L5RTdev. Only data for those 3 Locus is relevant when comparing ExamObjID's data. In real database every ExamObjID has anywhere from 8 to 27 distinct Locus and the search algorythm must assume that to compare data for 2 ExamObjID's they must have at least 6 common distinct Locus.
So the fact that 2 ExamObjID's have at least 6 common Locus is a requirement for their comparison to begin to establish if they can be a match.
Now about the matching rules. For every common Locus for two ExamObjID's we compare there must be at least 1 identical Value. The example of 2 matching Locus provided previously in the section "UPD1" of this post. Exeption can be for at most two common Locus, meaning that if two ExamObjects have 10 common Locus at least 8 of that Locus must match.
If any matching ExamObjID's were found, the procedure stores this result in a table SearchResults with columns

ExamObjID uniqueidentifier, -- target ExamObjID
MatchObjID uniqueidentifier, -- Matching ExamObjID,
MisMatchCnt tinyint -- number of mismatched common Locus

Also there is a script with schema and data for a tiny database I made for testing purpose. There are 10 distinct ExamObjects that have either identical or slightly different values in common Loci.
<https://drive.google.com/file/d/17omKglwPCbjvqmT1UYRkt-izi3ykQoOy/view?usp=sharing>

答案1

得分: 1

Yes you can do this all in one query.

This is a type of query called Relational Division With Remainder, with the added tweak that you want certain conditions per divided set.

I would advise, given that you appear to be querying from a view, to put the divisor (the Target) into a table variable that has a primary key. In some situations you can just combine the first SELECT below directly into the second.

DECLARE @ExamObjID UNIQUEIDENTIFIER = '62DA5C53-E70A-473A-923B-388232B79AFF'
DECLARE @Target TABLE (
    ExamObjID UNIQUEIDENTIFIER NOT NULL,
    Locus VARCHAR(15) NOT NULL,
    Value VARCHAR(5) NOT NULL,
    PRIMARY KEY (Locus, ExamObjID, Value)
);

INSERT INTO @Target (ExamObjID, Locus, Value)
SELECT ExamObjID, Locus, Value
FROM vwValues
WHERE ExamObjID = @ExamObjID;

SELECT
  other.ExamObjID,
  Mismatches = COUNT(CASE WHEN other.Matches = 0 THEN 1 END)
FROM (
    SELECT
      other.ExamObjID,
      other.Locus,
      Matches = COUNT(CASE WHEN other.Value = t.Value THEN 1 END)
    FROM vwValues other
    JOIN @Target t ON t.Locus = other.Locus
    WHERE t.ExamObjID <> other.ExamObjID
    GROUP BY
      other.ExamObjID,
      other.Locus
) other
GROUP BY
  other.ExamObjID
HAVING COUNT(*) >= 6
   AND COUNT(CASE WHEN other.Matches = 0 THEN 1 END) <= 2;

db<>fiddle

The logic is as follows:

  • Within a derived subquery, join the target set with the rest of the table (excluding itself) joined only by Locus.
  • Group by both ExamObjID and Locus, taking a count of rows which match by Value.
  • Group again on the outside just by ExamObjID
  • Exclude any groups which have less than 6 rows per Locus...
  • ... and any which have their number of "rows that have no matches" more than 2
  • Finally output the ExamObjID along with that number of mismatches.
英文:

Yes you can do this all in one query.

This is a type of query called Relational Division With Remainder, with the added tweak that you want certain conditions per divided set.

I would advise, given that you appear to be querying from a view, to put the divisor (the Target) into a table variable that has a primary key. In some situations you can just combine the first SELECT below directly into the second.

DECLARE @ExamObjID UNIQUEIDENTIFIER = &#39;62DA5C53-E70A-473A-923B-388232B79AFF&#39;
DECLARE @Target TABLE (
    ExamObjID UNIQUEIDENTIFIER NOT NULL,
    Locus VARCHAR(15) NOT NULL,
    Value VARCHAR(5) NOT NULL,
    PRIMARY KEY (Locus, ExamObjID, Value)
);

INSERT INTO @Target (ExamObjID, Locus, Value)
SELECT ExamObjID, Locus, Value
FROM vwValues
WHERE ExamObjID = @ExamObjID;

SELECT
  other.ExamObjID,
  Mismatches = COUNT(CASE WHEN other.Matches = 0 THEN 1 END)
FROM (
    SELECT
      other.ExamObjID,
      other.Locus,
      Matches = COUNT(CASE WHEN other.Value = t.Value THEN 1 END)
    FROM vwValues other
    JOIN @Target t ON t.Locus = other.Locus
    WHERE t.ExamObjID &lt;&gt; other.ExamObjID
    GROUP BY
      other.ExamObjID,
      other.Locus
) other
GROUP BY
  other.ExamObjID
HAVING COUNT(*) &gt;= 6
   AND COUNT(CASE WHEN other.Matches = 0 THEN 1 END) &lt;= 2;

db<>fiddle

The logic is as follows:

  • Within a derived subquery, join the target set with the rest of the table (excluding itself) joined only by Locus.

  • Group by both ExamObjID and Locus, taking a count of rows which match by Value.

  • Group again on the outside just by ExamObjID

  • Exclude any groups which have less than 6 rows ie per Locus...

  • ... and any which have their number of "rows that have no matches" more than 2

  • Finally output the ExamObjID along with that number of mismatches.

huangapple
  • 本文由 发表于 2023年3月31日 23:58:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/75900540.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定