Comma separated string similarity check in MySQL

Question

I have a table with a column that contains comma-separated ids, like this:

ids 
----
1,2,3,4
1,4,5
1,5
2
2,6,9

Now I need to sort these rows by their similarity (number of common elements) to a given string. For example, if the string is 1,5, the result I need is:

1,5 (exactly the same)
1,4,5 (has 1,5 but also has an extra number)
1,2,3,4 (has only 1)
2 (no common elements; order does not matter)
2,6,9 (no common elements; order does not matter)

My question is: does MySQL have a built-in function that produces the above result, or do I have to write a custom procedure? I tried the MATCH ... AGAINST syntax, but the results were not acceptable.

Note:

In fact, these numbers are tag ids, and I want to find the most similar products.

Thank you in advance.

Answer 1

Score: 1

Storing comma-separated IDs is not the right approach, but I solved a similar problem in the past with this function.

It adds one point to the score for each ID that matches (that worked for me; you might need to add some logic).

DELIMITER //
CREATE FUNCTION similarity_score(input_string VARCHAR(255), tags_string VARCHAR(255)) RETURNS INT
DETERMINISTIC  -- by default MySQL rejects the function without this when binary logging is enabled
BEGIN
    DECLARE score INT DEFAULT 0;
    DECLARE tag VARCHAR(255);
    DECLARE remainder VARCHAR(255);
    
    SET remainder = input_string;

    WHILE LENGTH(remainder) > 0 DO
        IF LOCATE(',', remainder) > 0 THEN
            SET tag = SUBSTRING(remainder, 1, LOCATE(',', remainder) - 1);
            SET remainder = SUBSTRING(remainder, LOCATE(',', remainder) + 1);
        ELSE
            SET tag = remainder;
            SET remainder = '';
        END IF;

        IF FIND_IN_SET(tag, tags_string) THEN
            SET score = score + 1;
        END IF;
    END WHILE;

    RETURN score;
END //
DELIMITER ;

-- Usage
SET @input_tags = '1,5';

SELECT ids 
FROM your_table
ORDER BY similarity_score(@input_tags, ids) DESC, LENGTH(ids), ids;
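
The remark that comma-separated IDs are not the right approach can be made concrete with a junction table. The following is only a sketch: the table name product_tags, its columns, and the sample product ids are assumptions for illustration, not part of the original schema.

```sql
-- Hypothetical junction table: one row per (product, tag) pair.
CREATE TABLE product_tags (
    product_id INT NOT NULL,
    tag_id     INT NOT NULL,
    PRIMARY KEY (product_id, tag_id)
);

INSERT INTO product_tags (product_id, tag_id) VALUES
    (1, 1), (1, 2), (1, 3), (1, 4),   -- product 1 has tags 1,2,3,4
    (2, 1), (2, 4), (2, 5),           -- product 2 has tags 1,4,5
    (3, 1), (3, 5),                   -- product 3 has tags 1,5
    (4, 2),                           -- product 4 has tag 2
    (5, 2), (5, 6), (5, 9);           -- product 5 has tags 2,6,9

-- Rank products by the number of tags shared with the search set (1, 5);
-- ties are broken by the smaller total number of tags.
SELECT product_id,
       SUM(tag_id IN (1, 5)) AS matches,     -- a boolean counts as 0 or 1 in MySQL
       COUNT(*)              AS total_tags
FROM product_tags
GROUP BY product_id
ORDER BY matches DESC, total_tags, product_id;
```

With the sample rows, the products come back in the order 3, 2, 1, 4, 5, which corresponds to 1,5 / 1,4,5 / 1,2,3,4 / 2 / 2,6,9. No string splitting is needed, and the primary key can be used as an index.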

Answer 2

Score: 1

To find the count of matching numbers you can do:

SET @compare = '1,5';
WITH cte AS (
  SELECT
      r AS m,
      a AS x,
      -- s is the a-th element of the comma-separated list
      SUBSTRING_INDEX(SUBSTRING_INDEX(ids, ',', a), ',', -1) AS s,
      ids
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY ids) AS r, ids FROM mytable) m
  -- number table 1..4: handles at most four elements per row
  CROSS JOIN (SELECT 1 AS a UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) b
  WHERE b.a <= LENGTH(m.ids) - LENGTH(REPLACE(m.ids, ',', '')) + 1
)
SELECT
  ids, COUNT(*) AS matches
FROM cte
-- FIND_IN_SET compares whole elements, so '1' does not match inside '15'
WHERE FIND_IN_SET(s, @compare) <> 0
GROUP BY ids
;

Output (rows without any matching element are filtered out by the WHERE clause):

ids      matches
-------  -------
1,2,3,4  1
1,4,5    2
1,5      2

see: DBFIDDLE

With a little more work, you should be able to rank 1,5 above 1,4,5.
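
One possible way to get that ranking, sketched under the assumption of MySQL 8.0 or later (for WITH RECURSIVE) and at most 50 elements per row (an arbitrary cap chosen here): generate the positions with a recursive CTE, keep the non-matching rows, and use the element count as a tie-breaker. The names seq, elements, and total are illustrative.

```sql
SET @compare = '1,5';

WITH RECURSIVE seq (n) AS (
    -- positions 1..50; raise the cap if rows can hold more elements
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 50
),
elements AS (
    -- one output row per element of each comma-separated list
    SELECT m.ids,
           SUBSTRING_INDEX(SUBSTRING_INDEX(m.ids, ',', seq.n), ',', -1) AS s
    FROM mytable AS m
    JOIN seq
      ON seq.n <= LENGTH(m.ids) - LENGTH(REPLACE(m.ids, ',', '')) + 1
)
SELECT ids,
       SUM(FIND_IN_SET(s, @compare) > 0) AS matches,  -- elements shared with @compare
       COUNT(*)                          AS total     -- number of elements in the row
FROM elements
GROUP BY ids
ORDER BY matches DESC, total, ids;
```

For the sample data this orders the rows as 1,5 / 1,4,5 / 1,2,3,4 / 2 / 2,6,9. Because the query groups by ids, rows with identical lists are merged into one result row.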

Answer 3

Score: 0

If the numbers are between 1 and 10 (or some other small range), use the bits in a SMALLINT UNSIGNED to represent the numbers. So, for "1,4,5", use the value

1<<1 | 1<<4 | 1<<5

That is: 2 | 16 | 32 = 50

Then for "similarity", count the number of bits that are the same. The set "1,2,5" is the number 38, so the number of bits in common is

SELECT BIT_COUNT(50 & 38)

which is 2 (namely "1,5"). You could deconstruct 50 & 38 = 34 = 2 + 32 to discover "1,5".
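
The deconstruction step can also be done in SQL. As a rough sketch, assuming MySQL 8.0 or later and ids in the range 0..15 (the bits of a SMALLINT UNSIGNED), the query below lists which bit positions are set in 50 & 38; the CTE name nums is made up for this example.

```sql
WITH RECURSIVE nums (n) AS (
    -- candidate bit positions 0..15
    SELECT 0
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 15
)
SELECT GROUP_CONCAT(n ORDER BY n) AS common_ids
FROM nums
WHERE ((50 & 38) & (1 << n)) <> 0;   -- keep positions whose bit is set in both values
```

For these two values the query returns the string 1,5.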

I recommend doing the ORing to construct the number in client code. For example, PHP can use

$bits = 0;
foreach (explode(",", "1,4,5") as $b) {
    $bits |= (1 << (int)$b);
}     // results in $bits == 50

With other boolean operations, you could (for example) compute how many bits are on in one number that are not in the other.

I am currently developing a "wordle" game and using 26 bits of an INT UNSIGNED to record which letters are in a word. This leads to two simple boolean expressions to ask "are all of these letters in the word" and "are all of those letters not in the word". ord($letter) - ord('A') gives me 0..25 for A..Z.
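
Those two boolean expressions might look like the following in MySQL. This is a guess at the shape of such a query, not the author's actual code: the table words, the column letter_bits, and the sample words are all hypothetical, with A mapped to bit 0, B to bit 1, and so on.

```sql
-- Hypothetical table: letter_bits has bit k set when letter number k (A = 0) occurs.
CREATE TABLE words (
    word        VARCHAR(10)  NOT NULL,
    letter_bits INT UNSIGNED NOT NULL
);

INSERT INTO words (word, letter_bits) VALUES
    ('CAB', 1<<0 | 1<<1 | 1<<2),   -- A, B, C -> 7
    ('BED', 1<<1 | 1<<3 | 1<<4),   -- B, D, E -> 26
    ('ACE', 1<<0 | 1<<2 | 1<<4);   -- A, C, E -> 21

SET @required  = 1<<0 | 1<<2;      -- A and C must be in the word
SET @forbidden = 1<<1;             -- B must not be in the word

SELECT word
FROM words
WHERE (letter_bits & @required)  = @required   -- all required letters present
  AND (letter_bits & @forbidden) = 0;          -- no forbidden letter present
```

With these three sample rows, only ACE satisfies both conditions.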

Furthermore, by using

ORDER BY BIT_COUNT(col & 38) DESC
LIMIT 4

the SELECT will pick the best 4 matches for "1,2,5".
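
Applied to the data from the question, the bitmask idea could be sketched as follows. The table products_bits and its column tag_bits are assumed names, and the masks are written inline here only to keep the example self-contained; the answer recommends building them in client code.

```sql
-- Hypothetical table holding the original list next to its bitmask.
CREATE TABLE products_bits (
    ids      VARCHAR(255)      NOT NULL,
    tag_bits SMALLINT UNSIGNED NOT NULL
);

INSERT INTO products_bits (ids, tag_bits) VALUES
    ('1,2,3,4', 1<<1 | 1<<2 | 1<<3 | 1<<4),   -- 30
    ('1,4,5',   1<<1 | 1<<4 | 1<<5),          -- 50
    ('1,5',     1<<1 | 1<<5),                 -- 34
    ('2',       1<<2),                        -- 4
    ('2,6,9',   1<<2 | 1<<6 | 1<<9);          -- 580

SET @mask = 1<<1 | 1<<5;   -- 34, the search set "1,5"

SELECT ids,
       BIT_COUNT(tag_bits & @mask)  AS matches,  -- bits set in both values
       BIT_COUNT(tag_bits & ~@mask) AS extras    -- bits set in the row but not in the search set
FROM products_bits
ORDER BY matches DESC, extras, ids;
```

The extras column is one example of the "other boolean operations" mentioned above; using it as a tie-breaker puts 1,5 ahead of 1,4,5, and the full order for the sample rows is 1,5 / 1,4,5 / 1,2,3,4 / 2 / 2,6,9.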

huangapple
  • Published on August 8, 2023 at 23:15:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76860929.html