2023-08-08 23:15:10

Comma separated string similarity check in MySQL

Question

I have a table with a column that contains comma-separated IDs, like this:

ids 
----
1,2,3,4
1,4,5
1,5
2
2,6,9

Now I need to sort these rows by their similarity to a given string, measured by the number of common elements. For example, if the string is 1,5, the result I need is:

1,5 (exactly the same)
1,4,5 (has 1,5 but also has an extra number)
1,2,3,4 (has only 1)
2 (no common element, position does not matter)
2,6,9 (no common element, position does not matter)

My question is: does MySQL have a built-in function that produces the above result, or do I have to write a custom procedure? I tried the MATCH ... AGAINST syntax, but the results were not acceptable.

Note:

In fact, these numbers are tag IDs, and I want to find the most similar products.

Thank you in advance.

Answer 1

Score: 1

Storing comma-separated IDs is not the right approach, but I solved a similar problem in the past with the function below.

It adds one point to the score for each input ID that is found in the row (that worked for me; you might need to add some logic).

DELIMITER //
-- Returns how many IDs from input_string also appear in tags_string.
-- DETERMINISTIC is declared because servers with binary logging enabled
-- usually reject functions that declare no such characteristic.
CREATE FUNCTION similarity_score(input_string VARCHAR(255), tags_string VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
    DECLARE score INT DEFAULT 0;
    DECLARE tag VARCHAR(255);
    DECLARE remainder VARCHAR(255);

    SET remainder = input_string;

    -- Peel one ID at a time off the front of the input list.
    WHILE LENGTH(remainder) > 0 DO
        IF LOCATE(',', remainder) > 0 THEN
            SET tag = SUBSTRING(remainder, 1, LOCATE(',', remainder) - 1);
            SET remainder = SUBSTRING(remainder, LOCATE(',', remainder) + 1);
        ELSE
            SET tag = remainder;
            SET remainder = '';
        END IF;

        -- FIND_IN_SET compares whole list elements, so '1' does not match '15'.
        IF FIND_IN_SET(tag, tags_string) > 0 THEN
            SET score = score + 1;
        END IF;
    END WHILE;

    RETURN score;
END //
DELIMITER ;

-- Usage
SET @input_tags = '1,5';

-- Highest score first; among equal scores, shorter lists first.
SELECT ids
FROM your_table
ORDER BY similarity_score(@input_tags, ids) DESC, LENGTH(ids), ids;
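
To sanity-check the ordering this function is meant to produce, the same rule can be modeled outside the database. This is only a minimal Python sketch, not part of the original answer: `similarity_score` and `rank_rows` are hypothetical helper names, and it assumes the IDs carry no surrounding whitespace.

```python
def similarity_score(input_string: str, tags_string: str) -> int:
    """Count how many IDs from input_string also occur in tags_string."""
    row_tags = set(tags_string.split(","))
    return sum(1 for tag in input_string.split(",") if tag in row_tags)


def rank_rows(input_string: str, rows: list[str]) -> list[str]:
    """Mirror ORDER BY score DESC, LENGTH(ids), ids."""
    return sorted(
        rows,
        key=lambda ids: (-similarity_score(input_string, ids), len(ids), ids),
    )


# Sample rows taken from the question.
rows = ["1,2,3,4", "1,4,5", "1,5", "2", "2,6,9"]
ranked = rank_rows("1,5", rows)
print(ranked)
```

For the sample data this puts 1,5 first, then 1,4,5, then 1,2,3,4, followed by the two rows without a common element.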

Answer 2

Score: 1

To count the matching numbers in each row, you can do:

SET @compare = '1,5';
WITH cte AS (
  SELECT
      r AS m,
      a AS x,
      -- the a-th element of the comma-separated list
      SUBSTRING_INDEX(SUBSTRING_INDEX(ids, ',', a), ',', -1) AS s,
      ids
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY ids) AS r, ids FROM mytable) m
  -- positions 1..4: enough for the sample data, extend for longer lists
  CROSS JOIN (SELECT 1 AS a UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) b
  -- keep only positions that exist in this row (number of commas + 1)
  WHERE b.a <= LENGTH(m.ids) - LENGTH(REPLACE(m.ids, ',', '')) + 1
)
SELECT
  ids, COUNT(*) AS matches
FROM cte
-- FIND_IN_SET compares whole elements; LOCATE would also find '1' inside '15'
WHERE FIND_IN_SET(s, @compare) > 0
GROUP BY ids
;

Output:

ids	matches
1,2,3,4	1
1,4,5	2
1,5	2

See: DBFIDDLE

With a bit more work on this query, you should be able to rank 1,5 higher than 1,4,5.
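
One possible tie-breaker, which is my assumption and not something the answer specifies, is to prefer rows with fewer elements when the match counts are equal. A small Python sketch of that rule, with `rank_by_matches` as a hypothetical helper name:

```python
def rank_by_matches(compare: str, rows: list[str]) -> list[tuple[str, int]]:
    """Return (ids, matches) pairs, best match first."""
    wanted = set(compare.split(","))
    scored = []
    for ids in rows:
        elements = ids.split(",")
        matches = sum(1 for element in elements if element in wanted)
        if matches:  # like the query, skip rows without any match
            scored.append((ids, matches, len(elements)))
    # More matches first; among equal matches, fewer elements first.
    scored.sort(key=lambda item: (-item[1], item[2], item[0]))
    return [(ids, matches) for ids, matches, _ in scored]


result = rank_by_matches("1,5", ["1,2,3,4", "1,4,5", "1,5", "2", "2,6,9"])
print(result)
```

In the query itself, the element count is already available as LENGTH(ids) - LENGTH(REPLACE(ids, ',', '')) + 1, so the same rule could probably be expressed as an extra ORDER BY term.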

Answer 3

Score: 0

If the numbers are between 1 and 10 (or some other small range), use the bits in a SMALLINT UNSIGNED to represent the numbers. So, for "1,4,5", use the value

1<<1 | 1<<4 | 1<<5

That is: 2 | 16 | 32 = 50

Then, for "similarity", count the bits that the two values have in common. The set "1,2,5" is the number 38, so the number of bits in common is given by

SELECT BIT_COUNT(50 & 38);

which is 2 (namely "1,5"). You could deconstruct 50 & 38 = 34 = 2 + 32 to discover "1,5".
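
The deconstruction step can be sketched in client code as well. This Python snippet is only an illustration; `bit_count` and `common_ids` are hypothetical helper names, and bit n is assumed to stand for ID n, as in the example above.

```python
def bit_count(value: int) -> int:
    """Number of bits set in a non-negative integer."""
    return bin(value).count("1")


def common_ids(a: int, b: int) -> list[int]:
    """IDs whose bits are set in both masks, lowest first."""
    common = a & b
    return [bit for bit in range(common.bit_length()) if (common >> bit) & 1]


mask_145 = 50  # bits 1, 4 and 5
mask_125 = 38  # bits 1, 2 and 5
print(bit_count(mask_145 & mask_125), common_ids(mask_145, mask_125))
```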

I recommend doing the ORing to construct the number in client code. For example, PHP can use

$bits = 0;
foreach (explode(",", "1,4,5") as $b) {
    $bits |= (1 << (int)$b);   // turn on bit number $b
}     // results in $bits == 50

With other boolean operations, you could (for example) compute how many bits are on in one number that are not in the other.
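
As a rough sketch of such operations (Python here, with hypothetical helper names), AND-NOT gives the bits that are on in only the first value, and XOR gives the bits on which the two values differ:

```python
def only_in_first(a: int, b: int) -> int:
    """Count bits that are on in a but off in b."""
    return bin(a & ~b).count("1")


def differing(a: int, b: int) -> int:
    """Count bits that are on in exactly one of a and b."""
    return bin(a ^ b).count("1")


a, b = 50, 38  # "1,4,5" and "1,2,5"
print(only_in_first(a, b), only_in_first(b, a), differing(a, b))
```

MySQL has the same bitwise operators, so something like BIT_COUNT(a & ~b) should be the server-side equivalent of the first helper.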

I am currently developing a "wordle" game and using a 26-bit INT UNSIGNED to say which letters are in a word. This leads to two simple boolean expressions to ask "are all of these letters in the word" and "are all of those letters not in the word". ord($letter) - ord('A') gives me 0..25 for A..Z.
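
The answer does not show those two expressions, so the following Python sketch is a guess at their shape rather than the author's code; `letter_mask`, `contains_all`, and `contains_none` are hypothetical names.

```python
def letter_mask(letters: str) -> int:
    """Map A..Z to bits 0..25 and OR them together."""
    mask = 0
    for letter in letters.upper():
        mask |= 1 << (ord(letter) - ord("A"))
    return mask


def contains_all(word_mask: int, required: int) -> bool:
    """True if every required letter occurs in the word."""
    return (word_mask & required) == required


def contains_none(word_mask: int, forbidden: int) -> bool:
    """True if no forbidden letter occurs in the word."""
    return (word_mask & forbidden) == 0


word = letter_mask("CRANE")
print(contains_all(word, letter_mask("RN")), contains_none(word, letter_mask("XYZ")))
```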

Furthermore, by using

ORDER BY BIT_COUNT(col & 38) DESC
LIMIT 4

the SELECT will pick the best 4 matches for "1,2,5".
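
To see what that ordering would select, here is a small Python model. It assumes the five sample rows from the question have been stored as bitmasks; the variable names are made up for the illustration.

```python
def bit_count(value: int) -> int:
    """Number of bits set in a non-negative integer."""
    return bin(value).count("1")


# Sample rows from the question, encoded with bit n standing for ID n:
# "1,2,3,4" -> 30, "1,4,5" -> 50, "1,5" -> 34, "2" -> 4, "2,6,9" -> 580
rows = [30, 50, 34, 4, 580]
query = 38  # "1,2,5"

# Same idea as ORDER BY BIT_COUNT(col & 38) DESC LIMIT 4.
best = sorted(rows, key=lambda col: -bit_count(col & query))[:4]
print(best)
```

Python's sort is stable, so rows with equal counts keep their input order. MySQL makes no such promise for ties, so a second ORDER BY term would be needed for a deterministic result.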
