Comma separated string similarity check in MySQL
Question
I have a table with a column that contains comma-separated ids, like this:
ids
----
1,2,3,4
1,4,5
1,5
2
2,6,9
Now I need to sort these rows by their similarity (number of common elements) to a given string. For example, if the string is 1,5, the result I need is:
1,5 (exactly the same)
1,4,5 (has 1,5 but also has an extra number)
1,2,3,4 (has only 1)
2 (order does not matter)
2,6,9 (order does not matter)
My question is: does MySQL have a built-in function to achieve the above result, or do I have to write a custom procedure? I tried the MATCH ... AGAINST syntax, but the result was not acceptable.
Note:
In fact, these numbers are tag ids, and I want to find the most similar products.
Thank you in advance.
Answer 1
Score: 1
Storing comma-separated IDs is not the right approach, but I solved a similar problem in the past with the function below.
It scores each row by the number of matching IDs (that worked for me; you might need to add some logic):
DELIMITER //
CREATE FUNCTION similarity_score(input_string VARCHAR(255), tags_string VARCHAR(255))
RETURNS INT
DETERMINISTIC  -- same inputs always give the same score; MySQL can reject a function with no characteristic when binary logging is on
BEGIN
    DECLARE score INT DEFAULT 0;
    DECLARE tag VARCHAR(255);
    DECLARE remainder VARCHAR(255);
    SET remainder = input_string;
    -- Peel one tag at a time off the front of the input list
    WHILE LENGTH(remainder) > 0 DO
        IF LOCATE(',', remainder) > 0 THEN
            SET tag = SUBSTRING(remainder, 1, LOCATE(',', remainder) - 1);
            SET remainder = SUBSTRING(remainder, LOCATE(',', remainder) + 1);
        ELSE
            SET tag = remainder;
            SET remainder = '';
        END IF;
        -- FIND_IN_SET matches whole list elements, so '1' does not match '15'
        IF FIND_IN_SET(tag, tags_string) THEN
            SET score = score + 1;
        END IF;
    END WHILE;
    RETURN score;
END //
DELIMITER ;
-- Usage
SET @input_tags = '1,5';
SELECT ids
FROM your_table
ORDER BY similarity_score(@input_tags, ids) DESC, LENGTH(ids), ids;
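Since this answer notes that storing comma-separated IDs is not the right approach, here is a minimal sketch of what a normalized alternative could look like. The table name `product_tags`, its columns, and the product ids 1 to 5 are assumptions made for illustration (they map the five sample rows from the question onto one row per product/tag pair); they are not part of the original schema.

```sql
-- Hypothetical junction table: one row per (product, tag) pair.
CREATE TABLE product_tags (
    product_id INT NOT NULL,
    tag_id     INT NOT NULL,
    PRIMARY KEY (product_id, tag_id)
);

-- The sample rows from the question, one assumed product id per row.
INSERT INTO product_tags (product_id, tag_id) VALUES
    (1, 1), (1, 2), (1, 3), (1, 4),   -- '1,2,3,4'
    (2, 1), (2, 4), (2, 5),           -- '1,4,5'
    (3, 1), (3, 5),                   -- '1,5'
    (4, 2),                           -- '2'
    (5, 2), (5, 6), (5, 9);           -- '2,6,9'

-- Rank products by how many of the wanted tags (1 and 5) they carry;
-- among equal scores, products with fewer tags overall come first.
SELECT product_id,
       SUM(tag_id IN (1, 5)) AS matches,   -- a boolean is 0 or 1 in MySQL
       COUNT(*)              AS tag_count
FROM product_tags
GROUP BY product_id
ORDER BY matches DESC, tag_count, product_id;
```

With this layout no string parsing is needed, and the tie-break uses the real number of tags rather than the character length of the list.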
Answer 2
Score: 1
To find the count of matching numbers you can do:
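SET @compare = '1,5';
WITH cte AS (
    SELECT
        r AS m,
        a AS x,
        -- a-th element of the comma-separated list
        SUBSTRING_INDEX(SUBSTRING_INDEX(ids, ',', a), ',', -1) AS s,
        ids
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY ids) AS r, ids FROM mytable) m
    -- positions 1..4: extend this list if a row can hold more than 4 ids
    CROSS JOIN (SELECT 1 AS a UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) b
    WHERE b.a <= LENGTH(m.ids) - LENGTH(REPLACE(m.ids, ',', '')) + 1
)
SELECT
    ids, COUNT(*) AS matches
FROM cte
-- FIND_IN_SET compares whole elements; LOCATE would also match '1' inside '15'
WHERE FIND_IN_SET(s, @compare) <> 0
GROUP BY ids
;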
Output:
ids | matches |
---|---|
1,2,3,4 | 1 |
1,4,5 | 2 |
1,5 | 2 |
see: DBFIDDLE
With more time put into this, you should be able to give 1,5 a higher ranking than 1,4,5.
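One possible way to get that ranking, sketched under a few assumptions: MySQL 8.0 or later (for the CTE), the same `mytable` and `@compare` as above, and at most 4 ids per row, as in the query above. Instead of filtering out non-matching elements, it counts matches with a conditional sum and uses the total number of elements as a tie-breaker, so rows with no match are kept and sort last.

```sql
SET @compare = '1,5';

WITH cte AS (
    SELECT m.ids,
           -- a-th element of the comma-separated list
           SUBSTRING_INDEX(SUBSTRING_INDEX(m.ids, ',', b.a), ',', -1) AS s
    FROM mytable m
    CROSS JOIN (SELECT 1 AS a UNION ALL SELECT 2
                UNION ALL SELECT 3 UNION ALL SELECT 4) b
    WHERE b.a <= LENGTH(m.ids) - LENGTH(REPLACE(m.ids, ',', '')) + 1
)
SELECT ids,
       SUM(FIND_IN_SET(s, @compare) > 0) AS matches,    -- elements shared with @compare
       COUNT(*)                          AS tag_count   -- total elements in the row
FROM cte
GROUP BY ids
ORDER BY matches DESC, tag_count, ids;
```

For the sample data this orders the rows as 1,5 then 1,4,5 then 1,2,3,4 then 2 then 2,6,9: equal match counts are separated by how few extra elements a row has.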
Answer 3
Score: 0
If the numbers are between 1 and 10 (or some other small range), use the bits in a SMALLINT UNSIGNED to represent the numbers. So, for "1,4,5", use the value
1<<1 | 1<<4 | 1<<5
That is: 2 | 16 | 32 = 50
Then for "similarity", count the number of bits that are the same. The set "1,2,5" is the number 38, so the number of bits in common is
SELECT BIT_COUNT(50 & 38)
which is 2 (namely "1,5"). You could deconstruct 50 & 38 = 34 = 2 + 32 to discover "1,5".
I recommend doing the ORing to construct the number in client code. For example, PHP can use
$bits = 0;
foreach (explode(",", "1,4,5") as $b) {  // explode() needs the separator as its first argument
    $bits |= (1 << $b);
} // results in $bits == 50
With other boolean operations, you could (for example) compute how many bits are on in one number that are not in the other.
I am currently developing a "wordle" game and using a 26-bit INT UNSIGNED to say which letters are in a word. This leads to two simple boolean expressions to ask "are all of these letters in the word" and "are all of those letters not in the word". ord($letter) - ord('A') gives me 0..25 for A..Z.
Furthermore, by using
ORDER BY BIT_COUNT(col & 38) DESC
LIMIT 4
the SELECT will pick the best 4 matches for "1,2,5".
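As a rough illustration of the bitmask idea applied to the question's data, the sketch below stores one mask per row and ranks rows against the search string "1,5" (mask 2 | 32 = 34). The table `product_bits` and its column names are assumptions made for this example; the masks are computed in SQL here only to keep the sketch self-contained.

```sql
-- Hypothetical table: the original id list plus its bitmask (bit n set for id n).
CREATE TABLE product_bits (
    ids      VARCHAR(255)      NOT NULL,
    tag_bits SMALLINT UNSIGNED NOT NULL
);

INSERT INTO product_bits (ids, tag_bits) VALUES
    ('1,2,3,4', (1<<1) | (1<<2) | (1<<3) | (1<<4)),  -- 30
    ('1,4,5',   (1<<1) | (1<<4) | (1<<5)),           -- 50
    ('1,5',     (1<<1) | (1<<5)),                    -- 34
    ('2',       (1<<2)),                             -- 4
    ('2,6,9',   (1<<2) | (1<<6) | (1<<9));           -- 580

SET @query_bits = (1<<1) | (1<<5);  -- mask for "1,5", i.e. 34

-- Shared bits first; among equal scores, rows with fewer bits set come first.
SELECT ids,
       BIT_COUNT(tag_bits & @query_bits) AS matches,
       BIT_COUNT(tag_bits)               AS tag_count
FROM product_bits
ORDER BY matches DESC, tag_count, ids;
```

A SMALLINT UNSIGNED holds 16 bits, so this layout only works for ids 0 to 15; larger ranges would need a wider integer type or a different representation.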