"similarity()" query score is pretty low for search text almost matching
Question
select similarity('GTudH', 'tud'),
       similarity('GTudH', 'gtu'),
       similarity('GTudH', 'gdh');
I ran the query above on Postgres. Can someone explain why the first similarity score is only 0.1, while the third one is 0.25?
I ran the same test on Snowflake, where the results look reasonable:
JAROWINKLER_SIMILARITY('GTUDH','TUD') | JAROWINKLER_SIMILARITY('GTUDH','GTU') | JAROWINKLER_SIMILARITY('GTUDH','GDH') |
---|---|---|
86 | 90 | 51 |
Answer 1
Score: 1
That's because the functions are different: the Snowflake query uses JAROWINKLER_SIMILARITY, while PostgreSQL's similarity() function is part of the pg_trgm extension and calculates similarity on the basis of “trigrams”, that is, sequences of three letters that occur in the word. Although the actual algorithm is slightly different (see the source), two words are considered similar if they share many trigrams.
Let's look at the trigrams of GTudH:
SELECT show_trgm('GTudH');

            show_trgm
═════════════════════════════════
 {"  g"," gt","dh ",gtu,tud,udh}
(1 row)
pg_trgm lowercases the word, prepends two spaces and appends one, because similarity at the beginning of a word is counted as more significant.
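So GTudH shares one trigram with tud, three trigrams with gtu and two trigrams with gdh, which explains the different results. The score is essentially the number of shared trigrams divided by the number of distinct trigrams in both words combined: 1/9 ≈ 0.11 for tud, 3/7 ≈ 0.43 for gtu and 2/8 = 0.25 for gdh.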
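The trigram counting can be reproduced outside the database with a short Python sketch. This is only an approximation of what pg_trgm does, not its actual implementation: it assumes a single alphanumeric word, pads it with two leading spaces and one trailing space, and scores a pair as shared trigrams divided by distinct trigrams in both words. The function names `trigrams` and `similarity` are illustrative.

```python
def trigrams(word: str) -> set[str]:
    """Approximate pg_trgm's trigram set for a single alphanumeric word."""
    # Assumption: lowercase, two leading spaces, one trailing space.
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams across both words."""
    ta, tb = trigrams(a), trigrams(b)
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)


if __name__ == "__main__":
    for other in ("tud", "gtu", "gdh"):
        shared = sorted(trigrams("GTudH") & trigrams(other))
        print(other, shared, round(similarity("GTudH", other), 2))
    # tud ['tud'] 0.11
    # gtu ['  g', ' gt', 'gtu'] 0.43
    # gdh ['  g', 'dh '] 0.25
```

The match on tud scores low because it sits in the middle of the word, so none of the padded boundary trigrams line up, whereas gdh picks up both a leading and a trailing boundary trigram.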
“Similarity” is not a clear-cut concept, and there are many different ways to define it.