"similarity()" query score is pretty low for search text that almost matches

Question

select similarity('GTudH', 'tud'),
       similarity('GTudH', 'gtu'),
       similarity('GTudH', 'gdh')

I ran the query above in Postgres. Can someone explain why the first similarity score is only 0.1, but the third one is 0.25?

I tested the same strings in Snowflake, where the results look reasonable:

JAROWINKLER_SIMILARITY('GTUDH','TUD')  JAROWINKLER_SIMILARITY('GTUDH','GTU')  JAROWINKLER_SIMILARITY('GTUDH','GDH')
86                                     90                                     51

Answer 1

Score: 1


That's because the two functions measure different things. Snowflake's JAROWINKLER_SIMILARITY computes a Jaro-Winkler similarity, while PostgreSQL's similarity() function is part of the pg_trgm extension and calculates similarity on the basis of “trigrams”, sequences of three letters that occur in the word. Although the actual algorithm is slightly different (see the source), two words are considered similar if they share many trigrams.

Let's look at the trigrams of GTudH:

SELECT show_trgm('GTudH');

            show_trgm            
═════════════════════════════════
 {"  g"," gt","dh ",gtu,tud,udh}
(1 row)

Before extracting trigrams, pg_trgm lowercases the word, prepends two spaces and appends one. The extra leading space is there because similarity at the beginning of the word is counted as more significant.

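So GTudH shares one trigram with tud (tud), three trigrams with gtu ("  g", " gt", gtu) and two trigrams with gdh ("  g", "dh "), which explains the different results. Dividing the shared count by the number of distinct trigrams in both words combined gives 1/9 ≈ 0.11 for tud and 2/8 = 0.25 for gdh, which matches the scores in the question.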
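To check the counts by hand, here is a minimal Python sketch that imitates the trigram logic. It is not pg_trgm's implementation: the helper names trigrams and similarity are made up for illustration, it assumes a single word made only of letters (pg_trgm also splits text on non-alphanumeric characters), and it assumes the score is the number of shared trigrams divided by the number of distinct trigrams in both words.

```python
def trigrams(word: str) -> set:
    """Approximate show_trgm() for one purely alphabetic word."""
    # Lowercase, then pad with two leading spaces and one trailing space.
    padded = "  " + word.lower() + " "
    # Collect every run of three consecutive characters.
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams in both words (assumed formula)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


for candidate in ("tud", "gtu", "gdh"):
    shared = sorted(trigrams("GTudH") & trigrams(candidate))
    print(candidate, shared, round(similarity("GTudH", candidate), 3))
# tud shares 1 trigram  -> 0.111
# gtu shares 3 trigrams -> 0.429
# gdh shares 2 trigrams -> 0.25
```

Under these assumptions tud scores lowest even though it is a substring of GTudH, because it matches neither the start nor the end of the word and so gains nothing from the padded trigrams.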

“Similarity” is not a clear-cut concept, and there are many different ways to define it.

huangapple
  • Posted on May 22, 2023 at 11:38:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76302906.html