May 22, 2023, 11:38:11

"similarity()" query score is pretty low for search text almost matching

Question

select similarity('GTudH', 'tud'),
       similarity('GTudH', 'gtu'),
       similarity('GTudH', 'gdh')

The query above was run on Postgres. Can someone explain why the first similarity score is only 0.1, but the third one is 0.25?

I tested the same strings in Snowflake, where the results look fine:

JAROWINKLER_SIMILARITY('GTUDH','TUD')   JAROWINKLER_SIMILARITY('GTUDH','GTU')   JAROWINKLER_SIMILARITY('GTUDH','GDH')
86                                      90                                      51

Answer 1

Score: 1

That's because the two functions measure different things. Snowflake's JAROWINKLER_SIMILARITY() computes a Jaro-Winkler score, whereas PostgreSQL's similarity() function is part of the pg_trgm extension and calculates similarity on the basis of “trigrams”, that is, sequences of three letters that occur in the word. Although the actual algorithm is slightly different (see the source), two words are considered similar if they share many trigrams.

Let's look at the trigrams of GTudH:

SELECT show_trgm('GTudH');

            show_trgm            
═════════════════════════════════
 {"  g"," gt","dh ",gtu,tud,udh}
(1 row)

pg_trgm lowercases the input and pads each word with two spaces at the start and one at the end, so that a match at the beginning of a word counts for more than a match in the middle.

So GTudH shares one trigram with tud, three trigrams with gtu and two trigrams with gdh, which explains the different results.
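
To see where the numbers come from, here is a rough Python sketch of the counting. It is not pg_trgm's code: the helper names trigrams and similarity are made up for illustration, it only handles plain ASCII words, and it assumes the score is the number of shared trigrams divided by the number of distinct trigrams in either string, which is my reading of the source.

```python
import re


def trigrams(text):
    """Approximate pg_trgm's trigram set (assumption: ASCII input only).

    Each word is lowercased and padded with two leading spaces and one
    trailing space before the three-character windows are collected.
    """
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)


if __name__ == "__main__":
    for candidate in ("tud", "gtu", "gdh"):
        shared = sorted(trigrams("GTudH") & trigrams(candidate))
        print(candidate, shared, round(similarity("GTudH", candidate), 3))
```

This reports one shared trigram for tud (1/9, about 0.111), three for gtu (3/7, about 0.429) and two for gdh (2/8 = 0.25), which is consistent with the roughly 0.1 and the 0.25 from the question.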

“Similarity” is not a clear-cut concept, and there are many different ways to define it.
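
For comparison, below is a textbook Jaro-Winkler sketch in Python. Whether Snowflake implements exactly this is an assumption on my part: the case folding, the 0.7 boost threshold, the 0.1 prefix scaling and the truncation to an integer percentage are all guesses, and the function names are made up. With those choices it does reproduce the 86, 90 and 51 from the question.

```python
def jaro(s1, s2):
    """Plain Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1, boost_threshold=0.7):
    """Jaro score with a bonus for a common prefix of up to 4 characters.

    Assumption: comparison is case-insensitive and the prefix bonus is
    only applied when the Jaro score exceeds boost_threshold.
    """
    s1, s2 = s1.upper(), s2.upper()
    score = jaro(s1, s2)
    if score <= boost_threshold:
        return score
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


def percent(s1, s2):
    """Truncate to an integer percentage (assumed output format)."""
    return int(jaro_winkler(s1, s2) * 100)


if __name__ == "__main__":
    for candidate in ("tud", "gtu", "gdh"):
        print(candidate, percent("GTudH", candidate))
```

Jaro-Winkler rewards characters that sit in roughly the same positions, so tud in the middle of GTudH scores well. Trigram matching penalizes it instead, because the padded start and end trigrams of the two words do not match.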
