Best way to join two tables on columns containing similar data

Question

I am having trouble joining two tables together. The two columns hold similar data, but not exactly the same data.

Example:

Table 1: Column 1 = "Expect rain for todays weather"
Table 2: Column 2 = "Expect rain for todays weather and overcast clouds"

I have tried using the below to join but it is not 100% accurate:
...
FROM [Table1] as [one]
LEFT JOIN [Table2] as [two]
ON [one].[column1] LIKE '%' + [two].[column2] + '%'

Would fuzzy matching be the best way to get the most accurate matches? Or are there other methods of joining tables on columns that have similar but not identical data?

Any assistance and advice will be greatly appreciated!

Answer 1

Score: 3

It is difficult in the situation you describe if one of the values is not exactly a subset of the other.

I use a Damerau-Levenshtein distance function, but it is really meant to measure typographical differences. I use it to find typographical mistakes in street and suburb names, which are quite short strings. I don't know if the link is one I tried at one point, but I currently use one written for CLR, which is a lot faster. It is probably too slow on long strings if you have a lot of records anyway.
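For readers who have not seen the measure, here is a minimal Python sketch of the character-level distance. It is not the CLR function mentioned above. It implements the restricted variant usually called optimal string alignment, and the name osa_distance is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts the insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("Main Street", "Mian Street"))  # prints 1
```

The table has len(a) x len(b) cells, so the cost grows with the product of the two string lengths for every pair compared. That is why the approach suits short values such as street names better than long sentences across many records.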

Maybe consider using a Full Text Index and a CONTAINS query to find similar patterns of words.

Answer 2

Score: 0

It just occurred to me, though I have not investigated it, that a Damerau-Levenshtein distance could perhaps be calculated over whole words rather than individual characters.

The CONTAINS or FREETEXT (or better still CONTAINSTABLE or FREETEXTTABLE) query would still be used to narrow down the candidates. One benefit of these full-text indexing solutions is that they automatically index grammatical variants of words, such as plurals and different tenses. They also let you specify the proximity of multiple words in a phrase while ignoring insignificant words like 'the' and 'a'.
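The "narrow down the candidates first" idea can be sketched outside SQL Server as well. The Python below is only a stand-in for a full-text index, not the CONTAINSTABLE API: it builds a small word-to-row lookup, drops a few stop words, and returns rows that share enough significant words with the query. The stop-word list, the min_shared threshold and the sample rows are assumptions for illustration, and unlike real full-text search it does no stemming of plurals or tenses.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; a real full-text engine ships its own.
STOP_WORDS = {"the", "a", "an", "and", "for", "of", "on"}


def tokens(text):
    """Lower-case the text and return its set of significant words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_index(rows):
    """Map each significant word to the ids of the rows that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in tokens(text):
            index[word].add(row_id)
    return index


def candidates(query, index, min_shared=2):
    """Return (row_id, shared_word_count) pairs, best match first."""
    counts = defaultdict(int)
    for word in tokens(query):
        for row_id in index.get(word, ()):
            counts[row_id] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(row_id, n) for row_id, n in ranked if n >= min_shared]


# Hypothetical rows standing in for [Table2].[column2]
table2 = {
    1: "Expect rain for todays weather and overcast clouds",
    2: "Expect sunshine for the weekend",
    3: "Road closures on the highway",
}
index = build_index(table2)
print(candidates("Expect rain for todays weather", index))  # prints [(1, 4)]
```

Only the rows that survive this cheap filter would then go to the more expensive distance function, which keeps the number of pairwise comparisons manageable.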

I originally implemented the code for the Weighted Damerau-Levenshtein function in Access VBA. It is here and includes a link to the original Excel version it was based on. It isn't difficult to convert this code to VB.NET to use as a CLR function.

If it could be modified to work on words rather than characters, it would score differences in terms of word transpositions, insertions and omissions, just as the original function does for characters.
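As a rough illustration of that idea, and not a port of the weighted VBA function, the sketch below runs the same dynamic programme over lists of words instead of characters. It is unweighted, splits on whitespace only, and ignores case. The name word_osa_distance is invented for this example.

```python
def word_osa_distance(left: str, right: str) -> int:
    """Restricted Damerau-Levenshtein distance counted in whole words.

    Each inserted, omitted or replaced word costs 1, and so does
    swapping two adjacent words.
    """
    a = left.lower().split()
    b = right.lower().split()
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i words of a and the first j words of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # word omitted
                d[i][j - 1] + 1,         # word inserted
                d[i - 1][j - 1] + cost,  # word replaced or matched
            )
            # two adjacent words swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


short = "Expect rain for todays weather"
longer = "Expect rain for todays weather and overcast clouds"
print(word_osa_distance(short, longer))  # prints 3
```

On the example from the question the score is 3, one for each of the three extra words. Because sentences contain far fewer words than characters, the table is much smaller than in the character-level version, although a single misspelled word then counts as a full replacement.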

huangapple
  • Published on June 26, 2023 at 12:57:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76553620.html