英文:
String Similarity for all possible combination in Optimised fashion
问题
我在查找字符串相似性时遇到了问题。
场景:字符串由以下字段组成
名字,中间名和姓氏
我的任务是在A和B之间找到字符串相似性(两者具有相同的字段),但要确保考虑到所有可能性。
案例1:
假设字符串A的名字是:Rahul
中间名是:Kumar
姓氏是:空
而字符串B的名字是:Kumar
中间名是:空
姓氏是:Rahul
从外观上看,我们可以说这两个名字可能是相同的。但是当前的相似性算法给出的相似度约为71%。
案例2:
假设字符串B的名字是:Rahul
中间名是:空
姓氏是:K.
在这种情况下,相似度下降了,但是名字可能是相同的。
为了考虑到名字的所有可能组合并以优化的方式进行相似性计算,我应该怎么做?
例如,Rahul Rakesh Kumar可以写成:
Rahul Kumar
Rahul
Rahul K.
Kumar Rahul
Kumar Rakesh Rahul.
Rahul R. Kumar等等。
我尝试过使用Jaccard、Cosine、Jaro-Wrinklr相似性算法,但结果不令人满意。
注意:我必须基于姓名进行去重,所以我必须考虑到名字的所有可能性。
英文:
I am facing a problem while finding string similarity.
Scenario: The string which consisits of following fields
first_name, middle_name and last_name
What I have do is to find string similarity between A and B (both have same fields) but making sure all possibilities are considered.
Case 1:
say string A has first name is : Rahul
middle name is: Kumar
last name is : " "
And string B has first name : Kumar
middle name: " "
last name: Rahul
By seeing we can say both names might be same. But Current Similarity algorithms are giving around 71% similarity.
Case 2:
Say, string B has first name : Rahul
middle name: " "
last name: K.
In this case similarity %age falls down, but the name might be same.
What should I do to similarity ,considering all possibile combinations of first,middle and last name and in an optimised fashion?
E.g Rahul Rakesh Kumar can be written as
Rahul Kumar
Rahul
Rahul K.
Kumar Rahul
Kumar rakesh Rahul.
Rahul R. Kumar etc.
I had tried using Jaccard, Cosine, Jaro-Wrinklr similarity algorithms but result are not satisafactory.
Note: I have to find dedupe based on Names that's why I have to consider all possible of Names.
答案1
得分: 0
对于这种字符串相似性,我最近实现了一个算法,对于我的使用情况非常有效,与你的情况非常相似。这是博客文章链接:http://www.catalysoft.com/articles/strikeamatch.html
要比较两个字符串:
- 制作一个重叠字母对的列表。例如,"kumar" 给出的对是:"ku"、"um"、"ma"、"ar"。
- 假设第二个字符串是 "maria"。对是:"ma"、"ar"、"ri"、"ia"。
- 计算相同的对的数量:"ma" 和 "ar"。
- 相似性计算公式为:2 * 相同的对数 / (对数1 + 对数2) = 2*2/(4+4) = 0.5
这个算法的优点是它通过使用有序的字母对来考虑字符串的顺序。另一方面,它不考虑顺序,因为所有的对可以以任何顺序混合。这使得它非常适合我的特定用例,就像你的情况一样,我希望将 "abc def" 视为与 "def abc" 完全匹配。
你也可以修改这个算法。我使用它来在较长的文本中搜索一个短字符串。为此,我将使用相似度度量 相同的对数 / 对数1
,即搜索词中相同对的数量除以对数。如果一个单词出现在文本中,这将给我相似度为1.0,无论搜索文本有多长。上面描述的相似度在这种情况下会对较长的文本进行惩罚。
关于实现的另一个注意事项。你可以使它非常高效和健壮。我首先将字母表缩小为小写字母、数字和空格。这样有 26+10+1 = 37
个符号。我将 A
映射为 a
,ä
映射为 a
等等。现在可能的字母对数为 37*37 = 1369
。我通过使用一个数组 pairCounts [37*37]byte
来计算我的对数,其中对 aa
映射到 0
,对 ab
映射到 1
等等。所以如果我在字符串中找到对 ac
,ac
映射到索引 2
,我就说 pairCounts[2]++
。每个对可以以这种方式出现 256
次(我使用 byte
)。
通过规范化字母,使其不区分大小写,使用数组而不是完整的 map
,使其运行非常快速。
英文:
For this kind of string similarity I have recently implemented an algorithm that works great for my use-case, which is very similar to yours. Here is the blog post: http://www.catalysoft.com/articles/strikeamatch.html
To compare two strings:
1. Make a list of overlapping letter-pairs. "kumar" gives us pairs: "ku", "um", "ma", "ar".
2. Say the second string is "maria". Pairs: "ma", "ar", "ri", "ia".
3. Count how many pairs are the same: "ma" and "ar".
4. The similarity is computed as: 2 * same_pairs / (pairs_1 + pairs_2) = 2*2/(4+4) = 0.5
The great thing about this algorithm is that it considers the ordering of the strings by using letter-pairs, which are ordered. Then on the other hand it does not consider ordering, because all pairs can be mixed in any order. This makes it perfect for my specific use case, where similar to your situation, I want to treat "abc def" as exact match to "def abc".
You can also modify the algorithm. Part of my use-case is searching a short string in a longer text. For this I will use the similarity measure same_pairs / pairs_1
, number of same pairs over pair count in the search term. If a word appears in the text, this gives me similarity 1.0, no matter how long the search text is. The similarity described above will penalize longer texts in this case.
Another note on the implementation. You can make it really efficient and robust. I first reduced my alphabet to lower-case letters, digits and space. That is 26+10+1 = 37
symbols. I map A
to a
, ä
to a
etc. The number of possible letter pairs is now 37*37 = 1369
. The way I count my pairs is by using an array pairCounts [37*37]byte
where pair aa
maps to 0
, ab
maps to 1
etc. So if I find pair ac
in the string, ac
maps to index 2
, so I say pairCounts[2]++
. Each pair can appear 256
times (I use byte
) this way.
Normalizing the letters make it case-insensitive, using an array instead of a full-blown map
makes it run really fast.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论