JavaScript partial string fuzzy match?

Question


Imagine I need to compare two strings to one reference text:

Reference Text:

> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
> eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
> minim veniam, quis nostrud exercitation ullamco laboris nisi ut
> aliquip ex ea commodo consequat. Duis aute irure dolor in
> reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
> pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
> culpa qui officia deserunt mollit anim id est laborum.

Search Text 1:

> Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
> nisi ut aliquip ex ea commodo consequat.

(Matches one sentence in the reference exactly.)

Search Text 2:

> I use the LLM (Lawyer, Liar, or Manager) model to determine how to
> respond to user input based on their tone and word choice. If the
> user's tone and word choice indicate that they are expressing a legal
> concern, I will refer them to a lawyer. If the user's tone and word
> choice indicate that they are lying, I will call them out on it and
> encourage them to be honest. If the user's tone and word choice
> indicate that they are expressing a managerial concern, I will offer
> them guidance and support.

(Completely different from the reference.)

I need a method or a library that can confidently tell me that search text 1 is far more similar to the reference text than search text 2, since it matches one sentence of the reference exactly.

Unfortunately, most of the popular JavaScript string similarity libraries seem to fail to identify similarity when the two strings being compared have very different lengths. They all incorrectly conclude that search texts 1 and 2 are about equally similar to the reference:

    const stringSimilarity = require("string-similarity");
    const levenshtein = require("fast-levenshtein");
    const similarity = require("similarity");

    // ref, text1 and text2 hold the reference text and the two search texts above
    stringSimilarity.compareTwoStrings(ref, text1); // 0.3
    stringSimilarity.compareTwoStrings(ref, text2); // 0.359
    levenshtein.get(ref, text1); // 338
    levenshtein.get(ref, text2); // 379
    similarity(ref, text1); // 0.24
    similarity(ref, text2); // 0.24
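One likely reason for numbers like these: when one string is an exact substring of the other, the Levenshtein distance is simply the difference in length, so any score normalized by the longer string stays low no matter how good the partial match is. The sketch below illustrates this with a hand-written `levenshtein` helper (an illustration only, not the `fast-levenshtein` implementation) and short stand-in strings; `longText`, `sentence`, and the normalization by the longer length are assumptions made for the example.

```javascript
// Plain dynamic-programming Levenshtein distance (illustrative helper,
// not the fast-levenshtein implementation).
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical stand-ins for the reference text and an exact-sentence search text.
const longText = "Lorem ipsum dolor sit amet. Ut enim ad minim veniam. Duis aute irure dolor.";
const sentence = "Ut enim ad minim veniam.";

const distance = levenshtein(longText, sentence);
// Similarity score normalized by the longer of the two strings.
const normalized = 1 - distance / Math.max(longText.length, sentence.length);

console.log(distance === longText.length - sentence.length); // true
console.log(normalized.toFixed(2)); // low, even though the sentence matches exactly
```

In other words, the distance mostly measures how much longer the reference is, not how well the search text matches part of it.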

Is there a better library that can achieve partial string fuzzy match like above?

Answer 1

Score: 1


Without knowing all the requirements and edge cases you may want to handle, it is impossible to suggest a perfect solution. But if you are fine with a straightforward brute-force approach that compares the same words at similar positions, here is an option:

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-js -->

    const text1 = `Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.`;

    const text2 = `Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.`;

    const text3 = `I use the LLM (Lawyer, Liar, or Manager) model to determine how to respond to user input based on their tone and word choice. If the user's tone and word choice indicate that they are expressing a legal concern, I will refer them to a lawyer. If the user's tone and word choice indicate that they are lying, I will call them out on it and encourage them to be honest. If the user's tone and word choice indicate that they are expressing a managerial concern, I will offer them guidance and support.`;

    const text4 = `Ut bla bla enim garbage ad minim bla veniam, quis bla bla nostrud exercitation more garbage ullamco labori bla nisi ut aliquip ex bla ea commodo bla consequat.`;

    const compare = (a, b) => {
      // Replace everything that is not an ASCII letter or digit with a space,
      // then split into lowercase words and drop the empty strings.
      const ax = a.replace(/[^A-Za-z0-9]/g, ' ')
        .split(' ')
        .map(s => s.toLowerCase())
        .filter(s => s);
      const bx = b.replace(/[^A-Za-z0-9]/g, ' ')
        .split(' ')
        .map(s => s.toLowerCase())
        .filter(s => s);

      // Count matching words. Note that `ia` is also advanced inside the inner
      // loop, so words that match in the same order are counted in a single
      // pass over `bx`.
      let similar = 0;
      for (let ia = 0; ia < ax.length; ia ++) {
        for (let ib = 0; ib < bx.length; ib ++) {
          if (ax[ia] === bx[ib]) {
            ia ++;
            similar ++;
          }
        }
      }
      // Average of the match ratio relative to each text's word count.
      return similar
        ? (similar / ax.length + similar / bx.length) / 2
        : 0;
    };

    console.log(compare(text1, text2));
    console.log(compare(text1, text3));
    console.log(compare(text2, text3));
    console.log(compare(text2, text4));
    console.log(compare(text2, text2));

<!-- end snippet -->
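If the goal is specifically to detect that the search text appears (almost) verbatim somewhere inside a much longer reference, one possible variation is to normalize by the search text only: count how many of its word n-grams also occur in the reference. The following is a minimal sketch under that assumption, not a library API; `tokenize`, `wordNgrams`, `containment`, and the default `n = 3` are names and choices made up for this illustration, and the sample strings are shortened stand-ins for the texts in the question.

```javascript
// Lowercase the text and split it into ASCII alphanumeric words.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];

// List every run of n consecutive words, joined by single spaces.
const wordNgrams = (words, n) => {
  const grams = [];
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(' '));
  }
  return grams;
};

// Fraction of the search text's word n-grams that also occur in the reference.
// Only the search text's size is used for normalization, so a short exact
// quote from a long reference can still score 1.
const containment = (reference, search, n = 3) => {
  const refGrams = new Set(wordNgrams(tokenize(reference), n));
  const searchGrams = wordNgrams(tokenize(search), n);
  if (searchGrams.length === 0) return 0;
  const hits = searchGrams.filter((g) => refGrams.has(g)).length;
  return hits / searchGrams.length;
};

// Shortened, hypothetical stand-ins for the texts in the question.
const reference = 'Lorem ipsum dolor sit amet. Ut enim ad minim veniam, quis nostrud exercitation. Duis aute irure dolor.';

console.log(containment(reference, 'Ut enim ad minim veniam, quis nostrud exercitation.')); // 1
console.log(containment(reference, 'I will refer them to a lawyer.')); // 0
console.log(containment(reference, 'Ut enim garbage ad minim veniam, quis bla nostrud exercitation.')); // 0.25
```

Lowering `n` makes the score more tolerant of inserted or changed words, at the cost of more accidental matches between unrelated texts.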

huangapple
  • Published on May 26, 2023 at 10:17:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76337269.html