2023年5月25日 19:10:02go评论130阅读模式

英文:

Extract wildcard string (* asterisk only) from a set of strings/texts?

问题

Here's the translation of the text you provided:

有没有一种算法或者代码库（最好是NodeJS），可以从一组字符串或文本中提取通配符字符串。

例如，以下是输入的一组字符串（请原谅我的英语）：

apple pie is not too so good
apple pie is not so good
apple pie is too good

我应该能够仅提取一个带有星号*的通配符字符串，就像这样：

apple pie is * good

通配符中还有其他特殊字符，但我只想使用星号*。

可以这样说，我需要从输入字符串中提取正则表达式。现在是否有一个库或算法可以实现这个目标？

如果需要更多信息，请告诉我。

英文:

Is there an algorithm or code/lib (NodeJS preferably), to extract wildcard string from a set of strings or texts.

For example, following are the set of strings as input (forgive my english):

apple pie is not too so good
apple pie is not so good
apple pie is too good

I shall be able to extract a wildcard string with asterisk * only, like this:

apple pie is * good

There are other special characters in wildcard, but I want to use only * asterisk.

In a way we can say that I need to extract regex from the input strings. Now is there a library or algorithm to achieve this?

Please let me know if more information required.

答案1

得分: 3

I will provide translations for the non-code portions of the text:

"I'm going to make the assumption that you want to extract a single wildcard that will work for all your strings. If not, I don't know how to get started."

"我会假设您想要提取一个适用于所有字符串的通配符。如果不是这样，我不知道从哪里开始。"

"Also, for this first pass, I will correct your English, changing 'apple pie is not too so good' to 'apple pie is not very good'. The reason will be clear below."

"此外，对于这第一遍，我会更正您的英语，将'apple pie is not too so good'更正为'apple pie is not very good'。原因将在下面变得清晰。"

"To solve this, we can break it down to finding a common prefix to all the strings, and a common suffix to them all, and joining these together with our '*' wildcard character. It might look like this:"

"为了解决这个问题，我们可以将其分解为找到所有字符串的共同前缀和它们的共同后缀，然后使用我们的'*'通配符字符将它们连接在一起。它可能会看起来像这样："

"The extra 'o' next to the wildcard is unwanted. To fix that, we might make a more complex version of 'extractWildcard' that deals with spaces:"

"通配符旁边多出的'o'是不希望的。为了修复这个问题，我们可以制作一个处理空格的更复杂版本的'extractWildcard'："

"Here is a more efficient version of 'commonSuffix':"

"这是'commonSuffix'的更高效版本："

"I would stick with the simpler one unless this proved to be an actual bottleneck in an application, though."

"除非在应用程序中明确成为瓶颈，否则我会坚持使用更简单的版本。"

"Update 2: Simplicity"

"更新2：简单性"

"A comment said:"

"有评论说："

"I would like to argue this point."

"我想要辩论这一观点。"

"Update 3: White-space style"

"更新3：空格风格"

"A comment found that my white-space styling detracts from readability."

"有评论发现我的空格风格影响了可读性。"

"Just for comparison, here's the same code in a more traditional formatting:"

"仅供比较，这是相同的代码，使用更传统的格式化方式："

英文:

I'm going to make the assumption that you want to extract a single wildcard that will work for all your strings. If not, I don't know how to get started.

Also, for this first pass, I will correct your English, changing "apple pie is not too so good" to "apple pie is not very good". The reason will be clear below.

To solve this, we can break it down to finding a common prefix to all the strings, and a common suffix to them all, and joining these together with our * wildcard character. It might look like this:

const reverse = (s) =&gt; [...s] .reverse () .join (&#39;&#39;)
const commonPrefix = ([first, ...rest]) =&gt;
  first .slice (0, [... first] .findIndex ((c, i) =&gt; rest .some (s =&gt; s .at (i) !== c)))
const commonSuffix = (ss) =&gt;
  reverse (commonPrefix (ss .map (reverse)))
const extractWildcard = (str) =&gt; 
  commonPrefix (str) + &#39;*&#39; + commonSuffix (str)
console .log (extractWildcard ([
  &#39;apple pie is not very good&#39;, 
  &#39;apple pie is not so good&#39;, 
  &#39;apple pie is too good&#39;
])) //=&gt; &#39;apple pie is * good&#39;

Our commonPrefix function finds the first index at which some string is different from the first one, then slices the string up to that point. commonSuffix uses a string reverse helper function and reverses all the strings, finds their common prefix and reverses that value. (There are many other ways of doing this, some of them much more efficient, but they involve calculating the longest string to start, and are all more complex. If there was a performance problem here, I would look to use one of them instead.)

But there is a problem here, and it's shown by using your original "apple pie is not too so good":

extractWildcard ([
  &#39;apple pie is not too so good&#39;, 
  &#39;apple pie is not so good&#39;, 
  &#39;apple pie is too good&#39;
]) //=&gt; &quot;apple pie is *o good&quot;

The extra o next to the wildcard is unwanted. To fix that, we might make a more complex version of extractWildcard that deals with spaces:

const extractWildcard2 = (s, pre = commonPrefix (s), suf = commonSuffix (s)) =&gt; 
   pre .slice (0, pre .lastIndexOf (&#39; &#39;)) + &#39; * &#39; + suf .slice (suf .lastIndexOf (&#39; &#39;) + 1)
extractWildcard2 ([
  &#39;apple pie is not too so good&#39;, 
  &#39;apple pie is not so good&#39;, 
  &#39;apple pie is too good&#39;
]) //=&gt; &quot;apple pie is * good&quot;

Update

Here is a more efficient version of commonSuffix:

const commonSuffix = (ss, length = Math .max (...ss .map (s =&gt; s .length))) =&gt;
  (ss [0] || &#39;&#39;) .slice (length - Array .from ({length}, (_, i) =&gt; i + 1) .findIndex (
    (i) =&gt; console .log ({i}) || ss .some (s =&gt; s .at (-i) !== ss [0] .at (-i))
  ))

I would stick with the simpler one unless this proved to be an actual bottleneck in an application, though.

Update 2: Simplicity

(This is not about the actual problem, but a discussion on code simplicity; please feel free to ignore it.)

A comment said:

> It is pretty much unreadable code even for me [...]. simpler, more readable algorithms do exist for common prefix.

I would like to argue this point. Rich Hickey has a famous talk, Simple Made Easy, which distinguishes "simple" from "easy". Simple is a fairly objective measure about how few concepts are twined together to provide a solution. Easy is a subjective measure of how familiar the solution seems to a particular person.

By writing declarative code, by avoiding techniques like mutable state, by using expressions rather than statements, and by avoiding temporal effects of variable assignment, I'd argue that this code is fairly simple. It may not be familiar, and therefore may be less easy for some users.

I wrote the code above as it stands; I didn't start somewhere and whittle it down to a minimum. But we can try that. We can start with the most imperative -- and to some, most familiar -- style of code:

const commonPrefix = (strings) =&gt; {
  const firstString = strings [0];
  const first = firstString .split (&#39;&#39;);
  const rest = strings .slice (1);
  let index = -1;
  let indexFound = false;
  for (let i = 0; i &lt; first .length; i ++) {
    const character = first [i];
    for (let j = 0; j &lt; rest .length; j ++) {
       const testString = rest [j];
       const testCharacter = testString [i]
       if (testCharacter !== character) {
         index = i;
         indexFound = true;
         break;
       }
    }
    if (indexFound) {
      break;
    }
  }
  const characters = first .slice (0, index); 
  const result = characters .join (&#39;&#39;);
  return result;
}

This may be more familiar/easy to some. But let's look at the implementation. We have twelve local variables. We use a pair of nested loops, each including an if statement. We perform early escapes in both loops in certain circumstances.

If we do a slight bit of cleanup, and recognize that the outer loop is just doing the job that Array.prototype.findIndex already does for us, then we can write what I would argue is a simpler version:

const commonPrefix = (strings) =&gt; {
  const firstString = strings [0]
  const first = firstString .split (&#39;&#39;)
  const rest = strings .slice (1)
  const index = first .findIndex ((c, i) =&gt; {
    for (let j = 0; j &lt; rest .length; j ++) {
       const testString = rest [j]
       if (testString [i] !== c) {
         return true
       }
    }
    return false
  })
  return first .slice (0, index) .join (&#39;&#39;)
}

We've also eliminated some local variables: testCharacter, characters, and result, with no detriment in readability.

Once again, we should be able to notice that the remaining loop is performing the same job as Array.prototype.some. We can clean that up to leave:

const commonPrefix = (strings) =&gt; {
  const first = strings [0] .split (&#39;&#39;)
  const rest = strings .slice (1)
  const index = first .findIndex ((c, i) =&gt; rest .some (s =&gt; s .at (i) !== c))
  return first .slice (0, index) .join (&#39;&#39;)
}

We're now approaching what I think of as a simple function. But we can notice that index is defined in one line and used only once, in the very next line, so if we choose, we can simply fold it in, and remove the local variable. Similarly, the parameter strings is only used to define first and rest; we can achieve the same with parameter destructuring, and skip these definitions. At this point, we get back to my initial version:

const commonPrefix = ([first, ...rest]) =&gt;
  first .slice (0, [... first] .findIndex ((c, i) =&gt; rest .some (s =&gt; s .at (i) !== c)))

Note that this version doesn't use first .split (''), but instead [... first]. Either will work; and either seems equally simple. This code uses s .at (i) instead of s [i], since an earlier verion had a parallel commonSuffix where the at was much easier to use. But we could change that; either choice seems fine.

In all, this code is much simpler than the longest version. It's not the length itself which is important. But a common side-effect of simpler code is shorter code.

Update 3: White-space style

A comment found that my white-space styling detracts from readability. Just for comparison, here's the same code in a more traditional formatting:

const reverse = (s) =&gt; [...s].reverse().join(&#39;&#39;)
const commonPrefix = ([first, ...rest]) =&gt;
  first.slice(0, [... first].findIndex((c, i) =&gt; rest.some(s =&gt; s.at(i) !== c)))
const commonSuffix = (ss) =&gt;
  reverse(commonPrefix(ss.map(reverse)))
const extractWildcard = (str) =&gt; 
  commonPrefix(str) + &#39;*&#39; + commonSuffix(str)
console.log(extractWildcard([
  &#39;apple pie is not very good&#39;, 
  &#39;apple pie is not so good&#39;, 
  &#39;apple pie is too good&#39;
])) //=&gt; &#39;apple pie is * good&#39;

I find that over-dense, but I know many others prefer it.

答案2

得分: 3

假设输入的单词被视为原子且不应拆分。否则，可以保留更多原始字符的解决方案。

我将假设这些单词是由空格分隔而不是其他字符，因此第一步是将输入短语拆分为包含它们的单词的数组。

问题结构

您可以将其视为图问题，其中节点是已从每个短语处理的单词的组合状态，或者换句话说：这些短语中当前单词索引的元组。边是有向的、加权的，表示从一个状态到另一个状态的“成本”。成本包括每个通配符使用的单词的点数。如果有多个成本相同的解决方案，则优先选择具有较少通配符的解决方案。因此，我们可以为引入的每个通配符计算0.5成本。

例如：如果输入中的单词都不同，那么解决方案是“*”，这个解决方案的成本是单词的总数加上单个通配符的0.5。

另一个例子是当所有输入短语完全相同时，如4次“这是一个短语”，那么该短语也将是输出。该解决方案的成本为0，因为没有通过使用通配符省略任何单词。

最佳搜索算法

然后在此图上执行Dijkstra算法以找到解决方案，即所有单词索引都位于其相应短语的末尾的结束状态。

我们可以通过关注那些不需要连续使用两个通配符的搜索路径来消除一些搜索路径。因此，我们应该以这样一种方式增加索引，以便在使用通配符后，由更新后的索引引用的单词都是相同的。如果我们决定字面使用短语的当前单词（而不是通配符），我们应该在其他字符串中找到相同的单词（从它们的当前索引开始）。为了处理当这不可能时，或者仍然导致很高成本的情况，我们还应该考虑跳过所有当前单词的情况（所有索引增加1），使用通配符。

优先级队列

由于JavaScript没有本地优先级队列，我重新使用了来自this answer的堆实现。

实现

这是上述算法的实现：

//（略，代码太长）

此算法的最坏时间复杂度不佳，因此如果您提供更多和更长的短语，并有许多共同单词，它将变得太慢。

英文:

I'll assume that the words of the input are considered atomic and should not be broken up. Otherwise a solution for your example input could have kept more original characters in the output:

apple pie is *o*o good

I'll then also assume that these words are separated by spaces and not by other characters, and so a first step would be to split the input phrases into arrays with their words as elements.

Structure of the problem

You could see this as a graph problem, where the nodes are the combined state of which words have already been processed from each phrase, or otherwise put: the tuple of current word-indicies in those phrases. The edges are directed and weighted and represent the "cost" of progressing from one state to another. The cost includes a point for every word that is consumed by a wildcard. If there are multiple solutions with the same cost, then prefer the one(s) that have fewer wildcards. So we could account 0.5 cost for every wildcard that is introduced.

For example: if the words in the input are all different, then the solution is "*", and the cost of this solution is the total count of words plus 0.5 for the single wildcard.

Another example is where all input phrases are exactly the same, like 4 times "this is a phrase", and so that phrase will also be the output. The cost of that solution is 0, as no words were omitted by a using wildcard(s).

Best Search algorithm

Then perform Dijkstra's algorithm on this graph to find a solution, i.e. the end state where all word-indices are at the end of their corresponding phrases.

We can eliminate some search paths by focusing on those where no two consecutive wildcards will be needed. So we should aim to increment the indices in such a way that after a wildcard has been used, the words referenced by the updated indices are all the same. If we decide to use a phrase's current word literally (and not as wildcard), we should find the same word in the other strings (from their current index onwards). To cope for when this is not possible, or still leads to a high cost, we should also consider the scenario where all current words are skipped (all indices increment by 1), using a wildcard.

Priority Queue

As JavaScript has no native priority queue, I have reused the heap implementation from this answer.

Implementation

Here is the implementation of the above described algorithm:

// From https://stackoverflow.com/a/66511107/5459839
const MinHeap={siftDown(h,i=0,v=h[i]){if(i&lt;h.length){let k=v[0];while(1){let j=i*2+1;if(j+1&lt;h.length&amp;&amp;h[j][0]&gt;h[j+1][0])j++;if(j&gt;=h.length||k&lt;=h[j][0])break;h[i]=h[j];i=j;}h[i]=v}},heapify(h){for(let i=h.length&gt;&gt;1;i--;)this.siftDown(h,i);return h},pop(h){return this.exchange(h,h.pop())},exchange(h,v){if(!h.length)return v;let w=h[0];this.siftDown(h,0,v);return w},push(h,v){let k=v[0],i=h.length,j;while((j=(i-1)&gt;&gt;1)&gt;=0&amp;&amp;k&lt;h[j][0]){h[i]=h[j];i=j}h[i]=v;return h}};
function commonPatternFor(strings) {
    const makeKey = JSON.stringify;
    // When there is a tie in the number of removed words, the number of wildcards is minimised:
    const WILDCARD_COST = 0.5; 
    
    // Words are considered atomic:
    const phrases = strings.map(s =&gt; s.split(&quot; &quot;));
    // The end state to achieve:
    const end = phrases.map(s =&gt; s.length);
    // Heap has entries with [cost, current indexes, previous indexes, isLastStepWildCard]  
    const heap = [ [0, phrases.map(() =&gt; 0), null, WILDCARD_COST] ];
    // Register the optimal paths as linked list
    const comeFrom = new Map;
    
    
    while (true) { // Best-first Loop
        // Extract best state so far
        let [cost, curr, prev, wildcardCost] = MinHeap.pop(heap);
        const currKey = makeKey(curr);
        if (comeFrom.has(currKey)) continue; // Already visited this state
        comeFrom.set(currKey, prev);
        if (phrases.some((words, i) =&gt; curr[i] === words.length)) {
            // Some phrases have been fully accounted for. Count number of remaining words:
            const diff = phrases.reduce((acc, words, i) =&gt; acc + words.length - curr[i], 0);
            if (!diff) { // BINGO!
                // unwind solution
                const sol = [];
                while (true) {
                    const prev = comeFrom.get(makeKey(curr));
                    if (!prev) return sol.reverse().flat().join(&quot; &quot;);
                    const sliceArr = phrases[0].slice(prev[0], curr[0]);
                    const slice = sliceArr.join(&quot; &quot;);
                    if (phrases.every((words, i) =&gt; words.slice(prev[i], curr[i]).join(&quot; &quot;) === slice)) {
                        sol.push(sliceArr);
                    } else if (sol.at(-1) != &quot;*&quot;) {
                        sol.push(&quot;*&quot;);
                    }
                    curr = prev;
                }
            }
            // As some phrases have no more remaining words, we use a single wildcard for
            //    covering the remaining words in the other phrases
            MinHeap.push(heap, [cost + diff + wildcardCost, end, curr, 0]); 
        } else {
            // See if there are common words in all phrases starting at their current positions
            let k = 0;
            while (true) {
                const word = phrases[0][curr[0]+k];
                if (word === undefined || !phrases.every((words, i) =&gt; words[curr[i]+k] === word)) break;
                k++;
            }
            if (k) { // Yes, all phrases have k common words starting at their current positions
                MinHeap.push(heap, [cost, curr.map(i =&gt; i + k), curr, WILDCARD_COST]); 
                // Also consider alternative scenarios where we introduce a wildcard
                phrases.forEach((words, i) =&gt; {
                    const k = words.indexOf(words[curr[i]], curr[i] + 1);
                    if (k &lt; 0) return;
                    const diff = k - curr[i];
                    const next = [...curr];
                    next[i] = k;
                    MinHeap.push(heap, [cost + diff + wildcardCost, next, curr, 0]); 
                });
            } else { // Not all current words are the same: we must use a wildcard
                phrases.forEach((words, i) =&gt; {
                    const word = words[curr[i]]; // Try to find this word in the other phrases
                    let diff = 0;
                    const next = phrases.map((words2, j) =&gt; {
                        let k = words2.indexOf(word, curr[j]);
                        diff += k - curr[j];
                        return k;
                    });
                    if (!next.includes(-1) &amp;&amp; diff) {
                        MinHeap.push(heap, [cost + diff + wildcardCost, next, curr, 0]); 
                    }
                });
                // Also consider the scenario where a wildcard is used to cover for ALL current words
                MinHeap.push(heap, [cost + phrases.length + wildcardCost, curr.map(i =&gt; i + 1), curr, 0]); 
            }
        }
    }
}
const tests = [
    [
        &quot;apple pie is not very good&quot;, 
        &quot;apple pie is not so good&quot;, 
        &quot;apple pie is too good&quot;
    ], [
        &quot;apple pie is not very good and for me that is fine&quot;, 
        &quot;the apple pie is not as good as they told me&quot;, 
        &quot;apple pie is too good for me and you&quot;
    ], [
        &quot;this is such a very common old phrase this is a common phrase&quot;, 
        &quot;this is a common phrase while this is not common&quot;, 
        &quot;this is a common phrase&quot;,
    ], [
        &quot;a b c d e f g h&quot;,
        &quot;i j k l m n o p&quot;,
        &quot;q r s t u v w x&quot;,
        &quot;y z a b c d e f&quot;,
        &quot;g h i j k l m n&quot;,
        &quot;o p q r s t u v&quot;,
        &quot;w x y z a b c d&quot;,
    ], [
        &quot;a a a b a a b b a a b a&quot;,
        &quot;a a a a a b a b a a b b&quot;,
        &quot;a a b a a a b a b a a b&quot;
    ]
];
for (const test of tests) {
    const sol = commonPatternFor(test);
    console.log(sol);
}

This algorithm does not have a good worst-case time complexity, so if you feed it with more and longer phrases, with lots of common words, it will become too slow.

答案3

得分: 0

以下是您要翻译的内容：

这比这里提出的其他建议更简单，也不如其他建议稳健，但如果您只想提取星号位置的短语，这样做就可以：

var exp = '苹果派是*好的';
var input = `苹果派不太好
苹果派不太好
苹果派太好`;
function matchAsterisk(exp, input) {
    return input
        .replace(/\n/g, '')
        .split(new RegExp(exp.split('*').join('|'), 'g'))
        .filter(i => i);
}
console.log(matchAsterisk(exp, input))

英文:

This is much simpler, and less robust than other suggested solutions here, but if you only want to extract the phrases in the asterisk place, this will do:

var exp = &#39;apple pie is * good&#39;
var input = `apple pie is not too so good
apple pie is not so good
apple pie is too good`;
function matchAsterisk(exp, input){
    return input
        .replace(/\n/g, &#39;&#39;)
        .split(new RegExp(exp.split(&#39;*&#39;).join(&#39;|&#39;), &#39;g&#39;))
        .filter(i =&gt; i);
}
console.log(matchAsterisk(exp, input))

答案4

得分: 0

在Python中

import re
def find_values_between(text, start, end):
    pattern = f"{re.escape(start)}(.*?){re.escape(end)}"
    matches = re.findall(pattern, text, re.DOTALL)
    return matches
sentence = """
apple pie is amazingly good
apple pie is not good
apple pie is disgustingly good
extra line
"""
values = find_values_between(sentence, "apple pie is ", " good")
for value in values:
    print(value.strip())

输出:

amazingly
not
disgustingly

英文:

In python

import re
def find_values_between(text, start, end):
    pattern = f&quot;{re.escape(start)}(.*?){re.escape(end)}&quot;
    matches = re.findall(pattern, text, re.DOTALL)
    return matches
sentence = &quot;&quot;&quot;
apple pie is amazingly good
apple pie is not good
apple pie is disgustingly good
extra line
&quot;&quot;&quot;
values = find_values_between(sentence, &quot;apple pie is &quot;, &quot; good&quot;)
for value in values:
    print(value.strip())

Output:

amazingly
not
disgustingly

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

问题

答案1

Update

Update 2: Simplicity

Update 3: White-space style

答案2

问题结构

最佳搜索算法

优先级队列

实现

Structure of the problem

Best Search algorithm

Priority Queue

Implementation

答案3

答案4

发表评论