找到最长子字符串

huangapple go评论56阅读模式
英文:

Find longest substring

问题

给定一个字符串和单词列表,我想要找到最长的子字符串,使得它不包含在提供的单词列表中。

约束条件:

  • 字符串的长度为1到10^5,没有空格。
  • 单词列表的大小为1到10,没有空格,每个单词的长度在1到10的范围内。

示例:

s = "helloworld"
words = ["wor", "rld"]

解决方案:

有效的最长子字符串是: "hellowo",在这个子字符串中没有包含"wor"和"rld"。结果为7。

以下是我的代码,对于较小的输入效果良好:

int solve(String s, List<String> list) {
    int answer = 0, j = 0;
    for (int i = 0; i <= s.length(); i++) {
        String sub = s.substring(j, i);
        for (String e : list) {
            if (sub.contains(e)) {
                j = j + 1;
                sub = s.substring(j, i);
            }
        }
        answer = Math.max(answer, i - j);
    }
    return answer;
}

这段代码的时间复杂度估计为O(m*n^2),其中n是输入字符串的大小,m是列表的大小。

如何减少时间复杂度并使用更好的方法解决这个问题?

编辑: 根据meriton的解释,更新了时间复杂度。

英文:

Given a string and list of words, I want to find the longest substring such that it does not have any word present in the provided list of words.

Constraints:

The length of the string is 1 to 10^5, with no spaces
The size of the words list is 1 to 10, with no spaces, and each word length is in the range of 1 to 10.

Example:

s = &quot;helloworld&quot;
words = [&quot;wor&quot;, &quot;rld&quot;]

Solution :

Valid longest substring: hellowo , here we don&#39;t have &quot;wor&quot;, &quot;rld&quot; in this substring. 
Result = 7

Here is my code, which works fine for smaller inputs:

int solve(String s, List&lt;String&gt; list) {
	int answer = 0, j=0;
	for(int i = 0 ; i &lt;= s.length(); i++) {
		String sub = s.substring(j, i);
		for(String e : list) {
			if(sub.contains(e)){
				j = j+1;
				sub = s.substring(j, i);
			}
		}
		answer = Math.max(answer, i-j);
	}
	return answer;
}

The time complexity for this code is I assume O(m*n^2) where n is the size of the input string and m is the size of the list.

How to reduce time complexity and solve this with a better approach.

Edit: Updated the time complexity based on meriton explanation.

答案1

得分: 1

你的复杂性分析是错误的,因为 substringcontains 不是恒定时间操作,而是取决于涉及的字符串长度。

假设输入字符串不包含任何搜索词。那么你将会做长度为1、2、3、...、n的子字符串,并在每个子字符串上调用 contains 操作 m 次,总复杂度为 O(mn^2)...

要获得实际的 O(mn) 算法,我会这样做:

var answer = 0;
var j = 0;
for (int i = 0; i < s.length; i++) {
    while (endsWith(s, j, i, words)) j++;
    answer = Math.max(answer, i - j);
}

其中

/** @return 是否 s[begin..end] 以 words 中的一个词结尾 */
boolean endsWith(String s, int begin, int end, List<String> words) {
    for (var word : words) {
        if (endsWith(s, begin, end, word)) {
            return true;
        }
    }
    return false;
}

boolean endsWith(String s, int begin, int end, String word) {
    var offset = end - word.length;
    if (offset < begin) return false;
    for (int i = 0; i < word.length; i++) {
        if (s.charAt(offset + i) != word.charAt(i)) return false;
    }
    return true;
}

(这段代码未经测试,可能需要修复一些小错误,但总体思路应该是正确的)

你能分享一下这个解决方案的时间复杂度吗?我仍然认为它是 O(m*n^2)。

endsWithO(m),并且最多被调用 2n 次,因为每次调用会增加 ij,而每个都最多增加 n 次。

英文:

Your complexity analysis is wrong, because substring and contains are not constant time operations, but depend upon the length of the strings involved.

Suppose the input string doesn't contain any search word. Then, you'll do substrings of length 1, 2, 3, ..., n, and call contains on each of them m times, for a total complexity of O(mn^2) ...

To get an actual O(mn) algorithm, I'd do something like this:

var answer = 0;
var j = 0;
for (int i = 0; i &lt; s.length; i++) {
    while (endsWith(s, j, i, words) j++;
    answer = Math.max(answer, i - j);
}

where

/** @return whether s[begin..end] ends with a word in words */
boolean endsWith(String s, int begin, int end, List&lt;String&gt; words) {
    for (var word : words) {
        if (endsWith(s, begin, end, word)) {
            return true;
        }
    }
}

boolean endsWith(String s, int begin, int end, String word) {
    var offset = end - word.length;
    if (offset &lt; begin) return false;
    for (int i = 0; i &lt; word.length; i++) {
        if (s.charAt(offset + i) != word.charAt(i)) return false;
    }
    return true;
}

(this code is untested, you may have to fix minor bugs, but the general idea should be solid)

> Can you please share time complexity of this solution, I still see it as O(m*n^2)

endsWith is O(m), and invoked at most 2n times, because each invocation increases i or j, each of which is increased at most n times.

答案2

得分: 1

我相信可以使用O(N * max_word_length)(至少在单词数量上没有多项式依赖)的时间复杂度来解决,使用trie数据结构:

public class SO76451768 {

    public int solve(String s, List<String> list) {
        Trie revert = new Trie(true);
        list.forEach(revert::add);

        char[] chars = s.toCharArray();

        int ans = 0;
        int start = 0;
        int end = 1;

        while (end <= chars.length) {
            int r = revert.matchLen(chars, start, end);
            if (r == 0) {
                ans = Math.max(ans, end - start);
                end++;
            } else {
                start = Math.max(start + 1, end - r + 1);
                end = start + 1;
            }
        }

        return ans;
    }

    static class Trie {

        final boolean revert;

        Trie[] children = new Trie[26];

        boolean eow;

        public Trie(boolean revert) {
            this.revert = revert;
        }

        int matchLen(char[] c, int from, int to) {
            Trie child;
            Trie[] arr = children;
            for (int i = 0; i < to - from; i++) {
                child = arr[c[revert ? to - i - 1 : i + from] - 'a'];
                if (child == null) {
                    return 0;
                }
                if (child.eow) {
                    return i + 1;
                }
                arr = child.children;
            }
            return 0;
        }

        void add(String key) {
            Trie child = null;
            Trie[] arr = children;
            char[] chars = key.toCharArray();
            for (int i = 0; i < chars.length; i++) {
                char c = chars[revert ? chars.length - i - 1 : i];
                child = arr[c - 'a'];
                if (child == null) {
                    child = new Trie(revert);
                    arr[c - 'a'] = child;
                }
                arr = child.children;
            }
            child.eow = true;
        }
    }

}
英文:

I believe it is solvable with O(N * max_word_length) (at least there no polynomial dependency on word count) TC using trie data structure:

public class SO76451768 {

    public int solve(String s, List&lt;String&gt; list) {
        Trie revert = new Trie(true);
        list.forEach(revert::add);

        char[] chars = s.toCharArray();

        int ans = 0;
        int start = 0;
        int end = 1;

        while (end &lt;= chars.length) {
            int r = revert.matchLen(chars, start, end);
            if (r == 0) {
                ans = Math.max(ans, end - start);
                end++;
            } else {
                start = Math.max(start + 1, end - r + 1);
                end = start + 1;
            }
        }


        return ans;
    }


    static class Trie {

        final boolean revert;

        Trie[] children = new Trie[26];

        boolean eow;

        public Trie(boolean revert) {
            this.revert = revert;
        }

        int matchLen(char[] c, int from, int to) {
            Trie child;
            Trie[] arr = children;
            for (int i = 0; i &lt; to - from; i++) {
                child = arr[c[revert ? to - i - 1 : i + from] - &#39;a&#39;];
                if (child == null) {
                    return 0;
                }
                if (child.eow) {
                    return i + 1;
                }
                arr = child.children;
            }
            return 0;
        }

        void add(String key) {
            Trie child = null;
            Trie[] arr = children;
            char[] chars = key.toCharArray();
            for (int i = 0; i &lt; chars.length; i++) {
                char c = chars[revert ? chars.length - i - 1 : i];
                child = arr[c - &#39;a&#39;];
                if (child == null) {
                    child = new Trie(revert);
                    arr[c - &#39;a&#39;] = child;
                }
                arr = child.children;
            }
            child.eow = true;
        }
    }

}

答案3

得分: 1

Step 1: 在前缀树(也称为Trie)中存储单词

我们可以按照以下方式在前缀树(Trie)中存储单词:根节点指向多达26个节点,每个节点代表存储的单词的第一个字母。然后,每个节点指向多达26个后代节点,每个节点代表根节点到当前节点的前缀后面可能跟随的字母。

示例:我们想要存储CAT,CAN,COT,ART,ARE

               根
              /    
             C        A
           / \        |
         A   O      R
       / \  |      /
     N   T T    E   T

将单词列表放入前缀树中。如果一个单词是另一个单词的扩展(例如,catatonic与cat),则省略较长的单词。也就是说,在创建树时,如果一个单词以具有后代的字母结尾,则从树中删除所有后代。如果一个单词扩展到叶子节点之外,忽略它。

这一步的时间复杂度是O(所有被禁用单词的字母总数)。

Step 2: 查找输入字符串中禁用单词的起始和结束索引

维护一个指向前缀树中字母的引用/指针的链表,最初为空。我将这些指针称为下面的指针。

首先,举个例子:假设输入字符串是CART,禁用的单词如上所述是CAN,CAT,COT,ARE,ART。

    解析C:创建一个指向根节点的'C'子节点的新指针
    解析A:创建一个指向根节点的'A'子节点的新指针
          更新指向'C'的指针以指向它的子节点'A'
    解析R:我们不创建新指针,因为没有以'R'开头的禁用单词
          删除没有'R'子节点的'A'指针
          更新指向另一个'A'以指向其'R'子节点
    解析T:我们不创建新指针,因为没有以'R'开头的禁用单词
          更新唯一剩下的指针以指向它的'T'子节点。注意,这是距离根节点3个节点的叶子节点,因此是长度为3的禁用单词的末尾。存储起始和结束索引以备后用。

请注意,每个指针始终指向当前字母。在每个步骤中,指针要么:
 1. 移动到与新字母匹配的子节点
 2. 如果没有与新字母匹配的子节点,则删除它
 3. 如果根节点有与新字母匹配的子节点,创建它(即,如果以新字母开头的禁用单词)

解析输入字符串。对于每个字母:
 - 对于指针列表中的每个指针,要么1)如果当前字母不是后代,则删除它,或者指向适当的后代,如果当前字母是后代。
 - 如果可能的话,创建一个指向当前字母的指针,作为一个单词的第一个字母。
 - 如果到达一个单词的末尾,那么字符串中的最后几个字符组成一个禁用单词。字符数是前缀树的深度。记下每个单词的起始和结束位置。

这一步的时间复杂度是O(输入字符串长度 * 平均指针数)。我不确定你可能会同时拥有多少个活动指针。

Step 3: 查找不包含任何禁用单词的起始和结束索引的最长子字符串

一旦完成这一步,您可以在线性的额外时间内使用它来查找不包含禁用单词的最长子字符串。

英文:

Step 1: store words in a prefix tree (aka trie)

We can store words in a prefix tree (aka trie) as follows: The root node points to up to 26 nodes, one for the first letter of each word being stored. Then, each node points to up to 26 descendants, one for each letter than can follow the prefix from the root to the current node.

Example: we want to store CAT, CAN, COT, ART, ARE

           root
/    \
C      A
/ \     |
A   O    R
/ \  |   / \
N   T T  E   T

Put the list of words in a prefix tree. If a word is ever an extension of another word (e.g. catatonic vs cat), omit the longer word. I.e., in creating your tree, if a word ever ends on a letter with descendants, delete all descendants from the tree. If a word ever extends past a leaf, ignore it.

This step is O(sum of letter counts of forbidden words)

Step 2: find the starting and ending indices of banned words in the input string

Maintain a linked list of references/pointers to letters in the prefix tree, initially empty. I'll call these pointers below.

First, an example: Say the input string is CART, and the banned words are CAN, CAT, COT, ARE, ART as above.

parse C: create a new pointer to the &#39;C&#39; child of the root
parse A: create a new pointer to the &#39;A&#39; child of the root
update the &#39;C&#39; pointer to point to its child &#39;A&#39;
parse R: we don&#39;t create a new pointer since no banned word starts with &#39;R&#39;
delete the pointer to the &#39;A&#39; that has no &#39;R&#39; child
update the pointer to the other &#39;A&#39; to point to its R child
parse T: we don&#39;t create a new pointer since no banned word starts with &#39;R&#39;
update the only remaining pointer to point to its &#39;T&#39; child. Notice that this is a leaf 3 down from the root, so the end of a length 3 banned word. Store the pair of indices for later use.

Note that every single pointer will always point to the current letter. On every step, a pointer either:

  1. moves to its child that matches the new letter
  2. is deleted if no child matches the new letter
  3. is created if the root has a child matching the new letter (i.e. if a banned word starts with the new letter)

Parse the input string. For each letter:

  • for each pointer in your list of pointers, either 1) delete it if the current letter isn't a descendant, or point to the appropriate descendant if the current letter is a descendant.
  • create a pointer to the current letter as the first letter of a word, if possible.
  • if you reach the end of a word, then the last few characters in the string form a forbidden word. The character count is the depth in the prefix tree. Note the index where each word starts and ends.

This step is O((length of input string) * (average number of pointers)). I'm not sure how many pointers you're likely to have active at once.

Step 3: find the longest substring that doesn't contain the starting and ending indices of any banned word

Once this is done, in linear additional time you can use this to find the longest substring that doesn't contain a forbidden word.

huangapple
  • 本文由 发表于 2023年6月12日 01:52:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/76451768.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定