Is there an efficient way to detect if a string contains a substring which is in a large set of characteristic strings?

Question

For example, given the string aaaaaaaaaXyz, I want to find out whether it contains any substring that belongs to a set of characteristic strings {'xy','xyz','zzz','cccc','dddd',....}, which may have a million members. Is there an efficient way to do this?

Answer 1

Score: 2

Given that your search set might be very large, I would recommend simply iterating over the set and checking each member for a substring match. Each call performs one contains check per member until the first hit, so the work grows linearly with the size of the set:

    import java.util.Set;

    // Linear scan: test every candidate with String.contains and stop at the first hit.
    public boolean containsSubstring(String input, Set<String> subs) {
        for (String sub : subs) {
            if (input.contains(sub)) {
                return true;
            }
        }
        return false;
    }
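
For readers who prefer streams, the loop above can be condensed into a single anyMatch call. This is only a sketch of the same linear scan, not a faster algorithm; the class name LinearScanDemo and the sample inputs are made up for illustration.

```java
import java.util.Set;

// Hypothetical demo class; the name is not from the original answer.
class LinearScanDemo {
    // Same behavior as the loop: true as soon as one member of subs occurs in input.
    static boolean containsSubstring(String input, Set<String> subs) {
        return subs.stream().anyMatch(input::contains);
    }

    public static void main(String[] args) {
        Set<String> subs = Set.of("xy", "xyz", "zzz", "cccc", "dddd");
        System.out.println(containsSubstring("aaaaaaaaaxyz", subs)); // true
        System.out.println(containsSubstring("aaaaaaaaaXyz", subs)); // false: String.contains is case-sensitive
    }
}
```

Note that with the capital X from the question's example, none of the listed strings matches, because the comparison is case-sensitive.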

Answer 2

Score: 2

First, build a dictionary that groups the strings by their first character:

    import java.util.*;

    // Group the words by first character, so a lookup only visits
    // words that can possibly start at a given position.
    Set<String> stringSet = Set.of("xy", "xyz", "zzz", "zzy", "cccc", "dddd");
    Map<Character, List<String>> dictionary = new HashMap<>();
    for (String word : stringSet) {
        dictionary.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
    }
    System.out.println(dictionary);

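Output (the order inside each list may differ between runs, because Set.of does not guarantee an iteration order):

    {c=[cccc], d=[dddd], x=[xyz, xy], z=[zzy, zzz]}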

Then use this method to check whether the input contains any of the strings:

    // Returns true if any dictionary word starts at some position of input.
    static boolean contains(String input, Map<Character, List<String>> dictionary) {
        for (int i = 0, max = input.length(); i < max; ++i) {
            // Only words that begin with the character at position i can match here.
            List<String> candidates = dictionary.get(input.charAt(i));
            if (candidates == null) continue;
            for (String word : candidates) {
                if (input.startsWith(word, i)) return true;
            }
        }
        return false;
    }
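
To see how this lookup behaves on the string from the question, here is a small self-contained sketch. The class name FirstCharIndexDemo and the helper buildIndex are hypothetical. Lower-casing the input is an assumption about what the asker might want, since the question does not say whether the capital X should match 'xy'.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo class; names are for illustration only.
class FirstCharIndexDemo {
    // Same grouping idea as in the answer: first character -> words starting with it.
    static Map<Character, List<String>> buildIndex(Iterable<String> words) {
        Map<Character, List<String>> index = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue; // charAt(0) would throw on an empty word
            index.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
        }
        return index;
    }

    // True if some indexed word starts at some position of input.
    static boolean contains(String input, Map<Character, List<String>> index) {
        for (int i = 0; i < input.length(); i++) {
            for (String word : index.getOrDefault(input.charAt(i), List.of())) {
                if (input.startsWith(word, i)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Character, List<String>> index =
            buildIndex(List.of("xy", "xyz", "zzz", "zzy", "cccc", "dddd"));
        String input = "aaaaaaaaaXyz";
        System.out.println(contains(input, index));                          // false
        System.out.println(contains(input.toLowerCase(Locale.ROOT), index)); // true
    }
}
```

The first call returns false because the comparison is case-sensitive. Also keep in mind that if many of the million strings share a first character, each position still scans a long list, so the worst case stays close to the plain linear scan.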

Answer 3

Score: 0

Following Clashsoft's hint, I found a Java implementation of the Aho-Corasick algorithm, which is exactly what I was looking for. Thanks, Clashsoft.
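
The answer does not say which implementation was used, so the following is only a minimal sketch of the Aho-Corasick idea, not that library's code. The class name AhoCorasickSketch and the method containsAny are made up for illustration, and ignoring empty patterns is an assumption. The patterns are stored in a trie, each node gets a failure link to the longest proper suffix of its path that is also in the trie, and the text is then scanned once.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only: names are hypothetical, not from any particular library.
class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // longest proper suffix of this node's path that is also a trie path
        boolean terminal; // a pattern ends here, or at a node reachable through fail links
    }

    private final Node root = new Node();

    AhoCorasickSketch(Iterable<String> patterns) {
        // Phase 1: insert every pattern into the trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue; // assumption: empty patterns are ignored
            Node node = root;
            for (char c : p.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        // Phase 2: breadth-first pass to compute failure links.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // If a pattern ends at the suffix node, it also ends at this position.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns true if any pattern occurs in it.
    boolean containsAny(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            if (node.terminal) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher =
            new AhoCorasickSketch(List.of("xy", "xyz", "zzz", "cccc", "dddd"));
        System.out.println(matcher.containsAny("aaaaaaaaaxyz")); // true
        System.out.println(matcher.containsAny("aaaaaaaaaXyz")); // false (case-sensitive)
    }
}
```

Building the automaton takes time proportional to the total length of all patterns. After that, each query does a roughly constant amount of work per input character, no matter how many patterns are in the set, which is why this approach suits a set of a million strings that is queried repeatedly.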

huangapple
  • Posted on August 6, 2020 at 16:57:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63280141.html