Golang: 进程运行时间过长。实现拼写检查器

huangapple go评论109阅读模式
英文:

Golang : process took too long. Implementing Spell Checker

问题

我已经用Go实现了一些Peter Norvig的拼写检查算法。

奇怪的是,前三次调用都能正确地给我所需的输出。

但是从第二次开始,它显示"进程花费的时间太长"。

有人可以看一下我的代码,告诉我出了什么问题吗?

以下是可能出错的代码片段。

在英文版本中,相同的代码完全正常工作。

根据语言的不同,UNICODE格式和边界发生了变化,因为英文每个字母占用1个字节,而亚洲语言在这种情况下每个字符占用3个字节。

这是尝试运行与之前完美工作的英文版本相同的算法。
但是这个版本不起作用。

total_set := []string{}
for _, elem := range splits {

	if len(elem.str2) > 3 {
		//删除
		total_set = append(total_set, elem.str1+elem.str2[3:])

		//替换
		for i:=0; i<len(koreanletter)/3; i++ {
			total_set = append(total_set, elem.str1+string(koreanletter[3*i:3*(i+1)])+elem.str2[3:])
		}

		//转置
		if len(elem.str2) > 9 {
			total_set = append(total_set, elem.str1+string(elem.str2[3:6])+string(elem.str2[:3])+elem.str2[9:])
		}

	} else {
		//删除
		total_set = append(total_set, elem.str1)
	}

	//插入
	for _, c := range koreanletter {
		total_set = append(total_set, elem.str1+string(c)+elem.str2)
	}
    return RemoveDuplicateStringArrayForKorean(total_set)
}

英文版本如下。
这个版本完美工作。

//Edits1用于测量字符串之间的距离。
func (model *Model) Edits1(word string) []string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz"

  splits := []Pair{}
  for i := 0; i <= len(word); i++ {
    splits = append(splits, Pair{word[:i], word[i:]})
  }

  total_set := []string{}
  for _, elem := range splits {

    if len(elem.str2) > 0 {
      //删除
      total_set = append(total_set, elem.str1+elem.str2[1:])

      //替换
      for _, c := range alphabet {
        total_set = append(total_set, elem.str1+string(c)+elem.str2[1:])
      }

      //转置
      if len(elem.str2) > 1 {
        total_set = append(total_set, elem.str1+string(elem.str2[1])+string(elem.str2[0])+elem.str2[2:])
      }

    } else {
      //删除
      total_set = append(total_set, elem.str1)
    }

    //插入
    for _, c := range alphabet {
      total_set = append(total_set, elem.str1+string(c)+elem.str2)
    }
  }
  return RemoveDuplicateStringArrayLowerCase(total_set)
}

补充说明:有序参数,现在有三个工作正常。

koreanletter中没有任何字符丢失。

有没有办法可以更具体地查看错误?我只是无法弄清楚。

英文:

http://play.golang.org/p/H5E0ExL85d

I've implemented some Peter Norvig's spelling check algorithm with Go.

It's weird that the FIRST THREE calling works correct giving me the desired output.

But from the second, it is saying "process took too long."

Could anybody look at my code and tell what goes wrong?

Here's the snippet that is possibly going wrong.

Everything works perfect with the same code in English version.

UNICODE format and boundary have changed according to language because English contain 1 byte per alphabet and Asian languages in this case contain 3 bytes per one character.

This is trying to run the same Algorithm as the one for English that was working perfect.
But this is NOT working.

total_set := []string{}
for _, elem := range splits {

	if len(elem.str2) &gt; 3 {
		//deletion
		total_set = append(total_set, elem.str1+elem.str2[3:])

		//replace
		for i:=0; i&lt;len(koreanletter)/3; i++ {
			total_set = append(total_set, elem.str1+string(koreanletter[3*i:3*(i+1)])+elem.str2[3:])
		}

		//transpose
		if len(elem.str2) &gt; 9 {
			total_set = append(total_set, elem.str1+string(elem.str2[3:6])+string(elem.str2[:3])+elem.str2[9:])
		}

	} else {
		//deletion
		total_set = append(total_set, elem.str1)
	}

	//insertion
	for _, c := range koreanletter {
		total_set = append(total_set, elem.str1+string(c)+elem.str2)
	}
    return RemoveDuplicateStringArrayForKorean(total_set)
}

The one for English is below.
This is working perfect.

//Edits1 is to measure the distance between strings.
func (model *Model) Edits1(word string) []string {
  const alphabet = &quot;abcdefghijklmnopqrstuvwxyz&quot;

  splits := []Pair{}
  for i := 0; i &lt;= len(word); i++ {
    splits = append(splits, Pair{word[:i], word[i:]})
  }

  total_set := []string{}
  for _, elem := range splits {

    if len(elem.str2) &gt; 0 {
      //deletion
      total_set = append(total_set, elem.str1+elem.str2[1:])

      //replace
      for _, c := range alphabet {
        total_set = append(total_set, elem.str1+string(c)+elem.str2[1:])
      }

      //transpose
      if len(elem.str2) &gt; 1 {
        total_set = append(total_set, elem.str1+string(elem.str2[1])+string(elem.str2[0])+elem.str2[2:])
      }

    } else {
      //deletion
      total_set = append(total_set, elem.str1)
    }

    //insertion
    for _, c := range alphabet {
      total_set = append(total_set, elem.str1+string(c)+elem.str2)
    }
  }
  return RemoveDuplicateStringArrayLowerCase(total_set)
}

Addition: ordered arguments and now I have three things working.

None of the characters are missing from the koreanletter.

Is there anyway that I can see the error more specifically? I just can't figure out.

答案1

得分: 5

玩弄你的代码后,似乎是你的KoreanKnownEdits2函数花费了太长时间。在你的第四个例子(失败的那个)中,model.KoreanEdits1(input_word)的长度为28197,而第一个model.KoreanEdits1(elem1)的长度为23499,这样就有大约6.62亿个要尝试的情况。似乎程序在前14.7万个之后失败了,因为时间太长(playground)。

任何不需要调用KoreanKnownEdits2的例子似乎都能正常工作,所以我怀疑你应该重写这个函数,避免穷举搜索,或者至少将其限制在一个更合理的大小范围内,如果你想在playground的时间限制内使用它。我还没有详细研究你的代码,无法百分之百确定,但我怀疑西方字母表的26个字母使得英文版本的大小可控,而扩展的韩文字母使得你的输入大小过大,无法在playground的时间限制内处理,无论每个字符的字节编码是多少。

英文:

Playing around with your code, it seems it is your KoreanKnownEdits2 which is taking too long. In your forth example (the one failing), the length of model.KoreanEdits1(input_word) is 28197 and the length of the first model.KoreanEdits1(elem1) is 23499, which makes around 662 millions cases to try. It seems the program is failing after the first 147 thousands, because it takes too long (playground).

Any examples which did not required a call to KoreanKnownEdits2 seem to work, so I suspect you should rewrite this function to avoid the exhaustive search, or at least limit it to a more reasonable size if you want to use it with the playground's time limit. I haven't studied your code in enough details to be 100% certain of that, but I suspect the 26 letters of western alphabet make it manageable for the English version, while the extended Korean alphabet makes the size of your input too large to be processed on the playground's time limit, regardless of the number of bytes each character is encoded with.

huangapple
  • 本文由 发表于 2013年11月6日 15:49:02
  • 转载请务必保留本文链接:https://go.coder-hub.com/19806628.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定