Sequence alignment of multiple slices of ints in golang

Question


I am trying to figure out how I can align slightly imperfect binary slices using golang. The following four slices all align correctly with different offsets. However, not every bit is the same (marked below) so I can't just compare raw chunks.

func main() {

	// Match all four slices up (ignoring occasional errors)
	s1 := []int16{0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1}
	s2 := []int16{ /*                     */ 0, 1, 1, 0, 0, 0, 1, 1, 1, 1}
	//                                       ^              ^
	s3 := []int16{0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1}
	//               ^
	s4 := []int16{ /*            */ 0, 0, 0, 1, 1, 1, 0, 0}

	// make([][]int16, 3) would leave three nil slices in front of the
	// appended ones; allocate length 0 with capacity 4 instead.
	slices := make([][]int16, 0, 4)
	slices = append(slices, s1, s2, s3, s4)

	offsets := forgivingSyncHere(slices)
	_ = offsets // forgivingSyncHere is the function I'm asking for
}

Here is a playground example: https://play.golang.org/p/zqJ_4qLc8O

Answer 1

Score: 2


It depends on what your "cost" function is, where your goal is to minimize your "cost".

A cost function could be something like this. The idea is that a "mismatch" is more costly (say twice as costly) than a position where there isn't anything to match, which we'll call an "overrun". Count the cases where a[i] != b[i + offset] for each pair a, b drawn from s1, s2, s3, s4 and double that. Then add the absolute value of each offset for each pairing (6 pairings for 4 arrays), which accounts for the overruns at the beginning, and finally add the overruns at the end.

Sample cost function:

func cost(sn [][]int16, offsets [][]int) float64 {
	// cost accumulator
	c := 0.0

	// the factor of how much more costly a mismatch is than an overrun
	mismatchFactor := 2.0

	// define what you want, but here is an example of what I said above
	for i1 := 0; i1 < len(sn); i1++ {
		for i2 := i1 + 1; i2 < len(sn); i2++ {
			c += mismatchFactor * float64(diff(sn[i1], sn[i2], offsets[i1][i2]))
			// leading overrun: |offset| positions with nothing to match
			c += math.Abs(float64(offsets[i1][i2]))
			// trailing overrun: sn[i2] ends at offset+len(sn[i2]) inside sn[i1]
			c += math.Abs(float64(len(sn[i1]) - offsets[i1][i2] - len(sn[i2])))
		}
	}
	return c
}

// offset is the offset of s2 within s1
func diff(s1 []int16, s2 []int16, offset int) int {
	// index into s1, index into s2, total number of differences
	i1, i2, d := 0, 0, 0
	if offset >= 0 {
		i1 += offset
	} else {
		i2 -= offset
	}
	for i1 < len(s1) && i2 < len(s2) {
		if s1[i1] != s2[i2] {
			d++
		}
		i1++
		i2++
	}
	return d
}

Make your cost function however you want; this is just an example. However, assuming you have a cost function, a brute-force algorithm is pretty easy to come up with, and you can then try to optimize it :). There are many ideas. This is very similar to string search algorithms and edit distance; Wikipedia and Google have many results.
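To make the brute-force idea concrete, here is a minimal, self-contained sketch for a single pair of slices: score every possible offset with a pairwise cost (mismatches count double, non-overlapping positions count once) and keep the cheapest. It repeats `diff` so it runs on its own; `pairCost` and `bestOffset` are illustrative names, not from any library.

```go
package main

import "fmt"

// diff counts mismatches in the overlap of s1 and s2 when s2 is
// shifted right by offset (same logic as the diff above).
func diff(s1, s2 []int16, offset int) int {
	i1, i2, d := 0, 0, 0
	if offset >= 0 {
		i1 = offset
	} else {
		i2 = -offset
	}
	for i1 < len(s1) && i2 < len(s2) {
		if s1[i1] != s2[i2] {
			d++
		}
		i1++
		i2++
	}
	return d
}

// pairCost scores one offset: mismatches count double, and every
// non-overlapping position at either end counts once as an overrun.
func pairCost(s1, s2 []int16, offset int) int {
	start, end := offset, len(s1)-offset-len(s2)
	if start < 0 {
		start = -start
	}
	if end < 0 {
		end = -end
	}
	return 2*diff(s1, s2, offset) + start + end
}

// bestOffset brute-forces every offset in [-len(s2), len(s1)] and
// returns the cheapest one together with its cost.
func bestOffset(s1, s2 []int16) (best, bestCost int) {
	best, bestCost = -len(s2), pairCost(s1, s2, -len(s2))
	for off := -len(s2) + 1; off <= len(s1); off++ {
		if c := pairCost(s1, s2, off); c < bestCost {
			best, bestCost = off, c
		}
	}
	return best, bestCost
}

func main() {
	s1 := []int16{0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1}
	s2 := []int16{0, 1, 1, 0, 0, 0, 1, 1, 1, 1}
	off, c := bestOffset(s1, s2)
	fmt.Println("best offset:", off, "cost:", c)
}
```

Extending this to all 6 pairings is a matter of looping `bestOffset` over each pair, or searching the joint offset space against the `cost` function above.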

Disclaimer: all of this is untested :), but it should get you started.

Answer 2

Score: 0


Similarity can be described using Levenshtein distance, aka edit distance - the number of insertions, deletions and mutations a string has to undergo to become another.

This allows you to talk quantitatively about the degree of similarity. For example, the ratio of the edit distance to the shorter of the two string lengths might be a reasonable metric, but this depends on what exactly you mean by similar.
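For concreteness, edit distance itself is a short dynamic program. This is the textbook two-row Levenshtein implementation adapted to int16 slices (an illustration, not code from the answer):

```go
package main

import "fmt"

// levenshtein returns the edit distance (insertions, deletions,
// substitutions) between two int16 slices using the classic
// two-row dynamic-programming table.
func levenshtein(a, b []int16) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // distance from empty prefix of a
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	a := []int16{0, 1, 1, 0, 1, 1, 0, 0}
	b := []int16{0, 1, 0, 1, 1, 0, 0} // a with the bit at index 2 deleted
	fmt.Println(levenshtein(a, b))    // 1
}
```

The ratio mentioned above would then be `float64(levenshtein(a, b)) / float64(min(len(a), len(b)))`.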

From there you can find a lower bound for the length and number of identical substrings in each pair of strings to be compared. In your case, eyeballing your example, it looks like 4 bits are what you consider a match. In other words, if you use 4-bit chunks and check for exact matches in a given pair of sequences, what does the number of matching chunks tell you about their similarity?
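That chunking idea can be sketched directly: cut one sequence into fixed-size blocks and count how many occur verbatim anywhere in the other. The function names here are made up for illustration, and the chunk size 4 just mirrors the eyeballed match length.

```go
package main

import "fmt"

// countChunkMatches splits a into consecutive blocks of size k and
// counts how many of them occur verbatim somewhere in b.
func countChunkMatches(a, b []int16, k int) int {
	matches := 0
	for i := 0; i+k <= len(a); i += k {
		if containsChunk(b, a[i:i+k]) {
			matches++
		}
	}
	return matches
}

// containsChunk reports whether chunk appears as a contiguous run in s.
func containsChunk(s, chunk []int16) bool {
	for i := 0; i+len(chunk) <= len(s); i++ {
		ok := true
		for j := range chunk {
			if s[i+j] != chunk[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	s1 := []int16{0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1}
	s3 := []int16{0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1}
	fmt.Println(countChunkMatches(s3, s1, 4))
}
```

A high count of matching chunks relative to the number of chunks is then a cheap proxy for similarity.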

If you do exact comparisons on small enough substrings, you're guaranteed at least a few matches even if the differences are evenly spaced (if they're clustered, larger sections will have exact matches). A chunk size of roughly the length of the shorter string divided by the maximum edit distance is small enough to guarantee this.

An exact solution for generic approximate matching of strings that takes the position of the similarities into account is computationally intensive (and a related and well studied problem is the longest common subsequence), and furthermore has to be applied for each pair of sequences to be compared, not each sequence independently. In contrast, choosing an appropriate chunk size and indexing each string by its possible substrings only depends on each string in isolation, and may provide an approximate answer. You could of course note the position and further constrain that as well.
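One concrete reading of "indexing each string by its possible substrings" is an n-gram (sliding-window) index. The sketch below keys a map by the window's printed form purely as a convenient hashable key; `ngramIndex` and `sharedNGrams` are illustrative names, not from any library.

```go
package main

import "fmt"

// ngramIndex maps each length-k window of s to the positions where
// it starts. The window is rendered with fmt.Sprint only to obtain a
// comparable map key (e.g. "[0 1 1 0]").
func ngramIndex(s []int16, k int) map[string][]int {
	idx := make(map[string][]int)
	for i := 0; i+k <= len(s); i++ {
		key := fmt.Sprint(s[i : i+k])
		idx[key] = append(idx[key], i)
	}
	return idx
}

// sharedNGrams counts windows of a whose contents also occur in the
// index built from another sequence.
func sharedNGrams(a []int16, idx map[string][]int, k int) int {
	n := 0
	for i := 0; i+k <= len(a); i++ {
		if _, ok := idx[fmt.Sprint(a[i:i+k])]; ok {
			n++
		}
	}
	return n
}

func main() {
	s1 := []int16{0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1}
	s3 := []int16{0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1}
	idx := ngramIndex(s1, 4)
	fmt.Println(sharedNGrams(s3, idx, 4))
}
```

Because the index depends on each string in isolation, it can be built once per sequence and reused across all pairings; the stored positions can further constrain candidate offsets.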

To sum it up, a naive solution for the problem you're describing is likely to be intractable. In practice, sequence alignment and approximate string matching are done by solving a simpler problem instead, and then double-checking those solutions with a more naive/expensive approach.

Rabin fingerprinting is a more sophisticated method of breaking strings into substrings than n-grams (a sliding window), but it is a stochastic approach and produces variable-sized (though deterministic) substrings, so working out the bounds is a bit more involved.

huangapple
  • Posted on 2017-02-25 01:54:47
  • When reposting, please keep the original link: https://go.coder-hub.com/42445243.html