后缀数组的实际实现

huangapple go评论130阅读模式
英文:

Practical implementation of suffix array

问题

  1. 我们如何知道这个算法正确工作?该论文没有提供证明,我也不理解按照前缀相对位置排序是如何保持前缀排序的。

  2. 在纯英文中,上述代码中变量 suffixesposEntry.prefix 的目的是什么?在C++代码中,它们分别被命名为 LPnr

  3. 在第一次迭代(step=1)时,suffixes[i].prefix(在C++中为 L[i].nr)成为了一个以 text[i] 开始的长度为2的前缀的起始和结束。suffixes(在C++中为 L)被排序,并且位置被存储在 pos(在C++中为 P)中。在下一次迭代(step=2)中,suffixes[i].prefix 数组是基于 pos[1] 填充的,所以它不再是长度为4的前缀的起始和结束。那么 suffixes[i].prefix 是什么?

  4. 在第6页中,说:“后缀数组将在矩阵 P 的最后一行找到”,但这似乎是不正确的。实际上,后缀数组由数组 suffixes(在C++中为 L)的 start(在C++中为 p)值给出。

英文:

Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:

> The algorithm is mainly based on maintaining the order of the string’s
> suffixes sorted by their 2<sup>k</sup> long prefixes. We shall execute
> m = [log<sub>2</sub> n] (ceil) steps, computing the order of the prefixes of
> length 2<sup>k</sup> at the k<sup>th</sup> step. It is used an m x n
> sized matrix. Let’s denote by A<sub>i</sub><sup>k</sup> the
> subsequence of A of length 2<sup>k</sup> starting at position i. The
> position of A<sub>i</sub><sup>k</sup> in the sorted array of
> A<sub>j</sub><sup>k</sup> subsequences (j = 1, n) is kept in P<sub>(k,
> i)</sub>.
>
> When passing from step k to step k + 1, all the pairs of subsequences
> A<sub>i</sub><sup>k</sup> and A<sub>i+2^k</sub><sup>k</sup> are
> concatenated, therefore obtaining the substrings of length
> 2<sup>k+1</sup>. For establishing the new order relation among these,
> the information computed at the previous step must be used. For each
> index i it is kept a pair formed by P<sub>(k, i)</sub> and P<sub>(k,
> i+2^k)</sub>. The fact that i + 2<sup>k</sup> may exceed the string’s
> bounds must not bother us, because we shall fill the string with the
> "$" character, about which we shall consider that it’s
> lexicographically smaller than any other character. After sorting, the
> pairs will be arranged conforming to the lexicographic order of the
> strings of length 2<sup>k+1</sup>. One last thing that must be
> remembered is that at a certain step k there may be several substrings
> A<sub>i</sub><sup>k</sup> = A<sub>j</sub><sup>k</sup>, and these must
> be labeled identically (P<sub>(k, i)</sub> must be equal to P<sub>(k,
> j)</sub>).

The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L, P, and nr, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.

def suffix_array(text: str) -&gt; list[int]:
    @dataclasses.dataclass(order=True)
    class Entry:
        prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
        start: int = dataclasses.field(default_factory=int, compare=False)

    n = len(text)
    m = math.ceil(math.log2(n)) + 2
    pos = [ord(x) for x in text]
    suffixes = [Entry() for _ in range(n)]
    for step in range(1, m):
        cnt = 2 ** (step - 1)
        for i in range(n):
            suffixes[i].prefix[0] = pos[i]
            suffixes[i].prefix[1] = pos[j] if (j := i + cnt) &lt; n else -1
            suffixes[i].start = i
        suffixes.sort()
        for i in range(n):
            if i &gt; 0 and suffixes[i] == suffixes[i - 1]:
                pos[suffixes[i].start] = pos[suffixes[i - 1].start]
            else:
                pos[suffixes[i].start] = i

    return [i.start for i in suffixes]

Now, my questions:

  1. How do we know the algorithm works correctly? The paper provides no proof, and I don't understand how sorting the relative positions of the prefixes is also keeping the prefixes sorted.
  2. In plain English, what are the purposes of the variables suffixes, pos and Entry.prefix in the code above? These are named L, P, and nr, respectively, in the C++ code.
  3. At the first iteration (step=1), suffixes[i].prefix (L[i].nr in C++) becomes the start and the end of a prefix of length 2 starting at text[i]. suffixes (L in C++) is sorted, and positions are stored in pos (P in C++). At the next iteration (step=2), suffixes[i].prefix array is populated based on pos[1], so, it is no longer the start and end of a prefix of length 4. What is suffixes[i].prefix then?
  4. On page 6, it is said "The suffix array will be found on the last row of matrix P", but this appears to be incorrect. The suffix array is actually given by the start (p in C++) values of the array suffixes (L in C++).

答案1

得分: 1

OP在这里。关于这个算法的详细解释可以在Steven Halim的《竞技程序设计3》第6.6.4章中找到。该算法的可视化可以在这里找到。

我明白了,以下是我的笔记。

基本上,这个算法的工作方式有点像基数排序,先对输入的部分进行排序,然后对其余部分进行排序。通过对整数而不是字符串进行排序,该算法避免了使朴素算法变成二次方复杂度的字符串比较。
在每次迭代中,算法执行以下操作:

  1. 根据上一次迭代中计算的排名,更新排名对。
  2. 根据更新后的排名对对后缀进行排序。
  3. 重新计算后缀的排名。为了计算新的排名,我们将第一个后缀的排名设置为新排名r = 0。然后,我们从i = [1..n-1]进行迭代。如果后缀'i'的排名对与已排序顺序中前一个后缀的排名对不同,我们将排名r = i。否则,排名保持为r。

示例:文本="GATAGACA"。

迭代1,考虑长度为2的后缀。排序前和排序后如下:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
| 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
| 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
| 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
| 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
| 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
| 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
| 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

迭代2,考虑长度为4的后缀。排序前和排序后如下:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
| 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
| 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
| 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
| 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
| 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
| 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
| 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

迭代3没有显示,考虑长度为8的后缀。

英文:

OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.

I got it now, here are my notes.

Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic.
At each iteration, the algorithm does the following:

  1. Update the ranking pair based on the ranks computed in the previous iteration.
  2. Sort the suffixes based on the updated ranking pairs.
  3. Recompute the ranks of the suffixes. To calculate the new ranks, we set the first one to have new rank r = 0. Then, we iterate from i = [1..n-1]. If the ranking pair of suffix 'i' is different from the ranking pair of the previous suffix in sorted order, we set the rank r = i. Otherwise, the rank stays at r.

Example: text="GATAGACA".

Iteration 1, suffixes of length 2 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
| 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
| 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
| 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
| 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
| 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
| 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
| 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 2, suffixes of length 4 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
| 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
| 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
| 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
| 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
| 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
| 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
| 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 3 is not shown, suffixes of length 8 are considered.

huangapple
  • 本文由 发表于 2023年7月10日 14:28:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76651153.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定