2023年7月10日 14:28:01go评论130阅读模式

英文:

Practical implementation of suffix array

问题

我们如何知道这个算法正确工作？该论文没有提供证明，我也不理解按照前缀相对位置排序是如何保持前缀排序的。
在纯英文中，上述代码中变量 suffixes，pos 和 Entry.prefix 的目的是什么？在C++代码中，它们分别被命名为 L，P 和 nr。
在第一次迭代（step=1）时，suffixes[i].prefix（在C++中为 L[i].nr）成为了一个以 text[i] 开始的长度为2的前缀的起始和结束。suffixes（在C++中为 L）被排序，并且位置被存储在 pos（在C++中为 P）中。在下一次迭代（step=2）中，suffixes[i].prefix 数组是基于 pos[1] 填充的，所以它不再是长度为4的前缀的起始和结束。那么 suffixes[i].prefix 是什么？
在第6页中，说：“后缀数组将在矩阵 P 的最后一行找到”，但这似乎是不正确的。实际上，后缀数组由数组 suffixes（在C++中为 L）的 start（在C++中为 p）值给出。

英文:

Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:

> The algorithm is mainly based on maintaining the order of the string’s
> suffixes sorted by their 2k long prefixes. We shall execute
> m = [log2 n] (ceil) steps, computing the order of the prefixes of
> length 2k at the kth step. It is used an m x n
> sized matrix. Let’s denote by Aik the
> subsequence of A of length 2k starting at position i. The
> position of Aik in the sorted array of
> Ajk subsequences (j = 1, n) is kept in P(k,
> i).
>
> When passing from step k to step k + 1, all the pairs of subsequences
> Aik and Ai+2^kk are
> concatenated, therefore obtaining the substrings of length
> 2k+1. For establishing the new order relation among these,
> the information computed at the previous step must be used. For each
> index i it is kept a pair formed by P(k, i) and P(k,
> i+2^k). The fact that i + 2k may exceed the string’s
> bounds must not bother us, because we shall fill the string with the
> "$" character, about which we shall consider that it’s
> lexicographically smaller than any other character. After sorting, the
> pairs will be arranged conforming to the lexicographic order of the
> strings of length 2k+1. One last thing that must be
> remembered is that at a certain step k there may be several substrings
> Aik = Ajk, and these must
> be labeled identically (P(k, i) must be equal to P(k,
> j)).

The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L, P, and nr, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.

def suffix_array(text: str) -&gt; list[int]:
    @dataclasses.dataclass(order=True)
    class Entry:
        prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
        start: int = dataclasses.field(default_factory=int, compare=False)

    n = len(text)
    m = math.ceil(math.log2(n)) + 2
    pos = [ord(x) for x in text]
    suffixes = [Entry() for _ in range(n)]
    for step in range(1, m):
        cnt = 2 ** (step - 1)
        for i in range(n):
            suffixes[i].prefix[0] = pos[i]
            suffixes[i].prefix[1] = pos[j] if (j := i + cnt) &lt; n else -1
            suffixes[i].start = i
        suffixes.sort()
        for i in range(n):
            if i &gt; 0 and suffixes[i] == suffixes[i - 1]:
                pos[suffixes[i].start] = pos[suffixes[i - 1].start]
            else:
                pos[suffixes[i].start] = i

    return [i.start for i in suffixes]

Now, my questions:

How do we know the algorithm works correctly? The paper provides no proof, and I don't understand how sorting the relative positions of the prefixes is also keeping the prefixes sorted.
In plain English, what are the purposes of the variables suffixes, pos and Entry.prefix in the code above? These are named L, P, and nr, respectively, in the C++ code.
At the first iteration (step=1), suffixes[i].prefix (L[i].nr in C++) becomes the start and the end of a prefix of length 2 starting at text[i]. suffixes (L in C++) is sorted, and positions are stored in pos (P in C++). At the next iteration (step=2), suffixes[i].prefix array is populated based on pos[1], so, it is no longer the start and end of a prefix of length 4. What is suffixes[i].prefix then?
On page 6, it is said "The suffix array will be found on the last row of matrix P", but this appears to be incorrect. The suffix array is actually given by the start (p in C++) values of the array suffixes (L in C++).

答案1

得分: 1

OP在这里。关于这个算法的详细解释可以在Steven Halim的《竞技程序设计3》第6.6.4章中找到。该算法的可视化可以在这里找到。

我明白了，以下是我的笔记。

基本上，这个算法的工作方式有点像基数排序，先对输入的部分进行排序，然后对其余部分进行排序。通过对整数而不是字符串进行排序，该算法避免了使朴素算法变成二次方复杂度的字符串比较。
在每次迭代中，算法执行以下操作：

根据上一次迭代中计算的排名，更新排名对。
根据更新后的排名对对后缀进行排序。
重新计算后缀的排名。为了计算新的排名，我们将第一个后缀的排名设置为新排名r = 0。然后，我们从i = [1..n-1]进行迭代。如果后缀'i'的排名对与已排序顺序中前一个后缀的排名对不同，我们将排名r = i。否则，排名保持为r。

示例：文本="GATAGACA"。

迭代1，考虑长度为2的后缀。排序前和排序后如下：

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
| 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
| 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
| 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
| 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
| 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
| 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
| 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

迭代2，考虑长度为4的后缀。排序前和排序后如下：

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
| 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
| 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
| 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
| 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
| 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
| 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
| 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

迭代3没有显示，考虑长度为8的后缀。

英文:

OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.

I got it now, here are my notes.

Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic.
At each iteration, the algorithm does the following:

Update the ranking pair based on the ranks computed in the previous iteration.
Sort the suffixes based on the updated ranking pairs.
Recompute the ranks of the suffixes. To calculate the new ranks, we set the first one to have new rank r = 0. Then, we iterate from i = [1..n-1]. If the ranking pair of suffix 'i' is different from the ranking pair of the previous suffix in sorted order, we set the rank r = i. Otherwise, the rank stays at r.

Example: text="GATAGACA".

Iteration 1, suffixes of length 2 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
| 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
| 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
| 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
| 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
| 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
| 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
| 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 2, suffixes of length 4 are considered. Before and after sorting:

+---+--------+--------------+     +---+--------+--------------+------+
| i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+     +---+--------+--------------+------+
| 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
| 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
| 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
| 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
| 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
| 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
| 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
| 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
+---+--------+--------------+     +---+--------+--------------+------+

Iteration 3 is not shown, suffixes of length 8 are considered.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

后缀数组的实际实现

问题

答案1

将包含字符串、整数和数组的JSON对象解析为映射。

当转换天真递归硬币问题时出现记忆化错误。

Generate all permutations in go

如何使用jq将这个对象转换为这个对象？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论