英文:
A succinct binary tree of sums, array construction and addressing
问题
使用'sum'作为某个任意计算的简写。我有一个计算过程,通过递归地对值对进行求和,从而计算出一个单个的和。未配对的值会保持不变地向上提升,直到它们可以配对。
在这个计算过程中,我正在寻找平衡计算的最佳方法(即访问数组元素/节点所需的操作数),以及在一维数组中对所有节点进行最简洁的编码(即没有间隙、空值或重复值),最好不需要额外的索引数组,因为它不能从简洁的编码中轻松地派生出来,这样它就必须与数组一起保存。
虽然以下是简单的例子,但实际上初始列表中的值的数量可能非常大(2^47或更多)。
例如,给定列表[1, 2, 3, 4],数组非常简单:[10, 3, 7, 1, 2, 3, 4],并且可以很好地分成易于按节点或整行引用的行。
但对于一个包含5个项的列表,树的结构如下:
树 1
15
/ \
/ \
/ \
/ \
10 5
/ \ / \
3 7 5 -
/ \ / \ / \ / \
1 2 3 4 5 - - -
标准映射 left = i*2+1
, right = i*2+2
给出了以下数组:
数组 1
[ 15, 10, 5, 3, 7, 5, nil, 1, 2, 3, 4, 5, nil, nil, nil]
该数组有4个空值,并且列表中的最后一个元素'5'重复了2次。
为了改进这个问题,我们可以隐含表示'5'的重复,并移除空值:
数组 2
[15, 10, 3, 7, 1, 2, 3, 4, 5]
这个数组更加紧凑。树的结构是相同的,但在概念上看起来有点像:
树 2
15
/ \
/ \
10 \
/ \ \
3 7 \
/ \ / \ \
1 2 3 4 5
在数组 2的编码中,我有4行:
1. [1, 2, 3, 4]
2. [3, 7]
3. [10, 5]
4. [15]
行1、2和4可以简单地作为对数组 2的引用,允许我在原地计算结果,无需分配或复制。非常快速。然而,行3包含两个不连续的单元格中的值。我必须打破用于其他行的简单逻辑,并可能添加复制、索引或存储以进行映射。
我可以构建完整/平衡的子树(例如索引1-7,即1、2、3、4的树),但当奇数个项出现在不同的行上时,它们似乎不总是如此对齐。例如,考虑一个初始列表包含6个元素的树。
英文:
Using 'sum' as a short hand for some arbitrary computation. I have a process that computes a single sum from a list of values by recursively summing pairs of values. Un-paired values are promoted up the tree unaltered until they can be paired.
Given this computation, I'm in search of the best way to balance computation (i.e. number of operation required to access array elements/nodes), and the most succinct encoding of all nodes in a 1 dimensional array (i.e. no gaps, nil values, or repeated values), and preferably without an additional index array that cannot be easily derived from the succinct encoding so that it would have to be saved along with the array.
Although the following are simple examples, in reality the number of values in the initial list can be extraordinarily large (2^47 or more).
For example, given the list [1, 2, 3, 4], the array is trivial: [10, 3, 7, 1, 2, 3, 4], and split nicely into rows that are easy to address by node, or as a reference to the entire row.
But for a 5 item list the tree looks like this:
Tree 1
15
/ \
/ \
/ \
/ \
10 5
/ \ / \
3 7 5 -
/ \ / \ / \ / \
1 2 3 4 5 - - -
The standard mapping left = i*2+1
, right = i*2+2
gives us this array:
Array 1
[ 15, 10, 5, 3, 7, 5, nil, 1, 2, 3, 4, 5, nil, nil, nil]
This array has 4 nil values, and the last element in the list '5' is repeated 2 times.
To improve this we can imply the repetition of the 5, and remove the nil values:
Array 2
[15, 10, 3, 7, 1, 2, 3, 4, 5]
Which is much more compact. This tree is the same, but conceptually looks a bit like:
Tree 2
15
/ \
/ \
10 \
/ \ \
3 7 \
/ \ / \ \
1 2 3 4 5
In the Array 2 encoding I have 4 rows:
1. [1, 2, 3, 4]
2. [3, 7]
3. [10, 5]
4. [15]
Rows 1, 2 and 4 can simply be references into Array 2 allowing me to compute results 'in-place' with no allocations or copies. Very fast. Row 3 however, contains values in two non-contiguous cells. I have to break the simple logic used for the other rows, and possibly add copy, indexing or storage for a map.
I can construct complete/balanced sub trees (such as indexes 1-7, the tree for 1, 2, 3, 4), but it seems like they will not always be so nicely aligned when the odd number of items appears at different rows depending on input length. For example consider a tree with an initial list of 6 elements.
答案1
得分: 1
假设你的树在最后一行(最多的节点)上有N个节点。
如果你存储只向上传播的节点,那么你的树的总节点数在2N-1和2N-1+log2(N)之间。准确的总节点数由OEIS A120511给出。其中,最多有floor(2 + log2(N-1))个节点被复制/传播。
树有floor(2 + log2(N-1))行。行数作为N(最后一行的元素数)的函数是OEIS A070941。
这种树的行数非常低。例如,如果最后一行有2^40 ≈ 1,000,000,000,000个节点,那么树中只有42行。对于2^64个节点,只有66行。因此,如果你需要每行进行一些操作,开销并不高。
一个简单的对数时间函数可以计算出行数和总节点数,给定最后一行的节点数N:
# 考虑根节点
rows = 1
total = 1
curr_left = N
While (curr_left > 1):
rows = rows + 1
total = total + curr_left
curr_left = (curr_left + 1) / 2
End While
其中/
表示整数除法,即舍弃/截断/向零舍入任何小数部分。同样,对于最后一行有2^64个节点,上述循环只执行65次。
当我们知道树中的总节点数和行数时,我们可以使用另一个对数时间循环来计算树的每一行第一个元素的偏移量和该行的节点数:
first_offset = []
nodes = []
curr_row = rows - 1
curr_offset = total - N
curr_left = N
While (curr_left > 1):
nodes[curr_row] = curr_left
first_offset[curr_row] = curr_offset
curr_row = curr_row - 1
curr_left = (curr_left + 1) / 2
curr_offset = curr_offset - curr_left
}
first_offset[0] = 0
nodes[0] = 1
与前面一样,对于最后一行有2^64个节点,上述循环只执行65次。
树上的每个行的元素在内存中是连续的,如果我们使用从零开始的索引,并且N是最后一行的节点数,并且应用上述方法,则:
-
rows
是树的行数 -
total
是树的总节点数 -
如果
r >= 0
且r < rows
,则第r
行有nodes[r]
个节点 -
行
r
、列c
上的节点的数组索引是first_offset[r] + c
-
行
r
、列c
上的节点(其中r > 0
)在行r-1
、列c/2
上有一个父节点,其数组索引为first_offset[r-1] + c/2
-
行
r
、列c
上的节点(其中r < rows - 1
)在行r+1
、列2*c
上有一个左子节点,其数组索引为first_offset[r+1] + 2*c
-
行
r
、列c
上的节点(其中r < rows - 1
且c < nodes[r] - 1
)在行r+1
、列2*c+1
上有一个右子节点,其数组索引为first_offset[r+1] + 2*c + 1
-
行
r
、列c
上的节点(其中r < rows - 1
且c < nodes[r] - 1
)同时有左子节点和右子节点
该数组是紧凑的,除了向上传播的节点(因此,对于大小为1TB的数据集,可能只有几十个节点)外,不浪费任何存储空间。
如果将最后一行的节点数与数组一起存储(例如,在数组数据之后作为额外的uint64_t
),所有读取器都可以恢复total
、rows
、first_offset[]
和nodes[]
,并轻松导航树。(但是,请注意,使用的不仅仅是数组索引,还使用“列”和“行”,并使用它们来推导出数组索引。)
由于first_offset[]
和nodes[]
数组最多只有几十个条目,它们应该在缓存中保持热度,并且使用它们不会影响性能。
请注意,并非所有的树大小都符合上述第二段中的规则。例如,只有两个节点的树是没有意义的:为什么要复制根节点?
如果你知道树的大小(total
)是有效的,你可以使用二分搜索在O(log2(total)log2log2(total))的时间复杂度内基于total
找到N
,或者如果使用简单循环,则为O((log2(total))²)。记住,total
在2N-1和2*N-1+log2(N)之间。反过来,N
不能大于(N + 1)/2
,也不能小于(N + 1)/2 - log2(total)
,因为total
大于N
,所以log2(N)
小于log2(total)
。因此,二分搜索可以实现为:
Function Find_N(total):
Nmax = (total + 1) / 2
Nmin = Nmax - log2(total)
t = Total(Nmin)
If t == total:
Return Nmin
Else if t < total:
Return "Bug!"
End if
t = Total(Nmax)
if t == total:
Return Nmax
Else if t > total:
Return "Bug!"
End if
Loop:
N = (Nmin + Nmax) / 2
If N == Nmin:
Return "Invalid tree size!"
End If
t = Total(N)
If t > total:
Nmax = N
Else if t < total:
Nmin = N
Else:
return N
End If
End Loop
End Function
请记住,即使在树中有2^64个节点,上述函数最多只调用Total
函数1 + log2(64)
= 6次。由于通常每个程序调用只需要一次,因此开销实际上是可以忽略的。
你可以使用log(x)/log(2)
计算log2(x)
,使用<math.h>
中的log2()
函数(但由于double
的精度小于uint64_t
,我会将结果加上+1
,或者使用ceil()
向正无穷大舍入),甚至可以使用一个简单的循环:
Function ulog2(value):
result = 0
While (value > 0):
result = result + 1
value = value / 2
End While
Return result
End Function
其中再次,/
表示整数除法。
英文:
Let's assume your tree has N
nodes on the final (most numerous) row.
If you do store the nodes that are only propagated upwards, your tree has between 2*N-1
and 2*N-1+log2(N)
nodes, total. The exact total number of nodes is given by OEIS A120511. Of these, at most floor(2 + log2(N-1))
are copied/propagated nodes.
The tree has floor(2 + log2(N-1))
rows. The number of rows as a function of N
(the number of elements on the final row) is OEIS A070941.
The number of rows in such trees is quite low. For example, if you have 2<sup>40</sup> ≈ 1,000,000,000,000 nodes in the final row, you only have 42 rows in the tree. For 2<sup>64</sup> nodes, you have just 66. Therefore, if you need some operation per row, it is not a high overhead.
A simple logarithmic-time function can compute the number of rows and the total number of nodes, given the number of nodes in the final row N
:
# Account for the root node
rows = 1
total = 1
curr_left = N
While (curr_left > 1):
rows = rows + 1
total = total + curr_left
curr_left = (curr_left + 1) / 2
End While
where /
denotes integer division, i.e. any fractional part is discarded/truncated/rounded towards zero. Again, for 2<sup>64</sup> nodes in the final row, the above loops only 65 times.
When we know the total number of nodes in the tree, and the number of rows, we can use another logarithmic-time loop to compute the offset of the first element on each row of the tree, and the number of nodes on that row:
first_offset = []
nodes = []
curr_row = rows - 1
curr_offset = total - N
curr_left = N
While (curr_left > 1):
nodes[curr_row] = curr_left
first_offset[curr_row] = curr_offset
curr_row = curr_row - 1
curr_left = (curr_left + 1) / 2
curr_offset = curr_offset - curr_left
}
first_offset[0] = 0
nodes[0] = 1
As before, for 2<sup>64</sup> nodes in the final row, the above loops only 65 times.
All elements on a row are consecutive in memory, and if we use zero-based indexing, and N
is the number of nodes on the final row, and we apply the above, then
-
rows
is the number of rows in the tree -
total
is the total number of nodes in the tree -
There are
nodes[r]
nodes on rowr
, ifr >= 0
andr < rows
-
Array index for node on row
r
, columnc
isfirst_offset[r] + c
-
Node on row
r
, columnc
, withr > 0
, has a parent on rowr-1
, columnc/2
, at array indexfirst_offset[r-1] + c/2
-
Node on row
r
, columnc
, withr < rows - 1
, has a left child on rowr+1
, column2*c
, at array indexfirst_offset[r+1] + 2*c
-
Node on row
r
, columnc
, withr < rows - 1
andc < nodes[r] - 1
, has a right child on rowr+1
, column2*c+1
, at array indexfirst_offset[r+1] + 2*c + 1
-
Node on row
r
, columnc
, withr < rows - 1
andc < nodes[r] - 1
, has both a left and a right child
This array is compact, and other than the nodes that get propagated upwards (so, maybe a few dozen nodes for a terabyte-sized dataset), wastes no storage.
If the number of nodes in the final row is stored with the array (for example, as an extra uint64_t
following the array data), all readers can recover total
, rows
, first_offset[]
, and nodes[]
, and easily navigate the tree. (However, note that instead of just the array index, you use the "column" and "row" instead, and derive the array index using those.)
Because first_offset[]
and nodes[]
arrays have at most a few dozen entries, they should stay hot in caches, and using them should not harm performance.
Note that not all tree sizes are valid for the rules stated in the second paragraph above. For example, a tree with two nodes makes no sense: why would you duplicate the root node?
If you do know that the tree size (total
) is valid, you can find N
based on total
in O(log2(total)*log2log2(total))
time complexity using a binary search, or in O((log2(total))²)
if you use a simple loop. Remember, total
is between 2*N-1
and 2*N-1+log2(N)
. Conversely, N
cannot be greater than (N + 1)/2
, or smaller than (N + 1)/2 - log2(total)
, because total
is greater than N
, and therefore log2(N)
is less than log2(total)
. So, a binary search could be implemented as
Function Find_N(total):
Nmax = (total + 1) / 2
Nmin = Nmax - log2(total)
t = Total(Nmin)
If t == total:
Return Nmin
Else if t < total:
Return "Bug!"
End if
t = Total(Nmax)
if t == total:
Return Nmax
Else if t > total:
Return "Bug!"
End if
Loop:
N = (Nmin + Nmax) / 2
If N == Nmin:
Return "Invalid tree size!"
End If
t = Total(N)
If t > total:
Nmax = N
Else if t < total:
Nmin = N
Else:
return N
End If
End Loop
End Function
Keep in mind that even with 2<sup>64</sup> nodes in the tree, the above function makes at most 1 + log2(64)
= 6 calls to Total
, a function implementing the first pseudocode snippet in this answer. Since you typically need this only once per program invocation, the overhead is truly irrelevant.
You can calculate log2(x)
using log(x)/log(2)
, using the log2()
function from <math.h>
since C99 (but since double
has less precision than uint64_t
, I would add +1
to the result, or round it towards positive infinity using ceil()
, just to be sure), or even using a simple loop:
Function ulog2(value):
result = 0
While (value > 0):
result = result + 1
value = value / 2
End While
Return result
End Function
where once again, /
denotes integer division.
答案2
得分: 0
听起来你正在询问二叉树的简洁编码方法。对于这个问题,只需使用前序遍历来存储节点数据,并添加一个额外的位来表示叶子节点和内部节点。编码和解码的算法非常简单,在维基百科的一篇文章中有详细介绍,所以我就不在这里重复了。这种编码方法非常接近于信息论中最佳的可能性。可能没有必要再寻找更好的编码方法。
英文:
It sounds like you're asking for a succinct encoding of a binary tree. For this it's sufficient to store the node data in pre-order with one additional bit to denote leaf vs internal node. The algorithms for encoding and decoding are rather simple and given in a Wikipedia article, so I won't repeat them here. This encoding is very close to the information-theoretic best possible. It's probably not worth looking for a better one.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论