Unique Id for a collection of objects

Question

Imagine that I have a tree of three levels:

    |->B->C
A-->|  |->D
    |------->E->F
             |->G

Then, I have another process (TreeToPathProcess) that generates the 4 paths from the root to the leaves:

A->B->C
A->B->D
A->E->F
A->E->G

Now, I am exposing an endpoint where my clients ask for an ID for each path (the whole requirement is too complicated to explain). I have to be able to associate an ID with each path, and always the same ID for the same path.

Example:

The first time I run TreeToPathProcess, it generates the following:

ID=100, path=A->B->C
ID=101, path=A->B->D
ID=102, path=A->E->F
ID=103, path=A->E->G

The next time I run the same process, I need to get the same result as before.

Is that possible?

Do you have any ideas that could help me solve this?

Just to clarify, my real paths have many hundreds of elements.

Answer 1

Score: 2

Fundamental informatics preamble

We immediately have a bit of a pigeonhole principle issue here. Presumably you're thinking about a specific format for these IDs (perhaps an int). Some formats (int, long, double, and fixed-width String, for example) have a limited set of values. For example, int has about 4 billion values and no more than that: from -2^31 up to 2^31 - 1. Whilst 4 billion is a lot, your tree structure has an infinite space (that means: any particular tree structure could have any number of nodes in it, not that all trees are infinitely large), and therefore what you want is, in theory, impossible: you can't map an infinite universe onto a finite one, not without overlap / error.

In practice, the same principle applies to UUIDs (128-bit, so a finite universe), which are used by all sorts of unrelated parties for infinite universes. And yet, UUIDs are fine; nobody is complaining about errors or collisions. The saving grace is pragmatics: the odds that 2 computers both rolling up a random 128-bit value end up at the same value are many, many orders of magnitude smaller than the odds of errant cosmic rays causing bit flips in your memory banks and causing problems that way. It's so unlikely that you should be worrying about other things. Even with all the world rolling up UUIDs all the time, no collisions.
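To put rough numbers on "so unlikely", the birthday bound estimates the chance of at least one collision among n uniformly random b-bit IDs as about 1 - exp(-n(n-1)/2^(b+1)). The sketch below evaluates that estimate. The class and method names are made up for illustration, and it assumes the IDs behave like uniformly random values, which is only an approximation for hash output.

```java
// Hypothetical helper, not part of the original answer: estimates collision odds
// under the assumption that IDs are uniformly random values of the given bit width.
final class BirthdayBound {

    /** Approximate probability that at least two of n random IDs coincide. */
    static double collisionProbability(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0; // number of distinct ID pairs
        double space = Math.pow(2.0, bits);        // number of possible IDs
        // 1 - exp(-pairs / space), computed via expm1 to stay accurate for tiny values
        return -Math.expm1(-pairs / space);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(100_000, 32));    // 32-bit IDs, 100k paths
        System.out.println(collisionProbability(1_000_000, 64));  // 64-bit IDs, 1M paths
        System.out.println(collisionProbability(1_000_000, 128)); // 128-bit IDs, 1M paths
    }
}
```

Under that assumption, a hundred thousand paths with 32-bit IDs are already more likely than not to contain a collision, while a million paths with 64-bit IDs collide with odds on the order of one in tens of millions.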

Still, there's an important distinction here: do you want your IDs to theoretically be able to cover even infinite trees? That implies, more or less, that a list of IDs is all you need to completely reconstitute the entire tree: after all, if there is a unique mapping from any imaginable tree node to an ID, then I can 'map back' any ID to the exact tree node without having the tree. This is all a lot of highly fundamental informatics wonkery to push you towards a conclusion that you might initially find distasteful:

The best answer is highly likely to involve a hashing algorithm that cannot actually guarantee uniqueness; but for all practical purposes, is unique.

The above wonkery is just highlighting that any attempt to go "Well, if you just...." to get around that restriction is merely indicative of a lack of understanding of just (heh) how much difficulty is hiding behind "just" in that sentence.

Mapping tree paths to IDs

This gets us to the right answer (or rather, if this isn't right, what you want is incredibly complicated, unless tree paths have serious limitations such as 'never has a depth of more than 4 and each node can be described by a single uppercase ASCII letter' - in which case you need to ask a new question and add these limits to it):

You first describe the path in a unique fashion. If path nodes are literally 'strings', you can for example simply concatenate these strings with slashes in between, i.e. turn the path your snippet lists as ID=100 into the string "/A/B/C".

Then, hash that string. You could use Java's string hasher (just "/A/B/C".hashCode()), which produces 32-bit output. Alternatively, use a hash algorithm that is somewhat more robust, such as SHA-256. This produces (hence the name) 256-bit output, which is rather large (encoding that using Base64, you end up with a 43-character string, not counting padding). If you want it smaller, just XOR bit ranges on top of each other, e.g. if you want 128-bit output, just XOR the top 128 bits onto the bottom 128 bits. Base64-ing that, you end up with 22 characters. If you want IDs to be numeric, the easiest answer is to apply this 'XOR on top of each other' one more time to end up with 64 bits, which fits in a long.
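As a rough illustration of the hash-then-fold idea, the sketch below hashes a path string with SHA-256, XOR-folds the digest down to 128 and then 64 bits, and prints the unpadded Base64 length at each width. It also shows why the 32-bit String.hashCode() is the weaker choice: two different path strings with the same hash code are easy to construct. The class and method names here are assumptions made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Illustrative only: class and method names are not from the original answer.
final class PathHashSizes {

    /** SHA-256 of the path string: always 32 bytes (256 bits). */
    static byte[] sha256(String path) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(path.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** XORs the second half of the input onto the first half, halving the length. */
    static byte[] foldInHalf(byte[] in) {
        int half = in.length / 2;
        byte[] out = new byte[half];
        for (int i = 0; i < half; i++) {
            out[i] = (byte) (in[i] ^ in[i + half]);
        }
        return out;
    }

    /** URL-safe Base64 without trailing '=' padding. */
    static String base64(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] h256 = sha256("/A/B/C");
        byte[] h128 = foldInHalf(h256);
        byte[] h64 = foldInHalf(h128);
        System.out.println(base64(h256).length()); // 256 bits
        System.out.println(base64(h128).length()); // 128 bits
        System.out.println(base64(h64).length());  // 64 bits

        // "Aa" and "BB" have the same hashCode, so these two paths collide as well.
        System.out.println("/A/Aa".hashCode() == "/A/BB".hashCode());
    }
}
```

Folding twice in this way XORs the same bytes together as folding every byte onto the first 8 in a single pass, so either form yields the same 64-bit value.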

Hashing does not guarantee uniqueness. However, in practice, the IDs will be unique unless someone is messing with you and intentionally picking strings that are known to collide. If you want to fight this somewhat, it can help to hash in some additional static data, such as the path depth (i.e. hash up "/1A/2B/3C").

This looks something like:

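import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Node is your own path element type; it is assumed to expose getName().
public long pathToId(List<Node> nodesInPath) throws NoSuchAlgorithmException {
  Charset charset = StandardCharsets.UTF_8;
  MessageDigest digest = MessageDigest.getInstance("SHA-256");
  for (int i = 0; i < nodesInPath.size(); i++) {
    // feed a separator, the depth, and the node name into the hash
    digest.update((byte) '/');
    digest.update((byte) i);
    digest.update(nodesInPath.get(i).getName().getBytes(charset));
  }
  byte[] h = digest.digest();
  // fold all bytes into the first 8 (longs are 8 bytes)
  for (int i = 8; i < h.length; i++) h[i % 8] ^= h[i];
  return ByteBuffer.wrap(h).getLong();
}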

Note, of course, that the preamble cuts both ways: by hashing, you make the job of going from a hash value (an ID) back to the indicated tree path difficult. Essentially you have to calculate all IDs first; whichever tree path has the same ID is the path they meant with that ID. Presumably that's what you wanted.
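One way to do that reverse lookup is to compute the ID of every known path once and keep the result in a map. The sketch below is a minimal version under some assumptions: paths are represented as lists of node names rather than Node objects, the hashing mirrors the SHA-256 folding described above, and the class and method names are invented for the example. It also fails loudly if two different paths ever produce the same ID, so a collision cannot go unnoticed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and the List<String> path representation are assumptions.
final class PathIndex {

    /** Deterministic 64-bit ID: SHA-256 over separator, depth and name, folded to 8 bytes. */
    static long pathToId(List<String> namesInPath) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is a required algorithm
        }
        for (int i = 0; i < namesInPath.size(); i++) {
            digest.update((byte) '/');
            digest.update((byte) i);
            digest.update(namesInPath.get(i).getBytes(StandardCharsets.UTF_8));
        }
        byte[] h = digest.digest();
        for (int i = 8; i < h.length; i++) h[i % 8] ^= h[i];
        return ByteBuffer.wrap(h).getLong();
    }

    /** Maps every ID back to its path; throws if two different paths share an ID. */
    static Map<Long, List<String>> buildIndex(List<List<String>> allPaths) {
        Map<Long, List<String>> byId = new HashMap<>();
        for (List<String> path : allPaths) {
            List<String> previous = byId.putIfAbsent(pathToId(path), path);
            if (previous != null && !previous.equals(path)) {
                throw new IllegalStateException("ID collision: " + previous + " vs " + path);
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        List<List<String>> paths = List.of(
                List.of("A", "B", "C"), List.of("A", "B", "D"),
                List.of("A", "E", "F"), List.of("A", "E", "G"));
        Map<Long, List<String>> index = buildIndex(paths);
        long id = pathToId(List.of("A", "E", "G"));
        System.out.println(id + " -> " + index.get(id));
    }
}
```

Because the ID depends only on the node names and their order, running this again on the same tree produces the same IDs, with nothing to persist between runs.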

I want perfect mapping

Okay, just shove slashes in between. Voila. Your path with ID 100 should turn into the ID "/A/B/C". Yes, this means the entire path data is in the 'ID', and the ID can theoretically be infinitely large. Certainly these IDs aren't compact. But if you don't like it, reread the Fundamental informatics preamble, which proves you can't have a perfect mapping to short IDs.
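A minimal sketch of that lossless mapping, and of going back from the ID to the path, might look like the following. It assumes every path has at least one node and that node names are non-empty and never contain a slash; if names can contain "/", some escaping scheme would be needed. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: assumes non-empty node names that never contain '/'.
final class PathStringIds {

    /** Joins the node names into an ID such as "/A/B/C". */
    static String toId(List<String> namesInPath) {
        return "/" + String.join("/", namesInPath);
    }

    /** Recovers the node names from an ID produced by toId. */
    static List<String> fromId(String id) {
        // drop the leading slash, then split on the remaining ones
        return Arrays.asList(id.substring(1).split("/"));
    }

    public static void main(String[] args) {
        String id = toId(List.of("A", "B", "C"));
        System.out.println(id);
        System.out.println(fromId(id));
    }
}
```

The ID grows with the path, so for paths with hundreds of elements it will be long, but no lookup table is needed to map it back.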

Answer 2

Score: -1

Why not attach a UUID to each node and use those to build your path?
Too long, really?

Why not use a randomly generated unique string then, let's say between 5 and 10 characters long?

A unique path could be built from each node's randomly generated string:

path : "/"+node1.rsg1+"/"+node2.rsg2+"/"+node3.rsg3 // rsg: randomly generated string

For a tree of only 3 levels, I don't think this would be a blocker for your use case.

huangapple
  • Published on July 18, 2023 at 16:38:13
  • Link to this article: https://go.coder-hub.com/76710924.html