Imagine that I have a tree of three levels

    |->B-->|->C
A-->|      |->D
    |->E-->|->F
           |->G

Then, I have another process (TreeToPathProcess) that generates the 4 paths from the root to the leaves.


Now, I am exposing an endpoint where my clients ask for an ID for each path (the whole requirement is too complicated to explain). I have to be able to associate an ID with each path, and always the same ID for the same path.

Example:

The first time I run TreeToPathProcess, it generates the following:

ID=100, path=A->B->C
ID=101, path=A->B->D
ID=102, path=A->E->F
ID=103, path=A->E->G

The next time I run the same process, I need to get the same result as before.

Is that possible?

Do you have any idea that could help me solve this issue?

Just to clarify, in my real paths I get many hundreds of elements.


Score: 2



Fundamental informatics preamble

We immediately have a bit of a pigeonhole principle issue here. Presumably you're thinking about a specific format for these IDs (perhaps an int). Some formats (int, long, double, and fixed-width String, for example) have a finite set of values - for example, int has about 4 billion values but no more than that: from -2^31 up to 2^31 - 1. Whilst 4 billion is a lot, your tree structure has an infinite space (that means: any particular tree structure could have any number of nodes in it, not that all trees are infinitely large), and therefore, in theory, what you want is impossible - you can't map an infinite universe onto a finite one, not without overlap / error.

In practice, the same principle applies to UUIDs (128-bit, so a finite universe), which are used by all sorts of unrelated parties for infinite universes. And yet, UUIDs are fine; nobody is complaining about errors or collisions. The saving grace is pragmatics: the odds that 2 computers both rolling up a random 128-bit value end up at the same value are many, many orders of magnitude smaller than the odds of errant cosmic rays causing bit flips in your memory banks and causing problems that way; it's just not going to happen / so unlikely that you should be worrying about other things. Even with all the world rolling up UUIDs all the time, no collisions.
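To put a rough number on "so unlikely", the usual birthday-bound approximation can be sketched as below. It assumes the IDs behave like uniformly random values, which is reasonable for random UUIDs and approximately so for a good cryptographic hash, but not for something like String.hashCode(). The class and method names are made up for illustration.

```java
final class CollisionOdds {
    // Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1)).
    // Assumes every ID is an independent, uniformly random value of the given width.
    static double collisionProbability(double n, int bits) {
        double pairsOverSpace = n * (n - 1) / Math.pow(2, bits + 1);
        // expm1 keeps precision when the probability is tiny.
        return -Math.expm1(-pairsOverSpace);
    }

    public static void main(String[] args) {
        // One million IDs at different widths.
        System.out.println(collisionProbability(1_000_000, 32));  // very close to 1
        System.out.println(collisionProbability(1_000_000, 64));  // about 2.7e-8
        System.out.println(collisionProbability(1_000_000, 128)); // on the order of 1e-27
    }
}
```

The takeaway is that the ID width matters far more than the number of paths: 32 bits is risky at scale, while 64 bits and up are comfortable for most practical tree sizes.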

Still, there's an important distinction here: do you want your IDs to theoretically be able to cover even infinite trees? That implies more or less that a list of IDs is all you need to completely reconstitute the entire tree: after all, if there is a unique mapping from any imaginable tree node to an ID, then I can 'map back' any ID to the exact tree node, without having the tree. This is all a lot of highly fundamental informatics wonkery to push you towards a conclusion that you might initially find distasteful:

The best answer is highly likely to involve a hashing algorithm that cannot actually guarantee uniqueness; but for all practical purposes, is unique.

The above wonkery is just highlighting that any attempt to go: "Well, if you just...." to get around that restriction is merely indicative of a lack of understanding of just (heh) how much difficulty is hiding behind "just" in that sentence.

Mapping tree paths to IDs

This gets us to the right answer (or, rather, if this isn't right, what you do want is incredibly complicated, unless tree paths have serious limitations, such as 'never has a depth of more than 4 and each node can be described by a single uppercase ASCII letter' - in which case you need to ask a new question and add these limits to it):

You first describe the path in a unique fashion. If path nodes are literally 'strings', you can for example simply concatenate these strings with slashes in between, i.e. turn what your snippet shows as ID=100 into the string "/A/B/C".
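A minimal sketch of that first step, assuming each node can be reduced to a plain name string (PathStrings and toPathString are illustrative names, not part of your code). Note the assumption that names never contain the separator: otherwise the two-node path "A/B", "C" and the three-node path "A", "B", "C" would produce the same string, so you would need to escape the names or pick a separator that cannot occur.

```java
import java.util.List;

final class PathStrings {
    // Joins node names into a canonical path such as "/A/B/C".
    // Assumes no name contains '/'; escape names or change the separator otherwise.
    static String toPathString(List<String> nodeNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : nodeNames) {
            sb.append('/').append(name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPathString(List.of("A", "B", "C"))); // prints /A/B/C
    }
}
```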

Then, hash that string. You could use Java's string hasher (just "/A/B/C".hashCode()), which produces 32-bit output. Alternatively, use a hash algorithm that is somewhat more robust, such as SHA-256. This produces (hence the name) 256-bit output, which is rather large (encoding that using Base64 without padding, you end up with a 43-character string). If you want it smaller, just XOR bit ranges on top of each other, e.g. if you want 128-bit output, just XOR the top 128 bits onto the bottom 128 bits. Base64-ing that, you end up with 22 characters. If you want IDs to be numeric, the easiest answer is to apply this 'XOR on top of each other' one more time, to end up with 64 bits, which fits in a long.
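The folding step can be sketched as follows. This is only an illustration under stated assumptions: the class and helper names (HashFolding, sha256, foldInHalf, base64) are invented here, and the Base64 lengths in the comments are for the URL-safe encoder with padding stripped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class HashFolding {
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // XORs the second half of the array onto the first half and returns the result.
    static byte[] foldInHalf(byte[] in) {
        int half = in.length / 2;
        byte[] out = new byte[half];
        for (int i = 0; i < half; i++) {
            out[i] = (byte) (in[i] ^ in[i + half]);
        }
        return out;
    }

    static String base64(byte[] b) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(b);
    }

    public static void main(String[] args) {
        byte[] h256 = sha256("/A/B/C");   // 32 bytes
        byte[] h128 = foldInHalf(h256);   // 16 bytes
        byte[] h64 = foldInHalf(h128);    // 8 bytes
        System.out.println(base64(h256).length()); // 43
        System.out.println(base64(h128).length()); // 22
        System.out.println(ByteBuffer.wrap(h64).getLong()); // numeric ID
    }
}
```

Folding twice like this yields the same 8 bytes as XOR-ing every later byte of the digest into position i % 8, so either formulation gives the same long.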

It does not guarantee uniqueness. However, in practice, it will be, unless someone is messing with you and intentionally picking strings that are known to collide. If you want to fight this somewhat, it can help to hash in some additional static data, such as the path depth (i.e. hash up "/1A/2B/3C").

This looks something like:

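import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public long pathToId(List<Node> nodesInPath) throws NoSuchAlgorithmException {
  Charset charset = StandardCharsets.UTF_8;
  MessageDigest digest = MessageDigest.getInstance("SHA-256");
  for (int i = 0; i < nodesInPath.size(); i++) {
    digest.update((byte) '/');
    digest.update((byte) i); // mix in the depth
    // getName() is an assumed accessor; use whatever uniquely names your node
    digest.update(nodesInPath.get(i).getName().getBytes(charset));
  }
  byte[] h = digest.digest();
  // hash all bytes into the first 8 (longs are 8 bytes)
  for (int i = 8; i < h.length; i++) h[i % 8] ^= h[i];
  return ByteBuffer.wrap(h).getLong();
}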

Note, of course, that the preamble cuts both ways: By hashing, you make the job of going from a hash value (an ID) back to the indicated tree path difficult - essentially you have to calculate all IDs first. Whichever tree path has the same ID - that's the path they meant with that ID. Presumably that's what you wanted.
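That reverse lookup could be sketched like this: compute the ID of every known path once and keep an ID-to-path table, failing loudly should two different paths ever land on the same ID. IdIndex and build are hypothetical names, and String.hashCode() merely stands in for whichever hash function you settle on.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

final class IdIndex {
    // Builds an ID -> path lookup table and throws if two different paths share an ID.
    static Map<Long, String> build(List<String> paths, ToLongFunction<String> idOf) {
        Map<Long, String> byId = new HashMap<>();
        for (String path : paths) {
            long id = idOf.applyAsLong(path);
            String previous = byId.putIfAbsent(id, path);
            if (previous != null && !previous.equals(path)) {
                throw new IllegalStateException("ID collision: " + previous + " vs " + path);
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/A/B/C", "/A/B/D", "/A/E/F", "/A/E/G");
        // String.hashCode() is only a placeholder for the real hash.
        Map<Long, String> byId = build(paths, p -> (long) p.hashCode());
        System.out.println(byId.get((long) "/A/E/F".hashCode())); // prints /A/E/F
    }
}
```

The collision check doubles as a cheap safety net: if it ever fires, you know the chosen ID width is too small for your data.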

I want perfect mapping

Okay, just, shove slashes in between. Voila. Your path with id 100 should turn into the ID "/A/B/C". Yes, this means the entire path data is in the 'ID' and the ID can theoretically be infinitely large. Certainly they aren't compact. But, if you don't like it - reread the Fundamental informatics preamble which proves you can't have perfect mapping to short IDs.


Score: -1






Why not use a UUID attached to each node to build your path?
Too long, really?

Why not use a randomly generated unique string then, say between 5 and 10 characters long?

A unique path could be built from each node's randomly generated string:

path : "/"+node1.rsg1+"/"+node2.rsg2+"/"+node3.rsg3 // rsg: randomly generated string

For a tree of only 3 levels, I don't think your use case has any blocking issue.
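A rough sketch of the random-string idea, under some loud assumptions: nodes are identified by a unique name, tokens are 8 characters long, and the token table is persisted somewhere between runs (not shown), since otherwise the same path would get a different ID after a restart. NodeTokens and its methods are invented for this example. Random tokens can in principle repeat across nodes, so a real implementation would also want to check for that.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NodeTokens {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private final SecureRandom random = new SecureRandom();
    // Assumed to be loaded from and saved to durable storage in real use.
    private final Map<String, String> tokenByNode = new HashMap<>();

    // Returns the node's token, generating it the first time the node is seen.
    String tokenFor(String nodeName) {
        return tokenByNode.computeIfAbsent(nodeName, n -> randomToken(8));
    }

    private String randomToken(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Builds "/" + token1 + "/" + token2 + ... for the given nodes.
    String pathId(List<String> nodeNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : nodeNames) {
            sb.append('/').append(tokenFor(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        NodeTokens tokens = new NodeTokens();
        String first = tokens.pathId(List.of("A", "B", "C"));
        String second = tokens.pathId(List.of("A", "B", "C"));
        System.out.println(first.equals(second)); // prints true
    }
}
```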

