英文:
Cluster a stream of items with constraints
问题
我正在寻找一种将输入的项目序列(不可重复的流)进行分区的方法。我确定这是来自图论的某种标准k-部分算法,但我记不起它的名称/方法 - 你能帮忙吗?
每个项目的形式是(x1, x2, x3, x4, …, xn)
,即一个包含n
个组件的元组,其中x1
来自可能的值集合X1
。
示例:n=3
,X1={A, B, C, -}
,X2={0, 1, 2, -}
,X3={green, red, blue, -}
。示例项目:(A, 0, green)
,(C, 2, blue)
,(A, 2, green)
,(A, -, -)
,(-, 1, red)
,等等。
-
值是一个特殊值,表示“此值尚未知道”。每个项目至少有一个已知的值,即(-, -, -)
是不可能的。
现在,聚类必须将所有在其值上没有冲突的项目分组在一起。例如,(A, -, blue)
可以与(A, 2, -)
分组,因为它们唯一的重叠在于第一个组件,其中两者共享相同的值A
=> 这两个项目应该最终位于同一群中。
但是(A, -, blue)
不能与(B, -, blue)
分组,因为存在冲突A != B
。
类似地,(A, -, -)
不能直接与(-, -, green)
分组,因为它们没有已知的共同值。
聚类必须是可传递的,即如果有另一个项目(A, -, green)
将它们连接起来,(A, -, -)
和(-, -, green)
应该最终位于同一群中。这三个项目都应该成为同一群的一部分。
在实践中,n
大约为10,每个Xi
可能值集合的组件i
都在数十亿范围内。因此,所有项目的潜在集合实际上是无限的。我有一个包含几亿个这样的项目的输入流,需要在它们进入时迅速分组在一起。
如果上述约束条件不导致唯一的聚类解决方案,贪婪算法是可以的。但必须满足以下约束:
- 一个群集中的两个项目不能包含冲突的值。
- 不同的群集在至少一个冲突值上不同。
- 为了合并两个项目,它们必须共享至少一个值。
英文:
I'm looking to partition a sequence (non-repeatable stream) of items coming in. I'm sure this is some standard k-partite algo from graph theory, but I cannot remember its name / approach – can you help?
Each item is of the form (x1, x2, x3, x4, …, xn)
, i.e. a tuple of n
components, where x1
comes from a set of possible values X1
.
Example: n=3
, X1={A, B, C, -}
, X2={0, 1, 2, -}
, X3={green, red, blue, -}
. Example items: (A, 0, green)
, (C, 2, blue)
, (A, 2, green)
, (A, -, -)
, (-, 1, red)
, etc.
The -
value is a special value that means "this value is not known yet". Each item has at least one value known, i.e. (-, -, -)
is not possible.
Now the clustering must group together all items that have no conflict in their values. For example (A, -, blue)
can be grouped with (A, 2, -)
because the only overlap is in the first component, where both share the same value A
=> both items should end up in the same cluster.
But (A, -, blue)
cannot be grouped with (B, -, blue)
because there's a conflict A != B
.
Similarly (A, -, -)
cannot be directly grouped with (-, -, green)
because they share no known common value.
The clustering must be transitive, in the sense that (A, -, -)
and (-, -, green)
should end up in the same cluster if there's another item (A, -, green)
that connects them. All three should become a part of the same cluster.
In practice, n is ~10 and each Xi
set of possible values for component i
is in the billions. So the potential set of all items is practically infinite. I have an input stream of a few hundred million of such items that I need to group together quickly, as they come in.
A greedy algo is OK, in case the constraints above don't lead to a unique clustering solution. But these constraint must be satisfied:
- No two items in a cluster contain conflicting values.
- Different clusters differ in at least one conflicting value.
- In order to merge two items, they must share at least one value.
答案1
得分: 1
确实,聚类解决方案不一定唯一。考虑(A,-,蓝色)、(A,2,-)和(A,3,-)。第一个可以与最后两个中的任何一个配对,但最后两个之间存在冲突。
这个问题被称为流数据聚类(维基链接)或在线聚类(一个用于搜索算法的有用查询术语)。
对于一种贪婪的方法,你可以在给定一个新项时,只需选择你已经拥有的第一个可接受的聚类。由于不允许冲突,每个已有的聚类可以用一个单一的元组来表示。一个元组可能一开始包含任意数量的未知,例如(A, -,-)。任何新项都必须与此兼容,因此聚类表示可以更新为例如(A, -,绿色),然后稍后是(A, 14, 绿色)。
然后算法是将新项与当前的聚类表示进行比较(这个子步骤可能需要和/或允许通过使用适当的数据结构来进行优化),如果不冲突,则将其添加到现有聚类中(如果需要,则更新表示),否则启动一个新的聚类。
除非您对聚类过程有更具体的要求,否则我认为您不需要在这里使用图论。一个方面是您是否可以使用整数来编码所有集合 - 如果可能的话,这将比处理字符串比较更快。
英文:
Indeed, the clustering solution is not necessarily unique. Consider (A,-,blue) (A,2,-) and (A,3,-). The first pairs with either of the last, but the last two are in conflict.
This problem is known as stream clustering (wiki link) or online clustering (a useful query term to search for algorithms).
For a greedy approach, you could, given a new item, suffice with picking the first admissible cluster that you already have. Each cluster that you have can be represented by a single tuple since you allow no conflicts. A tuple may start out with any number of unknowns, e.g. (A, -, -). Any new item must be compatible with this, so the cluster representation can be updated to e.g. (A, -, green) and then later (A, 14, green).
The algorithm is then to check a new item against the current cluster representations (this sub-step may require and/or allow optimisation by using appropriate datastructures), add it to an existing cluster (and update the representation if needed) if it is not conflicting, start a new cluster otherwise.
Unless you have more specific requirements on the clustering process I don't think you need any graph-theory here. One aspect is whether you can encode all your sets using integers - if possible this will be quicker than dealing with string comparisons.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论