Combining tuples based on transitive relation

huangapple go评论53阅读模式
英文:

Combining tuples based on transitive relation

问题

我有一个元组的列表:

[(10,22), (10,20), (10,69), (34,18), (18,17), (89,990), (86,80), (174,175), (543,542)]

我想要得到一个结果如下:

[(10,22,20,69), (34,18,17), (89,990), (86, 80), (174,175), (543,542)]

我想要将具有至少一个共同元素的元组分组在一起。我该如何做?

英文:

I have a list of tuples:

[(10,22), (10,20), (10,69), (34,18), (18,17), (89,990), (86,80), (174,175), (543,542)]

I'd like to obtain a result like this:

[(10,22,20,69), (34,18,17), (89,990), (86, 80), (174,175), (543,542)]

I want to group together any of the tuples that have at least one element in common. How would I do that?

答案1

得分: 4

以下是翻译的内容:

这段代码看起来像你想要传递闭包,但实际上你还需要一个对称闭包,以便可以合并(10,22)(10,20)(10,69)

def merge_sets(sets):
    for i in range(len(sets) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            if sets[i] & sets[j]: # 检查交集(非空 -> True)
                sets[j] |= sets[i] # 合并第i个集合到第j个集合
                sets.pop(i) # 移除第i个集合
                break # 终止内部循环,继续下一个i
    return sets

tuples = [(10,22), (10,20), (10,69), (34,18), (18,17), \
          (89,990), (86,80), (174,175), (543,542)]
sets = [set(t) for t in tuples]
merged_sets = merge_sets(sets)
merged_tuples = [tuple(s) for s in merged_sets]
# merged_tuples:
# [(20, 69, 22, 10), (17, 34, 18), (89, 990), (80, 86), (174, 175), (542, 543)]

这段代码很简单。它首先将元组转换为集合,以便我们可以方便地检查共享元素(集合交集)。最后,我们将集合转换回元组。

函数merge_sets将所有集合彼此比较。每当两个集合相交时,我们将它们合并并继续。

有三点值得进一步解释:

  • 我们可以安全地将第i个集合合并到第j个集合中,因为我们已经将前者与中间所有的集合进行了比较。 (如果我们以相反的方式做,我们将不得不比较我们已经部分比较的集合:在中间可能会有与第j个集合相交但与第i个集合不相交的集合。 为了得到一个最小的示例,请尝试输入[(1, 2), (3, 4), (2, 3)](从左到右遍历)或[(2, 3), (3, 4), (1, 2)](从右到左遍历,如我的代码中所示)。 因为我们正在以与循环进展方向相同的方向合并集合(这里是从右到左;选择在算法上是任意的,但请参见下一个要点),所以我们可以简单地继续循环,不必在每次合并后重新开始它们。
  • 两个循环都是倒数计数的,因为从右边删除Python list元素比从左边快。 (要从左边进行快速删除,我们可以使用collections.deque来代替。)
  • 为什么这个问题最好用传统的循环结构来解决(而不是更复杂的Python结构,比如列表推导或map)的原因是我们正在处理可变对象(我们将修改列表以及包含的项目)。 即使使用传统循环,我们也需要小心以正确的方式修改项目(通过在循环进展的方向上合并它们;请参见第一点)。

改进的解决方案:

因为在“最坏”的情况下,所有元组可能都是不同的,所以似乎没有办法避免彼此比较以查找相交元素,导致n*(n+1)/2次比较。但我们可以通过减少重复的部分比较来略微加快速度。例如,如果集合i合并到集合j,但中间还有另一个集合m与集合i相交,那么当达到集合m时,我们将不得不比较集合m与新集合j。但由于上面的循环工作方式,如果它们最初来自集合i,我们已经比较了集合j到集合m的那些元素。

为了避免这种部分重复,我们简单地将j移到外部循环(仍然是从右到左的遍历),然后标记与它相交的右侧集合。之后,我们将它们一起合并到集合j中。

def merge_sets(sets):
    for j in range(len(sets) - 2, -1, -1): # 从len(sets)-2到0的循环
        # 标记从len(sets)-1到j+1的索引,这些集合相交:
        merge_into_j = [i for i in range(len(sets) - 1, j, -1) if sets[j] & sets[i]]
        for i in merge_into_j: # merge_into_j包含从右到左的索引
            sets[j] |= sets[i] # 合并第i个集合到第j个集合
            sets.pop(i) # 移除第i个集合
    return sets

(注意列表推导可能在每次迭代中看到逐渐减小的len(sets)值。)


感谢用户Crazy Chucky提醒我,非空集合会评估为True

英文:

What was written [in the question's initial version, which was later edited] sounds like you want the transitive closure, but actually you need a symmetric closure as well, so that you can merge (10,22), (10,20), and (10,69).

def merge_sets(sets):
    for i in range(len(sets) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            if sets[i] & sets[j]: # check for intersection (non-empty -> True)
                sets[j] |= sets[i] # merge i-th set into j-th set
                sets.pop(i) # remove i-th set
                break # terminate inner loop and continue with next i
    return sets

tuples = [(10,22), (10,20), (10,69), (34,18), (18,17), \
          (89,990), (86,80), (174,175), (543,542)]
sets = [set(t) for t in tuples]
merged_sets = merge_sets(sets)
merged_tuples = [tuple(s) for s in merged_sets]
# merged_tuples:
# [(20, 69, 22, 10), (17, 34, 18), (89, 990), (80, 86), (174, 175), (542, 543)]

This code is straightforward. It first converts the tuples to sets, so that we can conveniently check for shared elements (set intersection). At the end, we convert the sets back to tuples.

The function merge_sets compares all sets to each other. Whenever two sets intersect, we merge them and continue.

Three points deserve further explanation:

  • We can safely merge the i-th set into the j-th set because we already compared the former to all the sets in between. (If we did it the other way round, we would have to compare sets we already partially compared: there could be sets in the middle that intersect with the j-th set but not with the i-th set. For a minimal example, try the input [(1, 2), (3, 4), (2, 3)] (left-to-right traversal) or [(2, 3), (3, 4), (1, 2)] (right-to-left traversal, as in my code).) Because we are merging sets in the same direction as the one in which the loops progress (here: right to left; the choice is algorithmically arbitrary, but see the next bullet point), we can simply continue with the loops and don't have to restart them at an earlier point after every merger.
  • Both loops count down because removing Python list elements is faster from the right than from the left. (To get fast removals from the left, we could instead use collections.deque.)
  • The reason why this problem is best tackled with traditional loop constructs (as opposed to fancier Python constructs, such as list comprehensions or map) is that we are dealing with mutable objects (we will modify both the list as well as the containing items). Even with traditional loops, we need to be careful to modify items in the right way (by merging them in the direction of the loop progression; see first bullet point).

Improved solution:

Because all tuples might be distinct in the "worst" case, there seems to be no way around comparing all of them (or their corresponding sets) to each other for intersecting elements, leading to n*(n+1)/2 comparisons. However, we can speed things up slightly by reducing the number of repeated partial comparisons. For example, if set i is merged into set j, but there is another set m in the middle that intersects with set i, we will, upon reaching set m, have to compare set m with the new set j. But due to the way the loops above work, we have already compared those elements from set j to set m if they originally came from set i.

To avoid such partial duplication, we simply move j to the outer loop (still with right-to-left traversal) and then mark the sets to the right of set j which intersect with it. Only afterwards, we then merge them all in one go into set j.

def merge_sets(sets):
    for j in range(len(sets) - 2, -1, -1): # loop from len(sets)-2 down to 0
        # mark indices from len(sets)-1 down to j+1 for which the sets intersect:
        merge_into_j = [i for i in range(len(sets) - 1, j, -1) if sets[j] & sets[i]]
        for i in merge_into_j: # merge_into_j contains indices in RTL direction
            sets[j] |= sets[i] # merge i-th set into j-th set
            sets.pop(i) # remove i-th set
    return sets

(Note how the list comprehension might see progressively smaller values of len(sets) upon each iteration.)


Credit goes to user Crazy Chucky for reminding me that non-empty sets evaluate to True.

答案2

得分: 2

以下是您要翻译的内容:

"我想知道是否可以通过借鉴Lover of Structure的答案中的思路,并将O(n^2)部分转换为整数运算来加速。这是一种尝试。在这个版本中,我不需要假设输入是一个集合列表:

from collections import defaultdict

tuples = [(10, 22), (10, 20), (10, 69), (34, 18), (18, 17),
          (89, 990), (86, 80), (174, 175), (543, 542)]
values_to_index = defaultdict(int)

# 这部分是对元素的单次遍历
for i, t in enumerate(tuples):
    for v in t:
        values_to_index[v] |= (1 << i)

# 使用整数操作来合并集合
# 不需要弹出元素,构建一个新列表似乎更简单,因为它将比输入短,
# 减少搜索次数
merged_sets = []
for s in values_to_index.values():
    # 为了避免不必要的不相交检查,跟踪合并现有集合的元素并在进行清理时处理
    merges = []
    for i in range(len(merged_sets)):
        if merged_sets[i] & s:
            merged_sets[i] |= s
            merges.append(i)
    if len(merges) == 0:
        merged_sets.append(s)
    elif len(merges) > 1:
        k = merges[0]
        for i in merges[1:]:
            merged_sets[k] |= merged_sets.pop(i)

# 现在我们有一个整数列表,告诉我们每个元组属于哪个集合
output = [set() for _ in range(len(merged_sets))]
for output_index, bits in enumerate(merged_sets):
    while bits:
        b = bits & (~bits + 1)
        output[output_index].update(tuples[b.bit_length() - 1])
        bits ^= b
result = [tuple(s) for s in output]

位迭代器来自https://stackoverflow.com/a/8898977/2988730。

英文:

I wonder if you can get a speedup by taking the idea behind Lover of Structure's answer and moving the O(n^2) portion to operate on integers. Here is an attempt. In this version, I don't need to assume that the input is a list of sets:

from collections import defaultdict

tuples = [(10, 22), (10, 20), (10, 69), (34, 18), (18, 17),
          (89, 990), (86, 80), (174, 175), (543, 542)]
values_to_index = defaultdict(int)

# This part is a single pass over the elements
for i, t in enumerate(tuples):
    for v in t:
        values_to_index[v] |= (1 &lt;&lt; i)

# Merging the sets is done with integer operations
# Instead of popping elements, it seems simpler to construct
# a new list, since that will be shorter than the input,
# reducing the number of searches
merged_sets = []
for s in values_to_index.values():
    # To avoid unnecessary checks for disjointness, keep track of the
    # elements that merge pre-existing sets and clean up as you go
    merges = []
    for i in range(len(merged_sets)):
        if merged_sets[i] &amp; s:
            merged_sets[i] |= s
            merges.append(i)
    if len(merges) == 0:
        merged_sets.append(s)
    elif len(merges) &gt; 1:
        k = merges[0]
        for i in merges[:0:-1]:
            merged_sets[k] |= merged_sets.pop(i)

# Now we have a list of integers telling us which set each tuple goes to
output = [set() for _ in range(len(merged_sets))]
for output_index, bits in enumerate(merged_sets):
    while bits:
        b = bits &amp; (~bits + 1)
        output[output_index].update(tuples[b.bit_length() - 1])
        bits ^= b
result = [tuple(s) for s in output]

The bit iterator is courtesy of https://stackoverflow.com/a/8898977/2988730.

答案3

得分: 1

连接在一起的兼容元组,即那些至少共享一个元素的元组。每个链将是原始数据的一个连接组件。

def cc_from_tuples(pairs:list[tuple]) -> list[tuple]:
    """获取连接组件"""
    ccs = []
    # 数据的新实例
    s_pairs = set(pairs) # 或者 pairs.copy()
    # 对元组进行迭代分类
    while s_pairs:
        # 初始化连接组件
        cc = set(s_pairs.pop())
        # 临时集合
        s_pairs_tmp = set()
        # 将元组分类为连接组件的一部分或下一个候选连接组件
        for t in s_pairs:
            if cc.intersection(t):
                cc.update(t)
            else:
                s_pairs_tmp.add(t)
        # 添加已完成的连接组件
        ccs.append(tuple(cc)) # 转换
        # 迭代步骤
        s_pairs = s_pairs_tmp

    return ccs


ts = [(10,22), (10,20), (10,69), (34,18), (18,17), (89,990), (86,80), (174,175), (543,542)]
cc_from_tuples(ts)
#[(17, 18, 34), (10, 20, 69, 22), (542, 543), (89, 990), (80, 86), (174, 175)]

ts = [(1, 0), (2, 3), (1, 4)]
cc_from_tuples(ts)
#[(0, 1, 4), (2, 3)]

ts = [(1, 0), (2, 3), (1, 2)]
cc_from_tuples(ts)
#[(0, 1, 2, 3)]

如果结果需要按元组长度降序排序,请修改函数的返回语句如下所示:
return sorted(ccs, key=len, reverse=True)

英文:

Chain together compatible tuples, i.e. those which share at least an element. Each chain will be a connected component of the original data.

def cc_from_tuples(pairs:list[tuple]) -&gt; list[tuple]:
    &quot;&quot;&quot;Get Connected Components&quot;&quot;&quot;
    ccs = []
    # new instance of the data
    s_pairs = set(pairs) # or pairs.copy()
    # iterative classification of the pairs
    while s_pairs:
        # initialize connected component
        cc = set(s_pairs.pop())
        # temporary set
        s_pairs_tmp = set()
        # classify tuples as either part of the cc or as next candidate cc
        for t in s_pairs:
            if cc.intersection(t):
                cc.update(t)
            else:
                s_pairs_tmp.add(t)
        # add completed connected component
        ccs.append(tuple(cc)) # casting
        # iteration step
        s_pairs = s_pairs_tmp

    return ccs


ts = [(10,22), (10,20), (10,69), (34,18), (18,17), (89,990), (86,80), (174,175), (543,542)]
cc_from_tuples(ts)
#[(17, 18, 34), (10, 20, 69, 22), (542, 543), (89, 990), (80, 86), (174, 175)]

ts = [(1, 0), (2, 3), (1, 4)]
cc_from_tuples(ts)
#[(0, 1, 4), (2, 3)]

ts = [(1, 0), (2, 3), (1, 2)]
cc_from_tuples(ts)
#[(0, 1, 2, 3)]

If result needs to be sorted by decreasing length of tuples modify the return-statement of the function as follow
return sorted(ccs, key=len, reverse=True)

huangapple
  • 本文由 发表于 2023年3月4日 01:48:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/75630324-2.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定