Randomly generate all unique pair-wise combinations of elements between two lists in set time

Question

The goal is a generator along these lines, yielding unique (a_element, b_element) pairs in random order, capped at max_iterations:

import random
from typing import List

def _unique_combinations(a: List, b: List, max_iterations: int):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order, up to a maximum
    number of iterations specified by max_iterations.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    iterations = 0

    while iterations < max_iterations:
        # Choose random elements from a and b
        element_a = random.choice(a)
        element_b = random.choice(b)
        if (element_a, element_b) not in generated:
            generated.add((element_a, element_b))
            yield (element_a, element_b)
            iterations += 1

In more detail, I have two lists:

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

And I would like to generate all possible combinations with a Python generator. I know I could do:

combinations = list(itertools.product(a, b))
random.shuffle(combinations)

But that approach has an extreme memory cost: I would have to hold every possible combination in memory (two lists of 10,000 elements already mean 10^8 tuples), even if I only wanted two random unique combinations.

My target is a Python generator whose memory cost grows with the number of iterations requested from it, only reaching the same asymptotic (big-O) memory cost as the itertools approach once all combinations have been requested.

I had this for now:

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break

But it's flawed: it can theoretically run forever if the random.choice calls are unlucky.

I'm looking to modify that existing generator so it generates the indexes randomly within a fixed amount of time. Keeping track of the generated indexes is fine, as that is a linear increase in memory cost, not an exponential one.

How could I modify that random index generator to be bounded in time?
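
For illustration, the intended cheap use case would look something like this (a sketch against the bounded signature above; the printed pairs are just one possible outcome):

gen = _unique_combinations(a, b, max_iterations=2)
print(list(gen))  # e.g. [(3, 'b'), (1, 'd')] -- two unique pairs, without materializing all 16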

Answer 1

Score: 1

Random picking is cheap at the beginning and expensive at the end, and exhaustive listing is expensive at the beginning and cheap at the end. Here's a "best of both worlds" approach where we switch strategies halfway through:

import itertools
import random
from typing import Iterator, TypeVar

_A = TypeVar("_A")
_B = TypeVar("_B")

def unique_product(a: list[_A], b: list[_B]) -> Iterator[tuple[_A, _B]]:
    total = len(a) * len(b)
    results: set[tuple[_A, _B]] = set()

    # For the first half of the results, use random guessing.
    # Until we've exhausted half of the possibility
    # space, we should average < 2 guesses per element, so
    # we can consider this to be amortized O(1) per element.
    result = random.choice(a), random.choice(b)
    while len(results) < total // 2:
        while result in results:
            result = random.choice(a), random.choice(b)
        results.add(result)
        yield result

    # For the second half, build the exhaustive set of
    # remaining results.  We pay an O(n) cost to do this but
    # amortized over the entire generator life it's O(1) per
    # element.  Our space usage goes down after this, not up.
    remaining: list[tuple[_A, _B]] = []
    for result in itertools.product(a, b):
        if result in results:
            results.remove(result)
        else:
            remaining.append(result)
    random.shuffle(remaining)
    while remaining:
        yield remaining.pop()

The main potential problem with this approach is that you're paying a big O(n) cost right in the middle, so even though it washes out when you're looking at the entire run, for some use cases it might be undesirable to have a single arbitrary caller in the middle paying that entire cost all at once, as opposed to paying it up front, or evenly spreading it out among all callers. (I can think of ways you might try to avoid this by doing the swap in another thread, but that adds a lot of complexity. Maybe there's a better way?)

Note that as far as space goes this is pretty optimal, since you max out the space halfway through (with half of the elements in memory) and then the space usage decreases back toward zero since it's now cheaper to keep track of the elements you haven't allocated than the ones you have.
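
A quick sanity check (mirroring the asserts used in the other answers here; my addition, not part of the original answer): every pair should come out exactly once.

import itertools

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
pairs = list(unique_product(a, b))
assert sorted(pairs) == sorted(itertools.product(a, b))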

Answer 2

Score: 1

We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically, we are looking for a generator of the multiplicative group of integers modulo n: if g is a primitive root modulo a prime p, then g, g^2, g^3, ..., g^(p-1) (mod p) visits every value in 1..p-1 exactly once.

We have to pick our prime a little larger than the product len(a)*len(b), which is why we have to skip the sequence values that would otherwise cause index errors.

import random
from math import gcd
import math

def next_prime(number):
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    if number <= 1:
        return 2
    if number % 2 == 0:
        number -= 1
    while True:
        number += 2
        max_check = int(math.sqrt(number)) + 2
        for divider in range(3, max_check, 2):
            if number % divider == 0:
                break
        else:
            return number


def is_primitive_root(a, n):
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True


def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a


def sampler(l):
    close_prime = next_prime(l)
    state = root = find_random_primitive_root(close_prime)
    while state > l:
        state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up

    yield state - 1
    for _ in range(l - 1):
        state = (state * root) % close_prime
        while state > l:
            state = (state * root) % close_prime
        yield state - 1
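
A quick sanity check of sampler (my sketch, not part of the original answer): it should visit every index in range(l) exactly once, in a scrambled order.

for l in (10, 16, 100):
    assert sorted(sampler(l)) == list(range(l))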

Then we use a mapping from 1D -> 2D to "translate" our sequence number into a tuple and yield the result.

def _unique_combinations(a, b):
    cartesian_product_cardinality = len(a) * len(b)
    sequence = sampler(cartesian_product_cardinality)
    len_b = len(b)  # Function calls are expensive in python and this line yields a 10% total speed up
    for state in sequence:
        # e.g. with len_b == 4, state 7 maps to (a[1], b[3])
        yield a[state // len_b], b[state % len_b]


from itertools import product

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = _unique_combinations(a, b)

assert sorted(u) == sorted(product(a, b))

I started benchmarking the various approaches. For merging two lists of length 1000, the divmod solution by @gog (answer 3 below) already underperforms terribly, so I'm going to exclude it from further testing:

kelly took 0.9156949520111084 seconds
divmod took 41.20149779319763 seconds
prime_roots took 0.5146901607513428 seconds
samwise took 0.698538064956665 seconds
fisher_yates took 0.902874231338501 seconds

For the remaining algorithms, I ran the following benchmark:

import pandas as pd
import timeit
import random
from math import gcd

# `approaches` is assumed to be a dict mapping a label to each generator under
# test, e.g. {"prime_roots": _unique_combinations, "samwise": unique_product, ...}

# Define the list lengths to benchmark
list_lengths = [10, 20, 30, 100, 300, 500, 1000, 1500, 2000, 3000, 5000]

num_repetitions = 2

# Collect rows and build the DataFrame once at the end
# (DataFrame.append is deprecated and was removed in pandas 2.0).
rows = []
for approach, function in approaches.items():
    for length in list_lengths:
        a = list(range(length))
        b = list(range(length))

        execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)

        rows.append({
            'Approach': approach,
            'List Length': length,
            'Execution Time': execution_time,
        })

results_df = pd.DataFrame(rows)

[Benchmark plot: execution time vs. list length for each approach]
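
The plot itself did not survive extraction; a minimal way to reproduce it from results_df, assuming matplotlib is available:

import matplotlib.pyplot as plt

# One line per approach: list length on x, measured time on y
for name, group in results_df.groupby('Approach'):
    plt.plot(group['List Length'], group['Execution Time'], label=name)
plt.xlabel('List Length')
plt.ylabel('Execution Time (s)')
plt.legend()
plt.show()
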
All in all, I think all of the approaches are somewhat similar; every tested approach is O(|a|*|b|) time-complexity-wise.

Memory-wise the prime-roots approach wins, simply because every other approach needs to keep track of O(|a|*|b|) elements, whereas the prime-roots approach doesn't require that.

Distribution-wise the prime-roots approach is absolutely the worst, because it's not actually random but rather a difficult-to-predict deterministic sequence. In practice the sequences should be "random" enough.

Credit to this Stack Overflow answer, which inspired the solution.

Answer 3

Score: 0

Represent each combination by a single integer n = position_in_b * len(a) + position_in_a. Keep generating such numbers randomly; once a number has already been seen, just increment it mod len(a) * len(b) until it hasn't:

import random


def _unique_combinations(a, b):
    sa = len(a)
    sb = len(b)
    sx = sa * sb
    seen = set()
    while len(seen) < sx:
        n = random.randint(0, sx - 1)
        while n in seen:
            n = (n + 1) % sx
        seen.add(n)
        p, q = divmod(n, sa)
        yield a[q], b[p]

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = list(_unique_combinations(a, b))
print(u)
# [(3, 'c'), (1, 'c'), (5, 'c'), (2, 'b'), (5, 'b'), (1, 'b'), (2, 'c'), (3, 'd'), (2, 'a'), (3, 'b'), (1, 'd'), (2, 'd'), (1, 'a'), (5, 'd'), (3, 'a'), (5, 'a')]

# confirm everything has been generated
from itertools import product
assert sorted(u) == sorted(product(a, b))

Answer 4

Score: 0

Does this do what you wanted?

You can generate any combination from the list of all possible combinations using an integer index:

def get_combination(x, values):
    # Decode x as a mixed-radix number: the last list in `values`
    # is the least-significant "digit", with base len(items).
    digits = []
    for items in reversed(values):
        base = len(items)
        i = int(x % base)
        digits.append(items[i])
        x = x // base

    digits.reverse()

    return digits

values = [[1, 2, 3, 5], ["a", "b", "c", "d"]]
assert(get_combination(0, values) == [1, 'a'])
assert(get_combination(1, values) == [1, 'b'])
assert(get_combination(15, values) == [5, 'd'])
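
Note that the mixed-radix decoding is not limited to two lists; for example (my own illustration, not from the original answer):

values3 = [[1, 2], ["a", "b"], [True, False]]
assert get_combination(0, values3) == [1, 'a', True]
assert get_combination(7, values3) == [2, 'b', False]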

Therefore you don't need to generate the shuffled list of combinations. I don't think there is any way to iteratively sample the integers in a range without repetition, short of generating the list (as explained in the answers to this question), but at least now you only need to generate an array of integers, which takes less memory:

import numpy as np

rng = np.random.default_rng()

def shuffled_combinations(values):
    counts = [len(x) for x in values]
    n = np.array(counts).prod()
    index = np.arange(n, dtype='uint')
    rng.shuffle(index)
    for i in index:
        yield get_combination(i, values)

for c in shuffled_combinations(values):
    print(c)

Output:

[1, 'd']
[2, 'c']
[3, 'a']
[5, 'a']
[5, 'd']
[5, 'b']
[3, 'c']
[2, 'b']
[1, 'a']
[1, 'b']
[5, 'c']
[1, 'c']
[3, 'b']
[2, 'a']
[2, 'd']
[3, 'd']

Answer 5

Score: 0

A variation of Samwise's, but avoiding the large middle cost to create remaining by spreading its creation over the first half of the process and the cost to randomize it over the second half of the process. The conversion from set to list is relatively fast.

I suspect it's slower than Samwise's overall (and it does use more memory). It just might be better if the delay in the middle is unacceptable.

import random
import itertools

def random_order_product(a, b):
    yielded = set()
    remaining = set()

    for i, pair in enumerate(itertools.product(a, b)):
        if pair not in yielded:
            remaining.add(pair)
        if i % 2:
            pair = random.choice(a), random.choice(b)
            while pair in yielded:
                pair = random.choice(a), random.choice(b)
            yield pair
            yielded.add(pair)
            remaining.discard(pair)

    remaining = list(remaining)
    while remaining:
        i = random.randrange(len(remaining))
        yield remaining[i]
        remaining[i] = remaining[-1]
        remaining.pop()


# Demo showing the frequencies of the 4!=24 possible orders
from collections import Counter
print(sorted(Counter(
    tuple(random_order_product('ab', 'cd'))
    for _ in range(100000)
).values()))

Sample output of the order frequencies (Attempt This Online!):

[4045, 4078, 4107, 4112, 4113, 4113,
 4127, 4131, 4131, 4135, 4136, 4142,
 4149, 4164, 4172, 4186, 4188, 4196,
 4212, 4235, 4245, 4260, 4279, 4344]

Answer 6

Score: -3

@gog:
The code snippet has a limitation in terms of scalability. It is utilizing a set to track generated combinations, so memory usage and performance become problematic as the total number of possible combinations increases.

Answer 7

Score: -3

Kindly, before you engage in writing such a program, let me introduce you to 'Permutations and Combinations'. Let's say you have a list named fruits (fruits = ['apples', 'mangoes', 'grapes']). The number of ways you can arrange the list is called a permutation, expressed mathematically with the factorial (!). Our list contains three items, so the number of possible arrangements is 3!, which equals 6: you have only six possible shuffles. Combinations, on the other hand, are about choosing a certain number of items from a list; for example, the number of two-item combinations from our three-item list is C(3, 2) = 3. In Python, however, I would advise you to use itertools, an amazing module which will make your work easier. For more insight, visit the following link: https://www.digitalocean.com/community/tutorials/permutation-and-combinatios-in-python
