How to pick random non-duplicating samples containing one element from each group?

Question

Below is my list, call it my_list. It can contain up to 1 million items.

[
	['N1', 'C1'], 
	['N2', 'C2'], 
	['N3', 'C1'], 
	['N4', 'C1'], 
	['N5', 'C1'], 
	['N6', 'C2'], 
	['N7', 'C1']
]

I want to pick two samples from this list, where the first one has C1 and the second has C2. I will be picking the two samples 80 million times. In my_list, the first item of each entry (the one starting with N) is always unique. The output pairs must not repeat: every pair I get should be unique.

The output could be:

N1 and N2

OR

N2 and N1

OR

N6 and N7

I've used random.sample() with a list before, but in this case I'm not sure how to apply the condition, since each list element is itself a pair of values separated by a comma.

Answer 1

Score: 2

If you need to do this multiple times for the same list, you could separate the list into groups and then select one from each group:

import random

# This needs to be done once per my_list
groups = {}

for n, c in my_list:
    if c not in groups:
        groups[c] = []
    groups[c].append(n)

# this selects random elements from each group
result = tuple(random.choice(grp) for grp in groups.values())

For example, running the selection code 10 times gives:

for _ in range(10):
    # this selects random elements from each group
    result = tuple(random.choice(grp) for grp in groups.values())

    print(result)

('N5', 'N2')
('N7', 'N6')
('N5', 'N6')
('N7', 'N6')
('N4', 'N2')
('N5', 'N2')
('N1', 'N2')
('N7', 'N2')
('N1', 'N2')
('N1', 'N6')

This code can be put in a class to make it easier to use:

class GroupedSelect:
    def __init__(self, my_list):
        self.groups = {}
        for n, c in my_list:
            if c not in self.groups:
                self.groups[c] = []
            self.groups[c].append(n)

    def select(self):
        return tuple(random.choice(grp) for grp in self.groups.values())


my_selector = GroupedSelect(my_list)
for _ in range(10):
    print(my_selector.select())

To prevent duplicates, you need to keep track of which combinations have already been selected. One way to do this is to work out how many combinations the product of all groups contains, and then return those combinations one at a time in random order.
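
To make "the product of all groups" concrete, here is a small sketch that enumerates every possible pair for the sample list from the question. The groups dict is written out by hand here (an assumption, it mirrors what the grouping loop above would build), and math.prod needs Python 3.8 or newer. Listing everything like this is only practical for small inputs.

```python
import itertools
import math

# Groups for the question's sample list, written out by hand for illustration
groups = {
    "C1": ["N1", "N3", "N4", "N5", "N7"],
    "C2": ["N2", "N6"],
}

# Number of combinations = product of the group sizes
n_combinations = math.prod(len(grp) for grp in groups.values())

# Every combination, one element per group, in a fixed (non-random) order
all_pairs = list(itertools.product(*groups.values()))

print(n_combinations)  # 10
print(all_pairs[:3])   # [('N1', 'N2'), ('N1', 'N6'), ('N3', 'N2')]
```

Returning these 10 pairs in a shuffled order, each exactly once, is what the class below does without ever building the pairs themselves.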

The GroupedSelect object then works as follows:

  • Figure out how many combinations we will have. This is simply the product of the lengths of all groups.
  • Create a list of indices of that length and shuffle it. For example, if we have 5 objects in group 1 and 2 in group 2, we have 10 combinations, so self._sel_indices is a shuffled list(range(10)).
  • Every time select() is called, we pop the last element of this list. Since popping the last element of a list is O(1), this is not an expensive operation.
    • If no more indices exist, we can either reshuffle or raise an error.
  • The popped index is converted to the individual coordinates in the groups. The logic for this is identical to that of numpy.unravel_index.
  • Return the element at these coordinates.

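
The index-to-coordinates step in the list above is the least obvious part, so here is a standalone sketch of it. The function name index_to_coords and the hand-written c1 and c2 lists are only for illustration; the arithmetic is the same repeated divmod that the class uses.

```python
def index_to_coords(index, shape):
    """Convert a flat index into one coordinate per group (row-major order)."""
    coords = []
    # Peel off the last group's coordinate first, then work backwards
    for size in reversed(shape):
        index, coord = divmod(index, size)
        coords.append(coord)
    return tuple(reversed(coords))

c1 = ["N1", "N3", "N4", "N5", "N7"]
c2 = ["N2", "N6"]
shape = (len(c1), len(c2))  # (5, 2), so flat indices run from 0 to 9

i, j = index_to_coords(7, shape)
print((i, j))          # (3, 1)
print((c1[i], c2[j]))  # ('N5', 'N6')
```

Each flat index from 0 to 9 maps to a different (i, j), which is why popping unique indices yields unique pairs.
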
import random
from functools import reduce

class GroupedSelect:
    def __init__(self, my_list, raise_error=False):
        self.groups = {}
        self._raise_error = raise_error
        for n, c in my_list:
            if c not in self.groups:
                self.groups[c] = []
            self.groups[c].append(n)

        self._shape = tuple(len(grp) for grp in self.groups.values())
        self._reshuffle()

    def _reshuffle(self):
        n_indices = reduce(lambda x, y: x*y, self._shape)
        
        # Create a list of all possible indices and shuffle it
        self._sel_indices = list(range(n_indices))
        random.shuffle(self._sel_indices)
        

    def select(self):
        try:
            # Get the next index to select
            index = self._sel_indices.pop()
        except IndexError as ex:
            # If we've selected all possible indices, reshuffle everything and start over
            # or raise an error if needed
            if self._raise_error:
                raise RuntimeError("No more combinations to select") from ex
            self._reshuffle()
            index = self._sel_indices.pop()

        # Convert index to coordinates in each group
        coords = self._index_to_coords(index)
        
        # Select the correct value from each group and return it
        return tuple(grp[coord] for grp, coord in zip(self.groups.values(), coords))
     
    def _index_to_coords(self, index):
        # This function is a pure-python implementation of numpy.unravel_index
        coords = []
        for ss in reversed(self._shape):
            index, cc = divmod(index, ss)
            coords.append(cc)
        return coords[::-1]

To use this class such that it will throw an error after it runs out of combinations, do:

my_selector = GroupedSelect(my_list, True)
for _ in range(12):
    print(my_selector.select())

which gives:

('N3', 'N2')
('N5', 'N6')
('N4', 'N6')
('N1', 'N2')
('N4', 'N2')
('N5', 'N2')
('N7', 'N6')
('N3', 'N6')
('N7', 'N2')
('N1', 'N6')
Traceback (most recent call last):

  File "<file>", line 35, in select
    index = self._sel_indices.pop()

IndexError: pop from empty list

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<file>", line 70, in <module>
    print(my_selector.select())

  File "<file>", line 40, in select
    raise RuntimeError("No more combinations to select") from ex
RuntimeError: No more combinations to select

You can catch a RuntimeError to prevent this traceback, or initialize the object with raise_error=False to re-shuffle and keep returning combinations, although these will be duplicates (but not in the same order as before):

my_selector = GroupedSelect(my_list, False)
for _ in range(12):
    print(my_selector.select())

outputs:

('N3', 'N2')
('N5', 'N6')
('N7', 'N2')
('N7', 'N6')
('N5', 'N2')
('N1', 'N2')
('N4', 'N6')
('N3', 'N6')
('N1', 'N6')
('N4', 'N2')
('N7', 'N6')
('N3', 'N6')
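
One caveat for the sizes mentioned in the question: the class above builds a list of every possible index, and with up to 1 million items the product of the two group sizes can be far too large to hold in memory. A possible alternative, which is not part of the answer above, is to let random.sample draw distinct flat indices directly from a range, since a range is never expanded into a list. The function name sample_unique_pairs and the two-list layout are assumptions made for this sketch.

```python
import random

def sample_unique_pairs(c1_list, c2_list, k):
    """Return k distinct (C1, C2) pairs chosen uniformly at random."""
    total = len(c1_list) * len(c2_list)
    if k > total:
        raise ValueError(f"only {total} distinct pairs exist, {k} requested")
    # random.sample accepts a range, so the full population is never built;
    # the k indices it returns are guaranteed to be distinct.
    flat_indices = random.sample(range(total), k)
    pairs = []
    for flat in flat_indices:
        # Same unravelling idea as above, specialised to two groups
        i, j = divmod(flat, len(c2_list))
        pairs.append((c1_list[i], c2_list[j]))
    return pairs

c1_list = ["N1", "N3", "N4", "N5", "N7"]
c2_list = ["N2", "N6"]
pairs = sample_unique_pairs(c1_list, c2_list, 6)
print(pairs)
```

This draws all k pairs in one call, so the k results themselves still have to fit in memory; calling it several times would not guarantee uniqueness across calls.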

Answer 2

Score: 1

Use random.choice:

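import random

def pick_samples(all_choices):
    a = random.choice(all_choices)
    # Keep drawing until the second pick comes from the other group
    while True:
        b = random.choice(all_choices)
        if b[1] != a[1]:
            break
    # Return the C1 element first, as the question asks
    return (a[0], b[0]) if a[1] == 'C1' else (b[0], a[0])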

If you're able, you can simplify by breaking the choices into two groups and choosing one from each:

c1_list = [c[0] for c in all_choices if c[1] == 'C1']
c2_list = [c[0] for c in all_choices if c[1] == 'C2']

def pick_samples(c1_list, c2_list):
    return random.choice(c1_list), random.choice(c2_list)
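
Note that pick_samples can return the same pair more than once. If pairs are needed one at a time and must never repeat, one option is to remember what has already been returned and retry on a repeat. The generator below is only a sketch of that idea; the name unique_pairs is made up here, and it assumes the names within each list are unique, as the question states.

```python
import random

def unique_pairs(c1_list, c2_list):
    """Yield random (C1, C2) pairs without repeats until all have been used."""
    total = len(c1_list) * len(c2_list)
    seen = set()
    while len(seen) < total:
        pair = (random.choice(c1_list), random.choice(c2_list))
        # Skip pairs that were already returned and draw again
        if pair not in seen:
            seen.add(pair)
            yield pair

c1_list = ["N1", "N3", "N4", "N5", "N7"]
c2_list = ["N2", "N6"]
for pair in unique_pairs(c1_list, c2_list):
    print(pair)
```

Retries are rare while only a small share of all pairs has been used, but they become frequent as the pairs run out, so this suits cases where far fewer pairs are requested than exist.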

huangapple
  • Posted on February 24, 2023 at 01:25:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75548269.html