如何使用Python高效生成字符串’AABBBCCCCCDDDDDEEEEE’的所有不重复排列?

huangapple go评论115阅读模式
英文:

How can I efficiently generate all permutations of 'AABBBCCCCCDDDDDEEEEE' without repetitions using Python?

问题

我正在尝试生成序列 'AABBBCCCCCDDDDDEEEEE' 的所有不重复排列。
在Python中有没有高效的方法可以做到这一点?

我使用的代码:

for s in permutations('AABBBCCCCCDDDDDEEEEE'):
df.loc[len(df)] = s
if len(df) % 1000 == 0:
df = df.drop_duplicates(ignore_index=True)


但遗憾的是,这非常慢。
英文:

I am trying to generate all permutations of the sequence 'AABBBCCCCCDDDDDEEEEE' without repetitions.
Is there an efficient way to do this in python?

Code I use:


for s in permutations('AABBBCCCCCDDDDDEEEEE'):
    df.loc[len(df)] = s
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

But this is very slow unfortunately.

答案1

得分: 1

from more_itertools import distinct_permutations
sequence = 'AABBCCDDEE'
def updated_code():
unique_permutations = list(distinct_permutations(sequence))
return unique_permutations

you can use the distinct_permutations method from more_itertools

英文:
from more_itertools import distinct_permutations
sequence = 'AABBCCDDEE'
def updated_code():
     unique_permutations = list(distinct_permutations(sequence))
     return unique_permutations

you can use the distinct_permutations method from more_itertools

答案2

得分: 0

为了去重排列,您可以使用一个 set

for s in set(permutations('AABBBCCCCCDDDDDEEEEE')):
    df.loc[len(df)] = s
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

遍历排列非常昂贵,因为涉及到组合学的计算,我不认为您可以加速这一步骤。在您的循环内部要做什么?可能有一种优化的方法。

英文:

To de-duplicate the permutations you could use a set:

for s in set(permutations('AABBBCCCCCDDDDDEEEEE')):
    df.loc[len(df)] = s
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

Iterating over permutations is very costly because of the combinatorics at play, I don't think you can speed that step up. What are you trying to do inside your loop? There may be a way of optimizing there.

答案3

得分: 0

我的方法不使用特定语言库,也不需要潜在的大内存(例如其他解决方案中的集合或哈希映射):

  1. 用另一个独特的符号替换每个位置。让它们为每个原始字母引入一种顺序(从而引入了严格的偏序)。基本上是 AABB -> A_1 A_2 B_1 B_2
  2. 使用任何算法创建排列。
  3. 对于每个创建的排列,检查是否破坏了偏序。例如,A_1 B_1 A_2 B_2 是可以的,但 A_2 B_1 A_1 B_2 不可以,因为 A_2 不应该在 A_1 之前出现。如果顺序被破坏,丢弃该项目。
  4. 将符号重新转换为原始符号。

[显然,你不会有一个符号 "A_1",所以可以选择任何Unicode符号,并为每个原始字母创建一个映射数组,同时引入顺序。]

现在,如果最终有两个等效的排列,鉴于底层唯一符号,它们必须在其中一个位置上有两个相同的字母,这两个字母来自不同顺序的底层唯一符号,这与顺序没有被破坏的原则相矛盾,因此不可能发生。

再次强调,此解决方案是为了解决集合解决方案可能会对内存产生问题的情况而提出的,您可以直接对排列进行操作,然后在使用下一个排列之前将其丢弃。如果这对您来说不是问题,那么请使用集合解决方案。

英文:

My approach, which does not use language specific libraries and does not require a potentially huge additional memory (like the set or hash map from the other solutions):

  1. Replace every position with another symbol that is unique. Have them induce an order for each original letter (which in turn induces a strict partial order). Basically AABB -> A_1 A_2 B_1 B_2
  2. Use any algorithm to create permutations.
  3. For each created permutation, check whether the partial order is broken. For instance A_1 B_1 A_2 B_2 is fine, but A_2 B_1 A_1 B_2 is not, because A_2 must not come before A_1. If the order is broken, throw the item away.
  4. Retransform the symbols into the original ones.

[Obviously you wouldn't have a symbol "A_1", so instead go for any unicode symbol and have something like an array for each original letter that does the mapping and also induces an order.]

Now if you'd have two equivalent permutations in the end, given the underlying unique symbols, they must have one position in which two same letters were created from a different order of underlying unique symbols, which is a contradiction to the order not being broken -> can't happen.

Again, this solution is meant in case that the set solution is problematic for your memory, and you being able to directly do something with the permutation and then forget it before the next one is used.
If this is not an issue for you, use the set solution.

答案4

得分: 0

以下是您要翻译的部分:

首先,关于这行代码:

df = df.drop_duplicates(ignore_index=True)

这行代码实际上没有什么作用,运行下面的代码将证明这一点:

from itertools import permutations
import pandas as pd

df = pd.DataFrame(columns=['1', '2', '3'])
for s in permutations('AAA'):
df.loc[len(df.index)] = s
if len(df) % 1000 == 0:
df = df.drop_duplicates(ignore_index=True)

print(df)

现在描述您的情境,您正在尝试计算20个字符的排列,这将需要20 * 20!字节的数据,这相当于44,254,229 TB的数据,是的,这将需要很长时间,也许需要几个世纪,而且不会适合您的RAM/硬盘。

这是您可以获得所需答案的方法:

from more_itertools import distinct_permutations
import pandas as pd

df = pd.DataFrame(columns=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20'])
for s in distinct_permutations('AABBBCCCCCDDDDDEEEEE'):
print(s)
df.loc[len(df.index)] = s

print(df)

这将需要很长时间。

英文:

well first off all the line:

df = df.drop_duplicates(ignore_index=True)

does nothing and running the code below will prove that:

from itertools import permutations
import pandas as pd

df = pd.DataFrame(columns=['1', '2', '3'])
for s in permutations('AAA'):
    df.loc[len(df.index)] = s
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

print(df)

now describing your scenario you are trying to calculate permutations of 20 characters which is 20 * 20! bytes of data
that is 44,254,229 TBs of data yes its gonna take a while centuries maybe and nor will fit on your ram/hard dist

this is the way you could get the answer you are looking for:

from more_itertools import distinct_permutations
import pandas as pd

df = pd.DataFrame(columns=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20'])
for s in distinct_permutations('AABBBCCCCCDDDDDEEEEE'):
    print(s)
    df.loc[len(df.index)] = s

print(df)

which will take so much time

答案5

得分: -2

生成不重复序列的所有排列可能是一个计算成本高昂的任务,特别是当序列很长时。在您当前的代码中,您正在使用itertools模块的permutations函数,该函数生成所有可能的排列。然而,这可能会导致大量的排列,从而导致性能较慢。

改进效率的一种可能方法(也是我建议给您的方法)是使用回溯算法来生成无重复排列。下面是一个Python中的示例实现:

def generate_permutations(sequence):
    used = set()

    def backtrack(curr_perm, remaining):
        if not remaining:
            yield curr_perm
        else:
            for i, char in enumerate(remaining):
                if char not in used:
                    used.add(char)
                    yield from backtrack(curr_perm + char, remaining[:i] + remaining[i+1:])
                    used.remove(char)

    yield from backtrack('', sequence)

sequence = 'AABBBCCCCCDDDDDEEEEE'
df = pd.DataFrame(columns=['Permutation'])
for perm in generate_permutations(sequence):
    df.loc[len(df)] = perm
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

在这段代码中,generate_permutations 函数使用递归回溯方法生成所有排列。该函数维护一个已使用字符的集合,以避免重复。它产生每个有效的排列,就像您之前所做的那样,可以在DataFrame df 中收集它们。

通过使用yieldyield from,我们可以动态生成排列,并高效地迭代它们,而无需同时在内存中存储所有排列。这在处理大序列或内存使用是一个考虑因素时特别有用。

通过使用回溯并避免不必要的排列,这个实现应该比预先生成所有排列更高效。但是,请注意,对于长度为n的序列,生成所有不重复排列仍然具有**O(n!)**的时间复杂度,对于较大的n值,这可能变得不切实际。

回溯方法通过使用集合来跟踪已使用的字符并在递归过程中跳过已使用的字符,有助于避免生成重复排列。然而,它并不减少生成所有无重复排列的总体复杂度。

如果您处理的序列长度很大,由于问题的指数级复杂度,可能无法找到所有不重复排列。在这种情况下,您可能需要考虑替代方法,或重新评估问题以找到更高效的解决方案,或以某种方式近似结果。

英文:

Generating all permutations of a sequence without repetitions can be a computationally expensive task, especially when the sequence is long. In your current code, you are using the permutations function from the itertools module, which generates all possible permutations. However, this can result in a large number of permutations, leading to slow performance.

One possible approach (so my approach that I suggest to you) to improve the efficiency is to use a backtracking algorithm to generate permutations without repetitions. Here's an example implementation in Python:

def generate_permutations(sequence):
    used = set()

    def backtrack(curr_perm, remaining):
        if not remaining:
            yield curr_perm
        else:
            for i, char in enumerate(remaining):
                if char not in used:
                    used.add(char)
                    yield from backtrack(curr_perm + char, remaining[:i] + remaining[i+1:])
                    used.remove(char)

    yield from backtrack('', sequence)

sequence = 'AABBBCCCCCDDDDDEEEEE'
df = pd.DataFrame(columns=['Permutation'])
for perm in generate_permutations(sequence):
    df.loc[len(df)] = perm
    if len(df) % 1000 == 0:
        df = df.drop_duplicates(ignore_index=True)

In this code, the generate_permutations function uses a recursive backtracking approach to generate all permutations. The function maintains a set of used characters to avoid repetitions. It yields each valid permutation, which can be collected in the DataFrame df as you were doing before.

By using yield and yield from, we can generate permutations on the fly and iterate over them efficiently without having to store all permutations in memory simultaneously. This is particularly useful when dealing with large sequences or when memory usage is a concern.

By using backtracking and avoiding unnecessary permutations, this implementation should be more efficient compared to generating all permutations upfront. However, note that generating all permutations without repetitions for a sequence of length n still has a time complexity of O(n!), which can become impractical for large values of n.

The backtracking approach helps avoid generating duplicate permutations by using a set to keep track of used characters and skipping already used characters during the recursion. However, it does not reduce the overall complexity of generating all permutations without repetitions.

If you are dealing with a sequence of significant length, finding all permutations without repetitions may not be feasible due to the exponential nature of the problem. In such cases, you may need to consider alternative approaches or reevaluate the problem to find a more efficient solution or approximate the result in some way.

huangapple
  • 本文由 发表于 2023年6月1日 17:12:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380353.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定