Python: randomly generate all unique pairwise combinations of elements between two lists in bounded time

Posted by vngu2lb8 on 2023-06-20 in Python

I have two lists:

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

and I want to generate all possible combinations with a Python generator. I know I could do:

combinations = list(itertools.product(a,b))
random.shuffle(combinations)

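However, this has an extreme memory cost, because I would have to hold every possible combination in memory even if I only want two random unique combinations.
My goal is a Python generator whose memory cost grows with the number of iterations requested from it, reaching the same big-O memory cost as the itertools approach only at the maximum number of iterations.
This is what I have for now: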

import random
from typing import List

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break

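But it is flawed, because it can theoretically run forever if the random.choice lines are unlucky.
I would like to modify this generator so that it produces the indices randomly within a bounded amount of time. Keeping track of the generated indices is fine, since that is a linear increase in memory cost rather than an exponential one.
How can I modify the random index generator so that it is bounded in time?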

Source: https://stackoverflow.com/questions/76501267/randomly-generate-all-unique-pair-wise-combination-of-elements-between-two-list

7 Answers

xqnpmsa81#

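Random guessing is cheap at the start and expensive at the end; exhaustive enumeration is expensive at the start and cheap at the end. Here is a "best of both worlds" approach where we switch strategies at the halfway mark: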

import itertools
import random
from typing import Iterator, TypeVar

_A = TypeVar("_A")
_B = TypeVar("_B")

def unique_product(a: list[_A], b: list[_B]) -> Iterator[tuple[_A, _B]]:
    total = len(a) * len(b)
    results: set[tuple[_A, _B]] = set()

    # For the first half of the results, use random guessing.
    # Until we've exhausted half of the possibility
    # space, we should average < 2 guesses per element, so
    # we can consider this to be amortized O(1) per element.
    result = random.choice(a), random.choice(b)
    while len(results) < total // 2:
        while result in results:
            result = random.choice(a), random.choice(b)
        results.add(result)
        yield result

    # For the second half, build the exhaustive set of
    # remaining results.  We pay an O(n) cost to do this but
    # amortized over the entire generator life it's O(1) per
    # element.  Our space usage goes down after this, not up.
    remaining: list[tuple[_A, _B]] = []
    for result in itertools.product(a, b):
        if result in results:
            results.remove(result)
        else:
            remaining.append(result)
    random.shuffle(remaining)
    while remaining:
        yield remaining.pop()

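The main potential problem with this approach is that you pay one large O(n) cost in the middle. Even though it washes out when you look at the whole run time, for some use cases it may be undesirable to make one arbitrary caller pay the entire cost at once in the middle, rather than paying it up front or spreading it evenly across all callers. (I can think of ways around this, such as doing the switchover in another thread, but that adds a lot of complexity. Maybe there is a better way.)
Note that in terms of space this is close to ideal: usage peaks at the halfway point (half of the elements in memory) and then shrinks to zero, because from then on it is cheaper to track the elements you have not handed out than the ones you have.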
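
To make the "cheap at the start, expensive at the end" trade-off concrete, here is a small simulation, offered as a rough sketch rather than a benchmark. It runs pure rejection sampling over an index space of size n and counts how many random draws each half of the output costs. The name draws_per_half, the fixed seed, and the size 10,000 are illustrative assumptions, not part of the answer above.

```python
import random

def draws_per_half(n: int, seed: int = 0) -> tuple:
    """Count random draws spent on the first and second half of n unique picks."""
    rng = random.Random(seed)
    seen = set()
    first_half = second_half = 0
    while len(seen) < n:
        draws = 0
        # Keep guessing until we hit an index that has not been used yet.
        while True:
            draws += 1
            k = rng.randrange(n)
            if k not in seen:
                seen.add(k)
                break
        # Charge the draws to whichever half this pick belongs to.
        if len(seen) <= n // 2:
            first_half += draws
        else:
            second_half += draws
    return first_half, second_half

first, second = draws_per_half(10_000)
print(f"first half:  {first / 5_000:.2f} draws per element")
print(f"second half: {second / 5_000:.2f} draws per element")
```

The first half should stay under two draws per element, while the second half should cost noticeably more, which is the motivation for switching strategies at the midpoint.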

2023-06-20

cld4siwp2#

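I implemented the algorithm suggested in a Stack Overflow answer. It does what you ask efficiently and extends to any number of dimensions.
We take a prime and one of its primitive roots modulo that prime to create a sequence that visits every number in an interval exactly once. The prime has to be slightly larger than the product len(a)*len(b), so we must skip the values that would otherwise cause an index error.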

import math
import random
from math import gcd

def next_prime(number):
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    # Base case
    if number <= 1:
        return 2

    # if even go back 1
    if number % 2 == 0:
        number -= 1
    while True:
        # only odds
        number += 2
        #only need to check up to and including the sqrt
        max_check = int(math.sqrt(number))+2
        # don't need to check even numbers
        for divider in range(3, max_check, 2):
            # if 'divider' divides 'number', then 'number' is not prime
            if number % divider == 0:
                break
        # if the for loop didn't break, then 'number' is prime
        else:
            return number

def is_primitive_root(a, n):
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True

def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n-1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a



def advance_state(state, close_prime, root):
    # This walks the entire space without repetition
    state = (state * root) % close_prime
    return state

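def sampler(l):
    # With fewer than two indices there is no candidate root to pick,
    # so handle those sizes directly.
    if l <= 1:
        yield from range(l)
        return
    close_prime = next_prime(l)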
    state = root = find_random_primitive_root(close_prime)
    while state > l:
        state = advance_state(state, close_prime, root)
    yield state - 1
    for i in range(l - 1):
        state = advance_state(state, close_prime, root)
        while state > l:
            state = advance_state(state, close_prime, root)
        yield state - 1
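
As a small worked example of why a primitive root walks the whole interval, the sketch below lists the successive powers of a root modulo a prime. The helper name powers_mod and the values 7, 3, and 2 are chosen for illustration only.

```python
def powers_mod(root: int, prime: int) -> list:
    """Return root**1, root**2, ..., root**(prime - 1), each reduced mod prime."""
    state = 1
    out = []
    for _ in range(prime - 1):
        state = (state * root) % prime
        out.append(state)
    return out

# 3 is a primitive root of 7: its powers hit every value in 1..6 exactly once.
cycle = powers_mod(3, 7)
print(cycle)  # [3, 2, 6, 4, 5, 1]

# To sample 5 indices (0..4), drop states above 5 and shift down by one,
# mirroring what sampler() does.
print([s - 1 for s in cycle if s <= 5])  # [2, 1, 3, 4, 0]

# 2 is not a primitive root of 7: its powers cycle early and miss values.
print(powers_mod(2, 7))  # [2, 4, 1, 2, 4, 1]
```

The starting root is the only source of randomness, so the order is fixed once the root is chosen.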

Then we use a 1D -> 2D mapping to "translate" each sequence number into a tuple and yield the result.

def _unique_combinations(a, b):
    cartesian_product_cardinality = len(a) * len(b)
    sequence = sampler(cartesian_product_cardinality)
    for state in sequence:
        yield a[state // len(b)], b[state % len(b)]

from itertools import product

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = _unique_combinations(a, b)

assert sorted(u) == sorted(product(a, b))
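
The 1D -> 2D mapping can be looked at in isolation. The following sketch (index_to_pair is a hypothetical helper name) shows that walking the flat indices in order reproduces the ordering of itertools.product, so any permutation of the indices gives a permutation of the pairs.

```python
a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

def index_to_pair(k, a, b):
    """Map a flat index in range(len(a) * len(b)) to one (a_element, b_element) pair."""
    row, col = divmod(k, len(b))  # row selects from a, col selects from b
    return a[row], b[col]

print(index_to_pair(0, a, b))   # (1, 'a')
print(index_to_pair(6, a, b))   # (2, 'c')
print(index_to_pair(15, a, b))  # (5, 'd')
```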

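I started benchmarking the various approaches. For combining two lists of length 1000, @gog's divmod solution already performs poorly, so I excluded it from further tests: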

kelly took 0.9156949520111084 seconds
divmod took 41.20149779319763 seconds
prime_roots took 0.5146901607513428 seconds
samwise took 0.698538064956665 seconds
fisher_yates took 0.902874231338501 seconds

For the remaining algorithms, I ran the following benchmark:

import pandas as pd
import timeit
import random
from itertools import combinations
from math import gcd

# `approaches` maps a label to one of the generator functions being compared;
# its definition is not shown in this answer.

# Define the list lengths to benchmark
list_lengths = [10,20,30,100,300,500,1000,1500,2000,3000,5000]

num_repetitions = 2

# Collect one row per measurement; DataFrame.append was removed in pandas 2.0.
rows = []

for approach, function in approaches.items():
    for length in list_lengths:
        a = list(range(length))
        b = list(range(length))

        execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)

        rows.append({
            'Approach': approach,
            'List Length': length,
            'Execution Time': execution_time / num_repetitions
        })

results_df = pd.DataFrame(rows, columns=['Approach', 'List Length', 'Execution Time'])

I learned a lot by trying to find a solution to this problem :)

2023-06-20

qyuhtwio3#

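Represent each combination by the number (position_in_b * len_a) + position_in_a, which is the encoding that the divmod call in the code decodes. Keep generating these numbers at random, and whenever a number has already been hit, simply increment it mod len_a * len_b until a free one is found: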

import random

def _unique_combinations(a, b):
    sa = len(a)
    sb = len(b)
    sx = sa * sb
    seen = set()
    while len(seen) < sx:
        n = random.randint(0, sx - 1)
        while n in seen:
            n = (n + 1) % sx
        seen.add(n)
        p, q = divmod(n, sa)
        yield a[q], b[p]

##

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
   
u = list(_unique_combinations(a, b))

print(u)

# confirm everything has been generated

from itertools import product
assert sorted(u) == sorted(product(a, b))

# [(3, 'c'), (1, 'c'), (5, 'c'), (2, 'b'), (5, 'b'), (1, 'b'), (2, 'c'), (3, 'd'), (2, 'a'), (3, 'b'), (1, 'd'), (2, 'd'), (1, 'a'), (5, 'd'), (3, 'a'), (5, 'a')]

2023-06-20

ua4mk5z44#

Is this what you want?
You can use an integer index to generate any single combination from the list of all possible combinations:

def get_combination(x, values):

    digits = []
    for items in reversed(values):
        base = len(items)
        i = int(x % base)
        digits.append(items[i])
        x = x // base

    digits.reverse()

    return digits

values = [[1, 2, 3, 5], ["a", "b", "c", "d"]]
assert(get_combination(0, values) == [1, 'a'])
assert(get_combination(1, values) == [1, 'b'])
assert(get_combination(15, values) == [5, 'd'])

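So you do not need to generate a shuffled list of combinations. I do not think there is any way to iteratively sample integers from a range without repetition without generating a list (as explained in the answers to this question), but at least now you only need to generate an array of integers, which takes much less memory: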

import numpy as np

rng = np.random.default_rng()

def shuffled_combinations(values):
    counts = [len(x) for x in values]
    n = np.array(counts).prod()
    index = np.arange(n, dtype='uint')
    rng.shuffle(index)
    for i in index:
        # Convert the NumPy integer to a plain int so % and // stay exact
        yield get_combination(int(i), values)

for c in shuffled_combinations(values):
    print(c)

Output:

[1, 'd']
[2, 'c']
[3, 'a']
[5, 'a']
[5, 'd']
[5, 'b']
[3, 'c']
[2, 'b']
[1, 'a']
[1, 'b']
[5, 'c']
[1, 'c']
[3, 'b']
[2, 'a']
[2, 'd']
[3, 'd']
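
When the number of pairs wanted is known up front and is small, the standard library offers a way to avoid the index array altogether: random.sample accepts a range object and draws unique integers from it without building the full population. The sketch below combines that with the index decoding shown above. The helper name sample_pairs is hypothetical, and this returns a fixed-size batch rather than an open-ended generator.

```python
import random

def sample_pairs(a, b, k):
    """Return k distinct (a_element, b_element) pairs in random order."""
    n = len(a) * len(b)
    # random.sample accepts a range, so the n indices are never materialized.
    picks = random.sample(range(n), k)
    return [(a[i // len(b)], b[i % len(b)]) for i in picks]

a = list(range(100_000))
b = list(range(100_000))
print(sample_pairs(a, b, 2))  # two distinct pairs out of 10**10 possibilities
```

This covers the "I only want two random unique combinations" case from the question, but it cannot be extended later without risking repeats across batches.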

2023-06-20

vs91vp4v5#

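A variation of Samwise's answer, but it avoids the large cost in the middle for creating remaining by spreading its creation over the first half of the process and its randomization over the second half. The conversion from set to list is comparatively quick.
I suspect it is slower overall than Samwise's (and it does use more memory). It is only better if the delay in the middle is unacceptable.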

import random
import itertools

def random_order_product(a, b):
    yielded = set()
    remaining = set()

    for i, pair in enumerate(itertools.product(a, b)):
        if pair not in yielded:
            remaining.add(pair)
        if i % 2:
            pair = random.choice(a), random.choice(b)
            while pair in yielded:
                pair = random.choice(a), random.choice(b)
            yield pair
            yielded.add(pair)
            remaining.discard(pair)

    remaining = list(remaining)
    while remaining:
        i = random.randrange(len(remaining))
        yield remaining[i]
        remaining[i] = remaining[-1]
        remaining.pop()

# Demo showing the frequencies of the 4!=24 possible orders
from collections import Counter
print(sorted(Counter(
    tuple(random_order_product('ab', 'cd'))
    for _ in range(100000)
).values()))

Sample output of the order frequencies (Attempt This Online!):

[4045, 4078, 4107, 4112, 4113, 4113,
 4127, 4131, 4131, 4135, 4136, 4142,
 4149, 4164, 4172, 4186, 4188, 4196,
 4212, 4235, 4245, 4260, 4279, 4344]
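
The swap-with-the-last-item-then-pop step in the second half is one step of a Fisher-Yates shuffle. As a sketch of how the same idea might be applied to the question's memory goal, the shuffle can be run lazily over flat indices, recording only the slots that have been swapped in a dict. This is an illustration under that assumption, not something the answer itself proposes, and lazy_shuffled_product is a hypothetical name.

```python
import random

def lazy_shuffled_product(a, b):
    """Yield every (a_element, b_element) pair once, in random order."""
    n = len(a) * len(b)
    # Sparse view of a virtual array [0, 1, ..., n - 1]: only slots whose
    # value differs from their own index are stored.
    moved = {}
    for remaining in range(n, 0, -1):
        i = random.randrange(remaining)   # pick a slot among the unused ones
        last = remaining - 1
        k = moved.get(i, i)               # value currently in slot i
        moved[i] = moved.get(last, last)  # move the last value into slot i
        moved.pop(last, None)             # the last slot is now out of range
        row, col = divmod(k, len(b))
        yield a[row], b[col]

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
print(list(lazy_shuffled_product(a, b)))
```

Each step does a constant number of dict operations, so there is no retry loop, and the dict holds at most one entry per pair yielded so far.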

2023-06-20

cnjp1d6j6#

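Before you write such a program, allow me to introduce permutations and combinations. Suppose you have a list of fruits (fruits = ['apples', 'mangoes', 'grapes']). The number of ways the list can be ordered is its number of permutations, which is written mathematically with a factorial (!). Our list contains three items, so it can be ordered in 3! ways, which equals 6. That means there are only six possible shuffles. A combination, on the other hand, is a selection of a given number of items from the list in which order does not matter. Say you want the number of two-item combinations from our list. Mathematically this is written C(3, 2) (also 3C2), where 3 is the total number of items and 2 is the number chosen, and it gives 3. In Python, I suggest using itertools. It is an excellent module that will make your work easier. For more insight, see the following link: https://www.digitalocean.com/community/tutorials/permutation-and-combinatios-in-python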
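
The two counts mentioned above can be checked directly with the standard library. This is a minimal sketch using the fruit list from this answer; math.comb requires Python 3.8 or later.

```python
import itertools
import math

fruits = ["apples", "mangoes", "grapes"]

# Permutations: every ordering of the three items.
print(math.factorial(len(fruits)))                # 6
print(len(list(itertools.permutations(fruits))))  # 6

# Combinations: unordered selections of two items out of three.
print(math.comb(len(fruits), 2))                  # 3
print(list(itertools.combinations(fruits, 2)))
```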

2023-06-20

vfwfrxfs7#

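@gog: The code snippet has limited scalability. It uses a set to keep track of the generated combinations, so memory usage and performance become problematic as the total number of possible combinations grows.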
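
One way to look at the performance side of this concern is to count how often the increment step runs. The sketch below reproduces only the index-picking loop of that snippet and tallies the increments. The function name probe_steps, the seed, and the sizes are illustrative assumptions.

```python
import random

def probe_steps(n: int, seed: int = 0) -> int:
    """Total number of `(k + 1) % n` increments needed to emit all n indices."""
    rng = random.Random(seed)
    seen = set()
    steps = 0
    while len(seen) < n:
        k = rng.randint(0, n - 1)
        # Walk forward until a free index is found, counting each step.
        while k in seen:
            k = (k + 1) % n
            steps += 1
        seen.add(k)
    return steps

for n in (1_000, 4_000, 16_000):
    print(n, probe_steps(n) / n)  # average increments per emitted index
```

If the average per emitted index rises with n, the total work grows faster than linearly, which would be consistent with the divmod timing in the benchmark above.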

2023-06-20