从值和阈值高效创建一系列集合

huangapple go评论66阅读模式
英文:

Efficient creation of a sequence of sets from values and thresholds

问题

以下是您要求的翻译内容:

给定一个按升序排列的较短阈值序列和众多值(无序)。

期望的结果是一个包含一系列 set 的序列,第一个包含所有低于最低/第一个阈值的不同值;接下来的值不低于最低阈值,但低于第二个阈值(如果有的话);以此类推,直到最后一个阈值;最后包含所有不低于最高阈值的值。

关于 dict 有类似的问题(对于 有帮助的 解决方案的指导也欢迎),建议如下:

from itertools import pairwise

def partition(values, thresholds):
    """ 将值划分为一系列集合,这些集合在由阈值指定的右开区间中包含值。"""
    return [ { v for v in values if v < thresholds[0] }
       ] + [ { v for v in values if lo <= v < hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] <= v } ]

这个代码会对 values 迭代 len(thresholds)+1 次。

如何高效地创建一个根据 阈值values 进行划分的 set 序列?

我未能找到对 SciPy/NumPy 有帮助的东西。使用 numpy.digitize() 来索引一个 add() 成员的数组与非平凡值 阈值的 partition_Kelly3b() 在同一领域内。

英文:

Given a shortish sequence of thresholds ordered ascendantly and numerous values (unordered).

Wanted result is a sequence of sets, first one containing all distinct values below the lowest/first threshold; next values not below lowest threshold, but below 2nd threshold, if any; and so on 'til last threshold; finally all values not below highest threshold.

There are similar questions about dicts (pointers to helpful solutions there welcome, too),
suggestions amounting to

from itertools import pairwise

def partition(values, thresholds):
    """ Partition values into a list of sets 
        with values in right-open intervals specified by thresholds.
    """
    return [ { v for v in values if v < thresholds[0] }
       ] + [ { v for v in values if lo <= v < hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] <= v } ]

This "iterates" values len(thresholds)+1 times.

How to efficiently create a sequence of sets partitioning values according to thresholds?

I failed to find something helpful SciPy/NumPy.
Using numpy.digitize() to index an array of add() members was in the ball park of partition_Kelly3b() for non-trivial values and thresholds.

答案1

得分: 2

各种解决方案的测试结果:

10000个数和3个阈值:
  0.83 ± 0.00毫秒 partition_Kelly4
  0.83 ± 0.00毫秒 partition_Kelly4c
  0.83 ± 0.01毫秒 partition_Kelly4a
  1.42 ± 0.02毫秒 partition_Kelly3b
  1.55 ± 0.02毫秒 partition_Kelly3
  1.76 ± 0.02毫秒 partition_Kelly2
  1.93 ± 0.00毫秒 partition_Kelly4b
  2.04 ± 0.01毫秒 partition_Kelly
  2.55 ± 0.03毫秒 partition_original

10000个数和10个阈值:
  0.86 ± 0.01毫秒 partition_Kelly4a
  0.87 ± 0.01毫秒 partition_Kelly4c
  0.88 ± 0.02毫秒 partition_Kelly4
  1.98 ± 0.03毫秒 partition_Kelly4b
  2.03 ± 0.03毫秒 partition_Kelly3b
  2.06 ± 0.05毫秒 partition_Kelly2
  2.22 ± 0.05毫秒 partition_Kelly3
  2.52 ± 0.02毫秒 partition_Kelly
  6.19 ± 0.19毫秒 partition_original

10000个数和100个阈值:
  0.94 ± 0.02毫秒 partition_Kelly4a
  0.97 ± 0.02毫秒 partition_Kelly4
  0.99 ± 0.02毫秒 partition_Kelly4c
  2.05 ± 0.03毫秒 partition_Kelly4b
  2.62 ± 0.17毫秒 partition_Kelly2
  3.41 ± 0.05毫秒 partition_Kelly
  3.58 ± 0.33毫秒 partition_Kelly3b
  3.91 ± 0.25毫秒 partition_Kelly3
 60.49 ± 10.98毫秒 partition_original

错误:
  partition_dankal444
英文:

Test results of various solutions:

10000 values and 3 thresholds:
  0.83 ± 0.00 ms  partition_Kelly4
  0.83 ± 0.00 ms  partition_Kelly4c
  0.83 ± 0.01 ms  partition_Kelly4a
  1.42 ± 0.02 ms  partition_Kelly3b
  1.55 ± 0.02 ms  partition_Kelly3
  1.76 ± 0.02 ms  partition_Kelly2
  1.93 ± 0.00 ms  partition_Kelly4b
  2.04 ± 0.01 ms  partition_Kelly
  2.55 ± 0.03 ms  partition_original

10000 values and 10 thresholds:
  0.86 ± 0.01 ms  partition_Kelly4a
  0.87 ± 0.01 ms  partition_Kelly4c
  0.88 ± 0.02 ms  partition_Kelly4
  1.98 ± 0.03 ms  partition_Kelly4b
  2.03 ± 0.03 ms  partition_Kelly3b
  2.06 ± 0.05 ms  partition_Kelly2
  2.22 ± 0.05 ms  partition_Kelly3
  2.52 ± 0.02 ms  partition_Kelly
  6.19 ± 0.19 ms  partition_original

10000 values and 100 thresholds:
  0.94 ± 0.02 ms  partition_Kelly4a
  0.97 ± 0.02 ms  partition_Kelly4
  0.99 ± 0.02 ms  partition_Kelly4c
  2.05 ± 0.03 ms  partition_Kelly4b
  2.62 ± 0.17 ms  partition_Kelly2
  3.41 ± 0.05 ms  partition_Kelly
  3.58 ± 0.33 ms  partition_Kelly3b
  3.91 ± 0.25 ms  partition_Kelly3
 60.49 ± 10.98 ms  partition_original

Wrong:
  partition_dankal444

Code (Attempt This Online!):

from itertools import pairwise
import random
from bisect import bisect, bisect_left
from time import perf_counter as time
from statistics import mean, stdev


def partition_original(values, thresholds):
    """ Partition values into a list of sets 
        with values in right-open intervals specified by thresholds.
    """
    return [ { v for v in values if v < thresholds[0] }
       ] + [ { v for v in values if lo <= v < hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] <= v } ]


def partition_Kelly(values, thresholds):
    res = [set() for _ in thresholds]
    res.append(set())
    for v in values:
        i = bisect(thresholds, v)
        res[i].add(v)
    return res


def partition_Kelly2(values, thresholds):
    res = [set() for _ in thresholds]
    res.append(set())
    for v in set(values):
        i = bisect(thresholds, v)
        res[i].add(v)
    return res


def partition_Kelly3(values, thresholds):
    def partition(values, thresholds):
        if not thresholds:
            return [set(values)]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        hi = []
        lo = [x for x in values if x < threshold or hi.append(x)]
        return partition(lo, thresholds[:i]) + partition(hi, thresholds[i+1:])
    return partition(set(values), thresholds)


def partition_Kelly3b(values, thresholds):
    def partition(values, thresholds):
        if not thresholds:
            return [set(values)]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        hi = [].append
        lo = [x for x in values if x < threshold or hi(x)]
        return partition(lo, thresholds[:i]) + partition(hi.__self__, thresholds[i+1:])
    return partition(set(values), thresholds)


def partition_Kelly4(values, thresholds):
    values = sorted(set(values))
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res


def partition_Kelly4a(values, thresholds):
    values = sorted(set(values))
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold, i)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res


def partition_Kelly4b(values, thresholds):
    values = sorted(values)
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res


def partition_Kelly4c(values, thresholds):
    def partition(start, stop, thresholds):
        if not thresholds:
            return [set(values[start:stop])]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        j = bisect_left(values, threshold, start, stop)
        return partition(start, j, thresholds[:i]) + partition(j, stop, thresholds[i+1:])
    values = sorted(set(values))
    return partition(0, len(values), thresholds)


def partition_dankal444(values, thresholds):
    import bisect
    sets = [set() for _ in range(len(thresholds) + 1)]

    for value in values:
        set_idx = bisect.bisect_left(thresholds, value)
        sets[set_idx].add(value)
    return sets


funcs = [partition_original, partition_Kelly, partition_Kelly2, partition_Kelly3, partition_Kelly3b, partition_Kelly4, partition_Kelly4a, partition_Kelly4b, partition_Kelly4c, partition_dankal444]
wrong = set()

def test(n, k, repeat):
    print(n, 'values and', k, 'thresholds:')
    t0 = time()

    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e3 for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '

    for _ in range(repeat):
        values = random.choices(range(n), k=n)
        thresholds = [int((i+.5)/k * n) for i in range(k)]
        expect = none = object()
        for f in funcs:
            t = time()
            result = f(values, thresholds)
            times[f].append(time() - t)
            if expect is none:
                expect = result
            elif result != expect:
                wrong.add(f)
        funcs[:] = [f for f in funcs if f not in wrong]

    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)

    print()#time() - t0)

test(10**4, 3, 100)
test(10**4, 10, 80)
test(10**4, 10**2, 25)
print('Wrong:')
for f in wrong:
    print(' ', f.__name__)

答案2

得分: -1

使用 bisect.bisect_left 函数(文档

import bisect
sets = [set() for _ in range(len(thresholds) + 1)]

for value in values:
    set_idx = bisect.bisect_left(thresholds, value)
    sets[set_idx].add(value)

请注意,对于某些值/阈值大小,可能会更慢。随着大小的增加,性能应该比 OP 方法更好(O(nlogn) 对比 O(n^2)

英文:

Using bisect.bisect_left function (docs)

import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
set_idx = bisect.bisect_left(thresholds, value)
sets[set_idx].add(value)

Note it might be slower for some sizes of values/thresholds. Larger the sizes performance should get better compared to OP method (O(nlogn) vs O(n^2))

huangapple
  • 本文由 发表于 2023年5月14日 19:12:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76247167.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定