2023年5月14日 19:12:40go评论110阅读模式

英文:

Efficient creation of a sequence of sets from values and thresholds

问题

以下是您要求的翻译内容：

给定一个按升序排列的较短阈值序列和众多值（无序）。

期望的结果是一个包含一系列 set 的序列，第一个包含所有低于最低/第一个阈值的不同值；接下来的值不低于最低阈值，但低于第二个阈值（如果有的话）；以此类推，直到最后一个阈值；最后包含所有不低于最高阈值的值。

关于 dict 有类似的问题（对于 有帮助的 解决方案的指导也欢迎），建议如下：

from itertools import pairwise
def partition(values, thresholds):
    &quot;&quot;&quot; 将值划分为一系列集合，这些集合在由阈值指定的右开区间中包含值。&quot;&quot;&quot;
    return [ { v for v in values if v &lt; thresholds[0] }
       ] + [ { v for v in values if lo &lt;= v &lt; hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] &lt;= v } ]

这个代码会对 values 迭代 len(thresholds)+1 次。

如何高效地创建一个根据阈值对 values 进行划分的 set 序列？

我未能找到对 SciPy/NumPy 有帮助的东西。使用 numpy.digitize() 来索引一个 add() 成员的数组与非平凡值和阈值的 partition_Kelly3b() 在同一领域内。

英文:

Given a shortish sequence of thresholds ordered ascendantly and numerous values (unordered).

Wanted result is a sequence of sets, first one containing all distinct values below the lowest/first threshold; next values not below lowest threshold, but below 2nd threshold, if any; and so on 'til last threshold; finally all values not below highest threshold.

There are similar questions about dicts (pointers to helpful solutions there welcome, too),
suggestions amounting to

from itertools import pairwise
def partition(values, thresholds):
    &quot;&quot;&quot; Partition values into a list of sets 
        with values in right-open intervals specified by thresholds.
    &quot;&quot;&quot;
    return [ { v for v in values if v &lt; thresholds[0] }
       ] + [ { v for v in values if lo &lt;= v &lt; hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] &lt;= v } ]

This "iterates" values len(thresholds)+1 times.

How to efficiently create a sequence of sets partitioning values according to thresholds?

I failed to find something helpful SciPy/NumPy.
Using numpy.digitize() to index an array of add() members was in the ball park of partition_Kelly3b() for non-trivial values and thresholds.

答案1

得分: 2

各种解决方案的测试结果：

10000个数和3个阈值：
  0.83 ± 0.00毫秒 partition_Kelly4
  0.83 ± 0.00毫秒 partition_Kelly4c
  0.83 ± 0.01毫秒 partition_Kelly4a
  1.42 ± 0.02毫秒 partition_Kelly3b
  1.55 ± 0.02毫秒 partition_Kelly3
  1.76 ± 0.02毫秒 partition_Kelly2
  1.93 ± 0.00毫秒 partition_Kelly4b
  2.04 ± 0.01毫秒 partition_Kelly
  2.55 ± 0.03毫秒 partition_original
10000个数和10个阈值：
  0.86 ± 0.01毫秒 partition_Kelly4a
  0.87 ± 0.01毫秒 partition_Kelly4c
  0.88 ± 0.02毫秒 partition_Kelly4
  1.98 ± 0.03毫秒 partition_Kelly4b
  2.03 ± 0.03毫秒 partition_Kelly3b
  2.06 ± 0.05毫秒 partition_Kelly2
  2.22 ± 0.05毫秒 partition_Kelly3
  2.52 ± 0.02毫秒 partition_Kelly
  6.19 ± 0.19毫秒 partition_original
10000个数和100个阈值：
  0.94 ± 0.02毫秒 partition_Kelly4a
  0.97 ± 0.02毫秒 partition_Kelly4
  0.99 ± 0.02毫秒 partition_Kelly4c
  2.05 ± 0.03毫秒 partition_Kelly4b
  2.62 ± 0.17毫秒 partition_Kelly2
  3.41 ± 0.05毫秒 partition_Kelly
  3.58 ± 0.33毫秒 partition_Kelly3b
  3.91 ± 0.25毫秒 partition_Kelly3
 60.49 ± 10.98毫秒 partition_original
错误：
  partition_dankal444

英文:

Test results of various solutions:

10000 values and 3 thresholds:
  0.83 &#177; 0.00 ms  partition_Kelly4
  0.83 &#177; 0.00 ms  partition_Kelly4c
  0.83 &#177; 0.01 ms  partition_Kelly4a
  1.42 &#177; 0.02 ms  partition_Kelly3b
  1.55 &#177; 0.02 ms  partition_Kelly3
  1.76 &#177; 0.02 ms  partition_Kelly2
  1.93 &#177; 0.00 ms  partition_Kelly4b
  2.04 &#177; 0.01 ms  partition_Kelly
  2.55 &#177; 0.03 ms  partition_original
10000 values and 10 thresholds:
  0.86 &#177; 0.01 ms  partition_Kelly4a
  0.87 &#177; 0.01 ms  partition_Kelly4c
  0.88 &#177; 0.02 ms  partition_Kelly4
  1.98 &#177; 0.03 ms  partition_Kelly4b
  2.03 &#177; 0.03 ms  partition_Kelly3b
  2.06 &#177; 0.05 ms  partition_Kelly2
  2.22 &#177; 0.05 ms  partition_Kelly3
  2.52 &#177; 0.02 ms  partition_Kelly
  6.19 &#177; 0.19 ms  partition_original
10000 values and 100 thresholds:
  0.94 &#177; 0.02 ms  partition_Kelly4a
  0.97 &#177; 0.02 ms  partition_Kelly4
  0.99 &#177; 0.02 ms  partition_Kelly4c
  2.05 &#177; 0.03 ms  partition_Kelly4b
  2.62 &#177; 0.17 ms  partition_Kelly2
  3.41 &#177; 0.05 ms  partition_Kelly
  3.58 &#177; 0.33 ms  partition_Kelly3b
  3.91 &#177; 0.25 ms  partition_Kelly3
 60.49 &#177; 10.98 ms  partition_original
Wrong:
  partition_dankal444

Code (Attempt This Online!):

from itertools import pairwise
import random
from bisect import bisect, bisect_left
from time import perf_counter as time
from statistics import mean, stdev
def partition_original(values, thresholds):
    &quot;&quot;&quot; Partition values into a list of sets 
        with values in right-open intervals specified by thresholds.
    &quot;&quot;&quot;
    return [ { v for v in values if v &lt; thresholds[0] }
       ] + [ { v for v in values if lo &lt;= v &lt; hi }
                                for lo, hi in tuple(pairwise(thresholds))
       ] + [ { v for v in values if thresholds[-1] &lt;= v } ]
def partition_Kelly(values, thresholds):
    res = [set() for _ in thresholds]
    res.append(set())
    for v in values:
        i = bisect(thresholds, v)
        res[i].add(v)
    return res
def partition_Kelly2(values, thresholds):
    res = [set() for _ in thresholds]
    res.append(set())
    for v in set(values):
        i = bisect(thresholds, v)
        res[i].add(v)
    return res
def partition_Kelly3(values, thresholds):
    def partition(values, thresholds):
        if not thresholds:
            return [set(values)]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        hi = []
        lo = [x for x in values if x &lt; threshold or hi.append(x)]
        return partition(lo, thresholds[:i]) + partition(hi, thresholds[i+1:])
    return partition(set(values), thresholds)
def partition_Kelly3b(values, thresholds):
    def partition(values, thresholds):
        if not thresholds:
            return [set(values)]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        hi = [].append
        lo = [x for x in values if x &lt; threshold or hi(x)]
        return partition(lo, thresholds[:i]) + partition(hi.__self__, thresholds[i+1:])
    return partition(set(values), thresholds)
def partition_Kelly4(values, thresholds):
    values = sorted(set(values))
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res
def partition_Kelly4a(values, thresholds):
    values = sorted(set(values))
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold, i)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res
def partition_Kelly4b(values, thresholds):
    values = sorted(values)
    res = []
    i = 0
    for threshold in thresholds:
        j = bisect_left(values, threshold)
        res.append(set(values[i:j]))
        i = j
    res.append(set(values[i:]))
    return res
def partition_Kelly4c(values, thresholds):
    def partition(start, stop, thresholds):
        if not thresholds:
            return [set(values[start:stop])]
        i = len(thresholds) // 2
        threshold = thresholds[i]
        j = bisect_left(values, threshold, start, stop)
        return partition(start, j, thresholds[:i]) + partition(j, stop, thresholds[i+1:])
    values = sorted(set(values))
    return partition(0, len(values), thresholds)
def partition_dankal444(values, thresholds):
    import bisect
    sets = [set() for _ in range(len(thresholds) + 1)]
    for value in values:
        set_idx = bisect.bisect_left(thresholds, value)
        sets[set_idx].add(value)
    return sets
funcs = [partition_original, partition_Kelly, partition_Kelly2, partition_Kelly3, partition_Kelly3b, partition_Kelly4, partition_Kelly4a, partition_Kelly4b, partition_Kelly4c, partition_dankal444]
wrong = set()
def test(n, k, repeat):
    print(n, &#39;values and&#39;, k, &#39;thresholds:&#39;)
    t0 = time()
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e3 for t in sorted(times[f])[:5]]
        return f&#39;{mean(ts):6.2f} &#177; {stdev(ts):4.2f} ms &#39;
    for _ in range(repeat):
        values = random.choices(range(n), k=n)
        thresholds = [int((i+.5)/k * n) for i in range(k)]
        expect = none = object()
        for f in funcs:
            t = time()
            result = f(values, thresholds)
            times[f].append(time() - t)
            if expect is none:
                expect = result
            elif result != expect:
                wrong.add(f)
        funcs[:] = [f for f in funcs if f not in wrong]
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()#time() - t0)
test(10**4, 3, 100)
test(10**4, 10, 80)
test(10**4, 10**2, 25)
print(&#39;Wrong:&#39;)
for f in wrong:
    print(&#39; &#39;, f.__name__)

答案2

得分: -1

使用 bisect.bisect_left 函数（文档）

import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
    set_idx = bisect.bisect_left(thresholds, value)
    sets[set_idx].add(value)

请注意，对于某些值/阈值大小，可能会更慢。随着大小的增加，性能应该比 OP 方法更好（O(nlogn) 对比 O(n^2)）

英文:

Using bisect.bisect_left function (docs)

import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
set_idx = bisect.bisect_left(thresholds, value)
sets[set_idx].add(value)

Note it might be slower for some sizes of values/thresholds. Larger the sizes performance should get better compared to OP method (O(nlogn) vs O(n^2))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

从值和阈值高效创建一系列集合

问题

答案1

答案2

基于前一个和后一个值填充中间数值

使用Pandas/Python将一个DataFrame中指定的列替换为另一个列

IndexError: 列表索引超出范围，位置在 table_data[headers[len(table_data)]] = values

ModuleNotFoundError: 找不到模块名为 ‘pandas’ 的模块（ubuntu）

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。