英文:
Efficient creation of a sequence of sets from values and thresholds
问题
以下是您要求的翻译内容:
给定一个按升序排列的较短阈值序列和众多值(无序)。
期望的结果是一个包含一系列 set
的序列,第一个包含所有低于最低/第一个阈值的不同值;接下来的值不低于最低阈值,但低于第二个阈值(如果有的话);以此类推,直到最后一个阈值;最后包含所有不低于最高阈值的值。
关于 dict
有类似的问题(对于 有帮助的 解决方案的指导也欢迎),建议如下:
from itertools import pairwise
def partition(values, thresholds):
""" 将值划分为一系列集合,这些集合在由阈值指定的右开区间中包含值。"""
return [ { v for v in values if v < thresholds[0] }
] + [ { v for v in values if lo <= v < hi }
for lo, hi in tuple(pairwise(thresholds))
] + [ { v for v in values if thresholds[-1] <= v } ]
这个代码会对 values
迭代 len(thresholds)+1
次。
如何高效地创建一个根据 阈值 对 values 进行划分的 set
序列?
我未能找到对 SciPy/NumPy 有帮助的东西。使用 numpy.digitize()
来索引一个 add()
成员的数组与非平凡值 和 阈值的 partition_Kelly3b()
在同一领域内。
英文:
Given a shortish sequence of thresholds ordered ascendantly and numerous values (unordered).
Wanted result is a sequence of set
s, first one containing all distinct values below the lowest/first threshold; next values not below lowest threshold, but below 2nd threshold, if any; and so on 'til last threshold; finally all values not below highest threshold.
There are similar questions about dict
s (pointers to helpful solutions there welcome, too),
suggestions amounting to
from itertools import pairwise
def partition(values, thresholds):
""" Partition values into a list of sets
with values in right-open intervals specified by thresholds.
"""
return [ { v for v in values if v < thresholds[0] }
] + [ { v for v in values if lo <= v < hi }
for lo, hi in tuple(pairwise(thresholds))
] + [ { v for v in values if thresholds[-1] <= v } ]
This "iterates" values
len(thresholds)+1
times.
How to efficiently create a sequence of set
s partitioning values according to thresholds?
I failed to find something helpful SciPy/NumPy.
Using numpy.digitize()
to index an array of add()
members was in the ball park of partition_Kelly3b()
for non-trivial values and thresholds.
答案1
得分: 2
各种解决方案的测试结果:
10000个数和3个阈值:
0.83 ± 0.00毫秒 partition_Kelly4
0.83 ± 0.00毫秒 partition_Kelly4c
0.83 ± 0.01毫秒 partition_Kelly4a
1.42 ± 0.02毫秒 partition_Kelly3b
1.55 ± 0.02毫秒 partition_Kelly3
1.76 ± 0.02毫秒 partition_Kelly2
1.93 ± 0.00毫秒 partition_Kelly4b
2.04 ± 0.01毫秒 partition_Kelly
2.55 ± 0.03毫秒 partition_original
10000个数和10个阈值:
0.86 ± 0.01毫秒 partition_Kelly4a
0.87 ± 0.01毫秒 partition_Kelly4c
0.88 ± 0.02毫秒 partition_Kelly4
1.98 ± 0.03毫秒 partition_Kelly4b
2.03 ± 0.03毫秒 partition_Kelly3b
2.06 ± 0.05毫秒 partition_Kelly2
2.22 ± 0.05毫秒 partition_Kelly3
2.52 ± 0.02毫秒 partition_Kelly
6.19 ± 0.19毫秒 partition_original
10000个数和100个阈值:
0.94 ± 0.02毫秒 partition_Kelly4a
0.97 ± 0.02毫秒 partition_Kelly4
0.99 ± 0.02毫秒 partition_Kelly4c
2.05 ± 0.03毫秒 partition_Kelly4b
2.62 ± 0.17毫秒 partition_Kelly2
3.41 ± 0.05毫秒 partition_Kelly
3.58 ± 0.33毫秒 partition_Kelly3b
3.91 ± 0.25毫秒 partition_Kelly3
60.49 ± 10.98毫秒 partition_original
错误:
partition_dankal444
英文:
Test results of various solutions:
10000 values and 3 thresholds:
0.83 ± 0.00 ms partition_Kelly4
0.83 ± 0.00 ms partition_Kelly4c
0.83 ± 0.01 ms partition_Kelly4a
1.42 ± 0.02 ms partition_Kelly3b
1.55 ± 0.02 ms partition_Kelly3
1.76 ± 0.02 ms partition_Kelly2
1.93 ± 0.00 ms partition_Kelly4b
2.04 ± 0.01 ms partition_Kelly
2.55 ± 0.03 ms partition_original
10000 values and 10 thresholds:
0.86 ± 0.01 ms partition_Kelly4a
0.87 ± 0.01 ms partition_Kelly4c
0.88 ± 0.02 ms partition_Kelly4
1.98 ± 0.03 ms partition_Kelly4b
2.03 ± 0.03 ms partition_Kelly3b
2.06 ± 0.05 ms partition_Kelly2
2.22 ± 0.05 ms partition_Kelly3
2.52 ± 0.02 ms partition_Kelly
6.19 ± 0.19 ms partition_original
10000 values and 100 thresholds:
0.94 ± 0.02 ms partition_Kelly4a
0.97 ± 0.02 ms partition_Kelly4
0.99 ± 0.02 ms partition_Kelly4c
2.05 ± 0.03 ms partition_Kelly4b
2.62 ± 0.17 ms partition_Kelly2
3.41 ± 0.05 ms partition_Kelly
3.58 ± 0.33 ms partition_Kelly3b
3.91 ± 0.25 ms partition_Kelly3
60.49 ± 10.98 ms partition_original
Wrong:
partition_dankal444
Code (Attempt This Online!):
from itertools import pairwise
import random
from bisect import bisect, bisect_left
from time import perf_counter as time
from statistics import mean, stdev
def partition_original(values, thresholds):
""" Partition values into a list of sets
with values in right-open intervals specified by thresholds.
"""
return [ { v for v in values if v < thresholds[0] }
] + [ { v for v in values if lo <= v < hi }
for lo, hi in tuple(pairwise(thresholds))
] + [ { v for v in values if thresholds[-1] <= v } ]
def partition_Kelly(values, thresholds):
res = [set() for _ in thresholds]
res.append(set())
for v in values:
i = bisect(thresholds, v)
res[i].add(v)
return res
def partition_Kelly2(values, thresholds):
res = [set() for _ in thresholds]
res.append(set())
for v in set(values):
i = bisect(thresholds, v)
res[i].add(v)
return res
def partition_Kelly3(values, thresholds):
def partition(values, thresholds):
if not thresholds:
return [set(values)]
i = len(thresholds) // 2
threshold = thresholds[i]
hi = []
lo = [x for x in values if x < threshold or hi.append(x)]
return partition(lo, thresholds[:i]) + partition(hi, thresholds[i+1:])
return partition(set(values), thresholds)
def partition_Kelly3b(values, thresholds):
def partition(values, thresholds):
if not thresholds:
return [set(values)]
i = len(thresholds) // 2
threshold = thresholds[i]
hi = [].append
lo = [x for x in values if x < threshold or hi(x)]
return partition(lo, thresholds[:i]) + partition(hi.__self__, thresholds[i+1:])
return partition(set(values), thresholds)
def partition_Kelly4(values, thresholds):
values = sorted(set(values))
res = []
i = 0
for threshold in thresholds:
j = bisect_left(values, threshold)
res.append(set(values[i:j]))
i = j
res.append(set(values[i:]))
return res
def partition_Kelly4a(values, thresholds):
values = sorted(set(values))
res = []
i = 0
for threshold in thresholds:
j = bisect_left(values, threshold, i)
res.append(set(values[i:j]))
i = j
res.append(set(values[i:]))
return res
def partition_Kelly4b(values, thresholds):
values = sorted(values)
res = []
i = 0
for threshold in thresholds:
j = bisect_left(values, threshold)
res.append(set(values[i:j]))
i = j
res.append(set(values[i:]))
return res
def partition_Kelly4c(values, thresholds):
def partition(start, stop, thresholds):
if not thresholds:
return [set(values[start:stop])]
i = len(thresholds) // 2
threshold = thresholds[i]
j = bisect_left(values, threshold, start, stop)
return partition(start, j, thresholds[:i]) + partition(j, stop, thresholds[i+1:])
values = sorted(set(values))
return partition(0, len(values), thresholds)
def partition_dankal444(values, thresholds):
import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
set_idx = bisect.bisect_left(thresholds, value)
sets[set_idx].add(value)
return sets
funcs = [partition_original, partition_Kelly, partition_Kelly2, partition_Kelly3, partition_Kelly3b, partition_Kelly4, partition_Kelly4a, partition_Kelly4b, partition_Kelly4c, partition_dankal444]
wrong = set()
def test(n, k, repeat):
print(n, 'values and', k, 'thresholds:')
t0 = time()
times = {f: [] for f in funcs}
def stats(f):
ts = [t * 1e3 for t in sorted(times[f])[:5]]
return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
for _ in range(repeat):
values = random.choices(range(n), k=n)
thresholds = [int((i+.5)/k * n) for i in range(k)]
expect = none = object()
for f in funcs:
t = time()
result = f(values, thresholds)
times[f].append(time() - t)
if expect is none:
expect = result
elif result != expect:
wrong.add(f)
funcs[:] = [f for f in funcs if f not in wrong]
for f in sorted(funcs, key=stats):
print(stats(f), f.__name__)
print()#time() - t0)
test(10**4, 3, 100)
test(10**4, 10, 80)
test(10**4, 10**2, 25)
print('Wrong:')
for f in wrong:
print(' ', f.__name__)
答案2
得分: -1
使用 bisect.bisect_left
函数(文档)
import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
set_idx = bisect.bisect_left(thresholds, value)
sets[set_idx].add(value)
请注意,对于某些值/阈值大小,可能会更慢。随着大小的增加,性能应该比 OP 方法更好(O(nlogn)
对比 O(n^2)
)
英文:
Using bisect.bisect_left
function (docs)
import bisect
sets = [set() for _ in range(len(thresholds) + 1)]
for value in values:
set_idx = bisect.bisect_left(thresholds, value)
sets[set_idx].add(value)
Note it might be slower for some sizes of values/thresholds. Larger the sizes performance should get better compared to OP method (O(nlogn)
vs O(n^2)
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论