从值和阈值高效创建一系列集合

huangapple go评论110阅读模式
英文:

Efficient creation of a sequence of sets from values and thresholds

问题

以下是您要求的翻译内容:

给定一个按升序排列的较短阈值序列和众多值(无序)。

期望的结果是一个包含一系列 set 的序列,第一个包含所有低于最低/第一个阈值的不同值;接下来的值不低于最低阈值,但低于第二个阈值(如果有的话);以此类推,直到最后一个阈值;最后包含所有不低于最高阈值的值。

关于 dict 有类似的问题(对于 有帮助的 解决方案的指导也欢迎),建议如下:

  1. from itertools import pairwise
  2. def partition(values, thresholds):
  3. """ 将值划分为一系列集合,这些集合在由阈值指定的右开区间中包含值。"""
  4. return [ { v for v in values if v < thresholds[0] }
  5. ] + [ { v for v in values if lo <= v < hi }
  6. for lo, hi in tuple(pairwise(thresholds))
  7. ] + [ { v for v in values if thresholds[-1] <= v } ]

这个代码会对 values 迭代 len(thresholds)+1 次。

如何高效地创建一个根据 阈值values 进行划分的 set 序列?

我未能找到对 SciPy/NumPy 有帮助的东西。使用 numpy.digitize() 来索引一个 add() 成员的数组与非平凡值 阈值的 partition_Kelly3b() 在同一领域内。

英文:

Given a shortish sequence of thresholds ordered ascendantly and numerous values (unordered).

Wanted result is a sequence of sets, first one containing all distinct values below the lowest/first threshold; next values not below lowest threshold, but below 2nd threshold, if any; and so on 'til last threshold; finally all values not below highest threshold.

There are similar questions about dicts (pointers to helpful solutions there welcome, too),
suggestions amounting to

  1. from itertools import pairwise
  2. def partition(values, thresholds):
  3. """ Partition values into a list of sets
  4. with values in right-open intervals specified by thresholds.
  5. """
  6. return [ { v for v in values if v < thresholds[0] }
  7. ] + [ { v for v in values if lo <= v < hi }
  8. for lo, hi in tuple(pairwise(thresholds))
  9. ] + [ { v for v in values if thresholds[-1] <= v } ]

This "iterates" values len(thresholds)+1 times.

How to efficiently create a sequence of sets partitioning values according to thresholds?

I failed to find something helpful SciPy/NumPy.
Using numpy.digitize() to index an array of add() members was in the ball park of partition_Kelly3b() for non-trivial values and thresholds.

答案1

得分: 2

各种解决方案的测试结果:

  1. 10000个数和3个阈值:
  2. 0.83 ± 0.00毫秒 partition_Kelly4
  3. 0.83 ± 0.00毫秒 partition_Kelly4c
  4. 0.83 ± 0.01毫秒 partition_Kelly4a
  5. 1.42 ± 0.02毫秒 partition_Kelly3b
  6. 1.55 ± 0.02毫秒 partition_Kelly3
  7. 1.76 ± 0.02毫秒 partition_Kelly2
  8. 1.93 ± 0.00毫秒 partition_Kelly4b
  9. 2.04 ± 0.01毫秒 partition_Kelly
  10. 2.55 ± 0.03毫秒 partition_original
  11. 10000个数和10个阈值:
  12. 0.86 ± 0.01毫秒 partition_Kelly4a
  13. 0.87 ± 0.01毫秒 partition_Kelly4c
  14. 0.88 ± 0.02毫秒 partition_Kelly4
  15. 1.98 ± 0.03毫秒 partition_Kelly4b
  16. 2.03 ± 0.03毫秒 partition_Kelly3b
  17. 2.06 ± 0.05毫秒 partition_Kelly2
  18. 2.22 ± 0.05毫秒 partition_Kelly3
  19. 2.52 ± 0.02毫秒 partition_Kelly
  20. 6.19 ± 0.19毫秒 partition_original
  21. 10000个数和100个阈值:
  22. 0.94 ± 0.02毫秒 partition_Kelly4a
  23. 0.97 ± 0.02毫秒 partition_Kelly4
  24. 0.99 ± 0.02毫秒 partition_Kelly4c
  25. 2.05 ± 0.03毫秒 partition_Kelly4b
  26. 2.62 ± 0.17毫秒 partition_Kelly2
  27. 3.41 ± 0.05毫秒 partition_Kelly
  28. 3.58 ± 0.33毫秒 partition_Kelly3b
  29. 3.91 ± 0.25毫秒 partition_Kelly3
  30. 60.49 ± 10.98毫秒 partition_original
  31. 错误:
  32. partition_dankal444
英文:

Test results of various solutions:

  1. 10000 values and 3 thresholds:
  2. 0.83 ± 0.00 ms partition_Kelly4
  3. 0.83 ± 0.00 ms partition_Kelly4c
  4. 0.83 ± 0.01 ms partition_Kelly4a
  5. 1.42 ± 0.02 ms partition_Kelly3b
  6. 1.55 ± 0.02 ms partition_Kelly3
  7. 1.76 ± 0.02 ms partition_Kelly2
  8. 1.93 ± 0.00 ms partition_Kelly4b
  9. 2.04 ± 0.01 ms partition_Kelly
  10. 2.55 ± 0.03 ms partition_original
  11. 10000 values and 10 thresholds:
  12. 0.86 ± 0.01 ms partition_Kelly4a
  13. 0.87 ± 0.01 ms partition_Kelly4c
  14. 0.88 ± 0.02 ms partition_Kelly4
  15. 1.98 ± 0.03 ms partition_Kelly4b
  16. 2.03 ± 0.03 ms partition_Kelly3b
  17. 2.06 ± 0.05 ms partition_Kelly2
  18. 2.22 ± 0.05 ms partition_Kelly3
  19. 2.52 ± 0.02 ms partition_Kelly
  20. 6.19 ± 0.19 ms partition_original
  21. 10000 values and 100 thresholds:
  22. 0.94 ± 0.02 ms partition_Kelly4a
  23. 0.97 ± 0.02 ms partition_Kelly4
  24. 0.99 ± 0.02 ms partition_Kelly4c
  25. 2.05 ± 0.03 ms partition_Kelly4b
  26. 2.62 ± 0.17 ms partition_Kelly2
  27. 3.41 ± 0.05 ms partition_Kelly
  28. 3.58 ± 0.33 ms partition_Kelly3b
  29. 3.91 ± 0.25 ms partition_Kelly3
  30. 60.49 ± 10.98 ms partition_original
  31. Wrong:
  32. partition_dankal444

Code (Attempt This Online!):

  1. from itertools import pairwise
  2. import random
  3. from bisect import bisect, bisect_left
  4. from time import perf_counter as time
  5. from statistics import mean, stdev
  6. def partition_original(values, thresholds):
  7. """ Partition values into a list of sets
  8. with values in right-open intervals specified by thresholds.
  9. """
  10. return [ { v for v in values if v < thresholds[0] }
  11. ] + [ { v for v in values if lo <= v < hi }
  12. for lo, hi in tuple(pairwise(thresholds))
  13. ] + [ { v for v in values if thresholds[-1] <= v } ]
  14. def partition_Kelly(values, thresholds):
  15. res = [set() for _ in thresholds]
  16. res.append(set())
  17. for v in values:
  18. i = bisect(thresholds, v)
  19. res[i].add(v)
  20. return res
  21. def partition_Kelly2(values, thresholds):
  22. res = [set() for _ in thresholds]
  23. res.append(set())
  24. for v in set(values):
  25. i = bisect(thresholds, v)
  26. res[i].add(v)
  27. return res
  28. def partition_Kelly3(values, thresholds):
  29. def partition(values, thresholds):
  30. if not thresholds:
  31. return [set(values)]
  32. i = len(thresholds) // 2
  33. threshold = thresholds[i]
  34. hi = []
  35. lo = [x for x in values if x < threshold or hi.append(x)]
  36. return partition(lo, thresholds[:i]) + partition(hi, thresholds[i+1:])
  37. return partition(set(values), thresholds)
  38. def partition_Kelly3b(values, thresholds):
  39. def partition(values, thresholds):
  40. if not thresholds:
  41. return [set(values)]
  42. i = len(thresholds) // 2
  43. threshold = thresholds[i]
  44. hi = [].append
  45. lo = [x for x in values if x < threshold or hi(x)]
  46. return partition(lo, thresholds[:i]) + partition(hi.__self__, thresholds[i+1:])
  47. return partition(set(values), thresholds)
  48. def partition_Kelly4(values, thresholds):
  49. values = sorted(set(values))
  50. res = []
  51. i = 0
  52. for threshold in thresholds:
  53. j = bisect_left(values, threshold)
  54. res.append(set(values[i:j]))
  55. i = j
  56. res.append(set(values[i:]))
  57. return res
  58. def partition_Kelly4a(values, thresholds):
  59. values = sorted(set(values))
  60. res = []
  61. i = 0
  62. for threshold in thresholds:
  63. j = bisect_left(values, threshold, i)
  64. res.append(set(values[i:j]))
  65. i = j
  66. res.append(set(values[i:]))
  67. return res
  68. def partition_Kelly4b(values, thresholds):
  69. values = sorted(values)
  70. res = []
  71. i = 0
  72. for threshold in thresholds:
  73. j = bisect_left(values, threshold)
  74. res.append(set(values[i:j]))
  75. i = j
  76. res.append(set(values[i:]))
  77. return res
  78. def partition_Kelly4c(values, thresholds):
  79. def partition(start, stop, thresholds):
  80. if not thresholds:
  81. return [set(values[start:stop])]
  82. i = len(thresholds) // 2
  83. threshold = thresholds[i]
  84. j = bisect_left(values, threshold, start, stop)
  85. return partition(start, j, thresholds[:i]) + partition(j, stop, thresholds[i+1:])
  86. values = sorted(set(values))
  87. return partition(0, len(values), thresholds)
  88. def partition_dankal444(values, thresholds):
  89. import bisect
  90. sets = [set() for _ in range(len(thresholds) + 1)]
  91. for value in values:
  92. set_idx = bisect.bisect_left(thresholds, value)
  93. sets[set_idx].add(value)
  94. return sets
  95. funcs = [partition_original, partition_Kelly, partition_Kelly2, partition_Kelly3, partition_Kelly3b, partition_Kelly4, partition_Kelly4a, partition_Kelly4b, partition_Kelly4c, partition_dankal444]
  96. wrong = set()
  97. def test(n, k, repeat):
  98. print(n, 'values and', k, 'thresholds:')
  99. t0 = time()
  100. times = {f: [] for f in funcs}
  101. def stats(f):
  102. ts = [t * 1e3 for t in sorted(times[f])[:5]]
  103. return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
  104. for _ in range(repeat):
  105. values = random.choices(range(n), k=n)
  106. thresholds = [int((i+.5)/k * n) for i in range(k)]
  107. expect = none = object()
  108. for f in funcs:
  109. t = time()
  110. result = f(values, thresholds)
  111. times[f].append(time() - t)
  112. if expect is none:
  113. expect = result
  114. elif result != expect:
  115. wrong.add(f)
  116. funcs[:] = [f for f in funcs if f not in wrong]
  117. for f in sorted(funcs, key=stats):
  118. print(stats(f), f.__name__)
  119. print()#time() - t0)
  120. test(10**4, 3, 100)
  121. test(10**4, 10, 80)
  122. test(10**4, 10**2, 25)
  123. print('Wrong:')
  124. for f in wrong:
  125. print(' ', f.__name__)

答案2

得分: -1

使用 bisect.bisect_left 函数(文档

  1. import bisect
  2. sets = [set() for _ in range(len(thresholds) + 1)]
  3. for value in values:
  4. set_idx = bisect.bisect_left(thresholds, value)
  5. sets[set_idx].add(value)

请注意,对于某些值/阈值大小,可能会更慢。随着大小的增加,性能应该比 OP 方法更好(O(nlogn) 对比 O(n^2)

英文:

Using bisect.bisect_left function (docs)

  1. import bisect
  2. sets = [set() for _ in range(len(thresholds) + 1)]
  3. for value in values:
  4. set_idx = bisect.bisect_left(thresholds, value)
  5. sets[set_idx].add(value)

Note it might be slower for some sizes of values/thresholds. Larger the sizes performance should get better compared to OP method (O(nlogn) vs O(n^2))

huangapple
  • 本文由 发表于 2023年5月14日 19:12:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76247167.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定