Performance doesn't improve with Ray working on 4-CPU-cores
Question

I'm trying to rerun the [tag:ray] tutorial on my machine, but I'm failing to reproduce the performance improvements shown in the tutorial.
What could be the reason for it? I have tried looking for solutions but am still not able to understand.
Answer 1
Score: 3
> Q : What could be the reason for it?
... an extremely low (almost zero) [PARALLEL] code-execution portion.
When the add-on overheads are added to the revised Amdahl's Law, "negative" speedups << 1 (i.e. slowdowns) become obvious.
Amdahl's Law defines the rationale of the WHY; next comes the WHAT:
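A back-of-the-envelope sketch (my illustration, not part of the original answer; the names T_serial, p, N and T_overhead are just illustrative) of how an add-on overhead term pushes the achievable "speedup" below 1 for tiny workloads, while staying negligible for large ones:

def amdahl_speedup( T_serial, p, N, T_overhead ):
    # T_serial   : single-CPU runtime of the whole job
    # p          : fraction of T_serial that can actually run in parallel
    # N          : number of workers
    # T_overhead : add-on setup / SER-DES / transfer costs paid for going parallel
    T_parallel = ( 1 - p ) * T_serial + p * T_serial / N + T_overhead
    return T_serial / T_parallel

print( amdahl_speedup( T_serial = 1e-5, p = 0.9, N = 4, T_overhead = 1e-3 ) )  # tiny job : "speedup" ~ 0.01 << 1, a slowdown
print( amdahl_speedup( T_serial = 1e+2, p = 0.9, N = 4, T_overhead = 1e-3 ) )  # big job  : overhead negligible, speedup ~ 3.1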
First:
Never start "benchmarking" without having correctly isolated the SuT - the System-under-Test, here the distributed form of the computation.
Here, the start = time.time() placed "in front of" the import ray statement seems to be rather a provocative test of the readers' concentration on the subject, and definitely not a sign of a properly engineered use-case test design: you knowingly take into the measured time all the disk-I/O latency, the data transfers from disk into the Python session, the time-domain costs of syntax-checking the imported module and, yes, its interpretation (conditions the second test does not share).
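(A side note, not part of the original answer: time.perf_counter() is usually a better clock than time.time() for such measurements, being monotonic and of higher resolution, and the one-off import cost can be measured on its own, outside any compute benchmark - a minimal sketch:)

import time
t0 = time.perf_counter()                             # monotonic, high-resolution timer
import ray                                           # the one-off disk-I/O, syntax-check and interpretation costs land here
print( time.perf_counter() - t0, "s for import ray alone" )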
Next:
After shaving off the costs of the import, one may start to compare "apples to apples":
...
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start = time.time()
futures = [ f(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " pure-[SERIAL] python execution" )
#----------------------------------------- ORANGES
start = time.time()
import ray # costs of initial import
ray.init( num_cpus = 4 ) # costs of parametrisation
@ray.remote # costs of decorated def(s)
def f( x ):
    return x * x
print( time.time() - start )
print( 60*"_" + " ray add-on overheads" )
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start = time.time()
futures = [ f.remote(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " ray.remote-decorated python execution" )
Next comes the scaling :
For miniature scales of use, like building all the artillery of parallel/distributed code-execution for just -4- calls, measurements are possible, yet skewed by many hardware-related and software-related effects (memory allocations and cache side-effects are most often the performance blockers, once the SuT has been crafted well enough not to let the setup overheads overshadow these typical HPC core troubles).
>>> import dis
>>> def f( x ):
...     return x * x
...
>>> dis.dis( f )
2 0 LOAD_FAST 0 (x)
3 LOAD_FAST 0 (x)
6 BINARY_MULTIPLY
7 RETURN_VALUE
Having "low density" of computing ( here taking just one MUL x, x
in a straight RET
) will never justify all the initial setup-costs and all the per-call add-on overhead-costs, that matter in small computing-density cases, not so in complex and CPU-intensive HPC computing tasks ( for which the Amdahl's Law says,where are the principal limits for achievable speedups stay ).
The next snippet will show the average per-call cost of the f.remote()-calls, spread over the 4-CPU ray.remote-processing paths, compared with a plain, monopolistic, GIL-stepped mode of processing (for details on [min, Avg, MAX, StDev] see other benchmarking posts):
#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
test_4N = 1E6 # 1E6, 1E9, ... larger values may throw exception due to a poor ( not well fused ) range()-iterator construction, workarounds possible
start = time.time()
futures = [ f.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) / test_4N )
print( 60*"_" + " ray.remote-decorated python execution per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
start = time.time()
futures = [ f(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) / test_4N )            # parenthesise the elapsed time before dividing by test_4N
print( 60*"_" + " pure-[SERIAL] python execution per one call" )
Bonus part

Smoke-on :

If indeed interested in burning some more fuel to make some sense of how intensive computing may benefit from a just-[CONCURRENT], a true-[PARALLEL] or even a [tag:distributed-computing] mode of processing, try to add more CPU-intensive computing, some remarkable RAM memory allocations going well beyond the CPU core's L3-cache sizes, pass larger BLOB-s between processes in the parameter and result(s)' transfers, live near (if not slightly beyond) the O/S's efficient process-switching and RAM-swapping limits - simply go closer towards the real-life computing problems, where latency and the resulting performance indeed matter:
import numpy as np

@ray.remote
def bigSmoke( voidPar = -1 ):
    # np.math.factorial( 2**18 )   ~ this has to compute a lot
    # str( (2**18)! )              ~ this has to allocate quite some RAM, ~ 1.3 MB per result string
    # 100 x str( (2**18)! )        ~ ~ 130 MB in total, plus the SER/DES-transformation and
    #                                process-to-process transfer costs of the returned result(s)
    # set the repetition count with care - above physical RAM sizes the O/S starts swapping and
    # the otherwise CPU-bound processing degrades immensely
    return [ str( np.math.factorial( i ) )           # a list comprehension ( not a generator passed to str() )
             for i in int( 1E2 ) * ( 2**18, )        # a tuple holding 100 copies of 2**18
             ][-1]                                   # <-- returning just the last string reduces the SER/DES & transfer costs
...
#----------------------------------------- APPLES-TO-APPLES scaled + computing
test_4N = 1E1 # be cautious here, may start from 1E0, 1E1, 1E2, 1E3 ...
start = time.time()
futures = [ bigSmoke.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) / test_4N )
print( 60*"_" + " ray.remote-decorated set of numpy.math.factorial( 2**18 ) per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled + computing
start = time.time()
futures = [ bigSmoke(i) for i in range( int( test_4N ) ) ]   # NB: assumes a plain, undecorated twin of bigSmoke() here - a @ray.remote-decorated function cannot be called directly
print( ( time.time() - start ) / test_4N )
print( 60*"_" + " pure-[SERIAL] python execution of a set of numpy.math.factorial( 2**18 ) per one call" )
Anyway, be warned that premature optimisation efforts are prone to mislead one's focus, so feel free to read the performance-tuning stories so often presented here on Stack Overflow.
Answer 2
Score: 1
Multiprocessing creates time overhead - I think the base function here is so quick that the overhead takes the majority of the time. Does the tutorial really use a simple integer as input? If you use a large array as input, you should see an improvement.
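A minimal sketch of that suggestion (my illustration, not from the answer; heavy_chunk and its parameters are made up for the example): a CPU-heavy reduction over large NumPy arrays, where the per-task work is big enough for the Ray overhead to pay off on a 4-core machine:

import time
import numpy as np
import ray

ray.init( num_cpus = 4 )

def heavy_chunk( seed, n = 5_000_000 ):
    # deliberately CPU- and RAM-heavy work on a large array
    rng = np.random.default_rng( seed )
    a   = rng.random( n )
    return float( np.sum( np.sqrt( a ) * np.log1p( a ) ) )

heavy_chunk_remote = ray.remote( heavy_chunk )        # keep the plain copy for the serial baseline

start  = time.time()
serial = [ heavy_chunk( s ) for s in range( 8 ) ]
print( time.time() - start, "s pure-[SERIAL]" )

start    = time.time()
parallel = ray.get( [ heavy_chunk_remote.remote( s ) for s in range( 8 ) ] )   # block until all chunks finish
print( time.time() - start, "s with 4 ray workers" )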