Error happens not always but just sometimes in MPI

Question
I'm trying to run a parallel program written in C with MPI. I fixed the initial conditions so that every run is identical to the previous one, yet sometimes I get an error. The error alternates between a few variants; the first looks like this:
[MBP-di-Simone:76276] *** Process received signal ***
[MBP-di-Simone:76276] Signal: Abort trap: 6 (6)
[MBP-di-Simone:76276] Signal code: (0)
[MBP-di-Simone:76276] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:76276] [ 1] 0 ??? 0x0000000000000201 0x0 + 513
[MBP-di-Simone:76276] [ 2] 0 libsystem_c.dylib 0x00007ff80f836ca5 abort + 123
[MBP-di-Simone:76276] [ 3] 0 libsystem_malloc.dylib 0x00007ff80f74ca37 malloc_vreport + 888
[MBP-di-Simone:76276] [ 4] 0 libsystem_malloc.dylib 0x00007ff80f761959 malloc_zone_error + 178
[MBP-di-Simone:76276] [ 5] 0 libsystem_malloc.dylib 0x00007ff80f75a12b nanov2_guard_corruption_detected + 34
[MBP-di-Simone:76276] [ 6] 0 libsystem_malloc.dylib 0x00007ff80f7596e9 nanov2_allocate_outlined + 374
[MBP-di-Simone:76276] [ 7] 0 libsystem_malloc.dylib 0x00007ff80f73f59d nanov2_malloc + 526
[MBP-di-Simone:76276] [ 8] 0 libopen-pal.40.dylib 0x000000010562f5eb opal_show_help_yyensure_buffer_stack + 128
[MBP-di-Simone:76276] [ 9] 0 libopen-pal.40.dylib 0x000000010562f8b3 opal_show_help_yy_switch_to_buffer + 14
[MBP-di-Simone:76276] [10] 0 libopen-pal.40.dylib 0x000000010562fdc5 opal_show_help_init_buffer + 22
[MBP-di-Simone:76276] [11] 0 libopen-pal.40.dylib 0x000000010562e777 opal_show_help_vstring + 386
[MBP-di-Simone:76276] [12] 0 libopen-rte.40.dylib 0x00000001057cb9e8 orte_show_help + 171
[MBP-di-Simone:76276] [13] 0 libmpi.40.dylib 0x000000010569c9ba backend_fatal + 606
[MBP-di-Simone:76276] [14] 0 libmpi.40.dylib 0x000000010569c73e ompi_mpi_errors_are_fatal_comm_handler + 159
[MBP-di-Simone:76276] [15] 0 libmpi.40.dylib 0x000000010569c3df ompi_errhandler_invoke + 99
[MBP-di-Simone:76276] [16] 0 libmpi.40.dylib 0x00000001056d5b33 MPI_Recv + 651
[MBP-di-Simone:76276] [17] 0 main 0x00000001054dd1d6 main + 3958
[MBP-di-Simone:76276] [18] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:76276] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 6 (Abort trap: 6).
Even though the executions are identical, the error doesn't occur on every run. I can't understand what it means. Can somebody explain the meaning of this error, and how it's possible that it sometimes happens and sometimes doesn't?
Edit: here is the link to the code: https://github.com/simonerusso97/bfr-clustering-parallelized
Note that it is a clustering algorithm; the repository includes the CSV dataset I'm testing on.
Edit: I searched for missing malloc casts and added them everywhere, and the error changed to this:
[MBP-di-Simone:85691] *** Process received signal ***
[MBP-di-Simone:85691] Signal: Abort trap: 6 (6)
[MBP-di-Simone:85691] Signal code: (0)
[MBP-di-Simone:85691] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:85691] [ 1] 0 ??? 0x00007ff7b74bced0 0x0 + 140701908848336
[MBP-di-Simone:85691] [ 2] 0 libsystem_c.dylib 0x00007ff80f836ca5 abort + 123
[MBP-di-Simone:85691] [ 3] 0 libsystem_malloc.dylib 0x00007ff80f74ca37 malloc_vreport + 888
[MBP-di-Simone:85691] [ 4] 0 libsystem_malloc.dylib 0x00007ff80f761959 malloc_zone_error + 178
[MBP-di-Simone:85691] [ 5] 0 libsystem_malloc.dylib 0x00007ff80f745aa6 small_free_list_remove_ptr_no_clear + 1000
[MBP-di-Simone:85691] [ 6] 0 libsystem_malloc.dylib 0x00007ff80f742f7c free_small + 619
[MBP-di-Simone:85691] [ 7] 0 libmpi.40.dylib 0x0000000108c89115 mca_coll_base_comm_unselect + 11654
[MBP-di-Simone:85691] [ 8] 0 libmpi.40.dylib 0x0000000108bf81aa ompi_comm_destruct + 32
[MBP-di-Simone:85691] [ 9] 0 libmpi.40.dylib 0x0000000108bf9120 ompi_comm_finalize + 232
[MBP-di-Simone:85691] [10] 0 libmpi.40.dylib 0x0000000108c13620 ompi_mpi_finalize + 923
[MBP-di-Simone:85691] [11] 0 main 0x0000000108a4671b main + 5163
[MBP-di-Simone:85691] [12] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:85691] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: MBP-di-Simone.homenet.telecomitalia.it
System call: unlink(2) /var/folders/d6/st4lmq9x55bd2cw12j85xz8m0000gn/T//ompi.MBP-di-Simone.501/pid.85690/1/vader_segment.MBP-di-Simone.501.ae5e0001.1
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[MBP-di-Simone.homenet.telecomitalia.it:85690] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 6 (Abort trap: 6).
alternating with this:
[MBP-di-Simone:86533] *** Process received signal ***
[MBP-di-Simone:86533] Signal: Segmentation fault: 11 (11)
[MBP-di-Simone:86533] Signal code: (0)
[MBP-di-Simone:86533] Failing at address: 0x0
[MBP-di-Simone:86533] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:86533] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
[MBP-di-Simone:86533] [ 2] 0 libmpi.40.dylib 0x000000010f02d1aa ompi_comm_destruct + 32
[MBP-di-Simone:86533] [ 3] 0 libmpi.40.dylib 0x000000010f02e069 ompi_comm_finalize + 49
[MBP-di-Simone:86533] [ 4] 0 libmpi.40.dylib 0x000000010f048620 ompi_mpi_finalize + 923
[MBP-di-Simone:86533] [ 5] 0 main 0x000000010ee7b71b main + 5163
[MBP-di-Simone:86533] [ 6] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:86533] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: MBP-di-Simone.homenet.telecomitalia.it
System call: unlink(2) /var/folders/d6/st4lmq9x55bd2cw12j85xz8m0000gn/T//ompi.MBP-di-Simone.501/pid.86532/1/vader_segment.MBP-di-Simone.501.b2e00001.1
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[MBP-di-Simone.homenet.telecomitalia.it:86532] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
The strange thing is that the initial conditions are chosen so that the execution is always the same, yet the error isn't thrown on every run, only sometimes. Reading the error message, it seems to happen in the MPI_Finalize() call, but I can't understand why.
Moreover, on rare occasions it prints this message:
[MBP-di-Simone:86488] *** Process received signal ***
but it doesn't terminate; it just hangs, and I don't think it continues the computation.
Answer 1

Score: 1
The stack trace shows that `MPI_Recv()` failed for an unknown reason, invoked the error handler (which is the default behavior), and the error handler then crashed inside `malloc()`. The names of the functions invoked by `malloc()` strongly suggest some kind of memory corruption rather than the system running out of memory.
Note that the default behavior of Open MPI is to check the parameters of MPI subroutines (e.g. valid communicator, committed datatype, non-negative count, ...), so the `MPI_Recv()` failure could be caused either by invalid parameters or by some internal failure. I would guess the latter, but who knows.
My best bet would be to use `valgrind` and try to spot some memory corruption. You might also try forcing communications over TCP/IP in order to rule out interconnect-specific code paths that could confuse `valgrind`:

mpirun --mca pml ob1 --mca btl tcp,self ... valgrind ...