Error happens not always but just sometimes in MPI

Question
I'm trying to run a parallel program written in C with MPI. I fixed the initial conditions so that every run is identical to the previous one, yet sometimes I get an error. The error alternates between a few variants; the first looks like this:
[MBP-di-Simone:76276] *** Process received signal ***
[MBP-di-Simone:76276] Signal: Abort trap: 6 (6)
[MBP-di-Simone:76276] Signal code: (0)
[MBP-di-Simone:76276] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:76276] [ 1] 0 ??? 0x0000000000000201 0x0 + 513
[MBP-di-Simone:76276] [ 2] 0 libsystem_c.dylib 0x00007ff80f836ca5 abort + 123
[MBP-di-Simone:76276] [ 3] 0 libsystem_malloc.dylib 0x00007ff80f74ca37 malloc_vreport + 888
[MBP-di-Simone:76276] [ 4] 0 libsystem_malloc.dylib 0x00007ff80f761959 malloc_zone_error + 178
[MBP-di-Simone:76276] [ 5] 0 libsystem_malloc.dylib 0x00007ff80f75a12b nanov2_guard_corruption_detected + 34
[MBP-di-Simone:76276] [ 6] 0 libsystem_malloc.dylib 0x00007ff80f7596e9 nanov2_allocate_outlined + 374
[MBP-di-Simone:76276] [ 7] 0 libsystem_malloc.dylib 0x00007ff80f73f59d nanov2_malloc + 526
[MBP-di-Simone:76276] [ 8] 0 libopen-pal.40.dylib 0x000000010562f5eb opal_show_help_yyensure_buffer_stack + 128
[MBP-di-Simone:76276] [ 9] 0 libopen-pal.40.dylib 0x000000010562f8b3 opal_show_help_yy_switch_to_buffer + 14
[MBP-di-Simone:76276] [10] 0 libopen-pal.40.dylib 0x000000010562fdc5 opal_show_help_init_buffer + 22
[MBP-di-Simone:76276] [11] 0 libopen-pal.40.dylib 0x000000010562e777 opal_show_help_vstring + 386
[MBP-di-Simone:76276] [12] 0 libopen-rte.40.dylib 0x00000001057cb9e8 orte_show_help + 171
[MBP-di-Simone:76276] [13] 0 libmpi.40.dylib 0x000000010569c9ba backend_fatal + 606
[MBP-di-Simone:76276] [14] 0 libmpi.40.dylib 0x000000010569c73e ompi_mpi_errors_are_fatal_comm_handler + 159
[MBP-di-Simone:76276] [15] 0 libmpi.40.dylib 0x000000010569c3df ompi_errhandler_invoke + 99
[MBP-di-Simone:76276] [16] 0 libmpi.40.dylib 0x00000001056d5b33 MPI_Recv + 651
[MBP-di-Simone:76276] [17] 0 main 0x00000001054dd1d6 main + 3958
[MBP-di-Simone:76276] [18] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:76276] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 6 (Abort trap: 6).
Even though the executions are identical, the error doesn't occur on every run. I can't understand what it means. Can somebody explain the meaning of this error, and how it's possible that it sometimes happens and sometimes doesn't?
Edit: here is the link to the code: https://github.com/simonerusso97/bfr-clustering-parallelized
Note that it is a clustering algorithm; the repository includes the CSV dataset I'm testing on.
Edit: I searched for missing malloc casts and added them everywhere, and the error changed to this:
[MBP-di-Simone:85691] *** Process received signal ***
[MBP-di-Simone:85691] Signal: Abort trap: 6 (6)
[MBP-di-Simone:85691] Signal code: (0)
[MBP-di-Simone:85691] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:85691] [ 1] 0 ??? 0x00007ff7b74bced0 0x0 + 140701908848336
[MBP-di-Simone:85691] [ 2] 0 libsystem_c.dylib 0x00007ff80f836ca5 abort + 123
[MBP-di-Simone:85691] [ 3] 0 libsystem_malloc.dylib 0x00007ff80f74ca37 malloc_vreport + 888
[MBP-di-Simone:85691] [ 4] 0 libsystem_malloc.dylib 0x00007ff80f761959 malloc_zone_error + 178
[MBP-di-Simone:85691] [ 5] 0 libsystem_malloc.dylib 0x00007ff80f745aa6 small_free_list_remove_ptr_no_clear + 1000
[MBP-di-Simone:85691] [ 6] 0 libsystem_malloc.dylib 0x00007ff80f742f7c free_small + 619
[MBP-di-Simone:85691] [ 7] 0 libmpi.40.dylib 0x0000000108c89115 mca_coll_base_comm_unselect + 11654
[MBP-di-Simone:85691] [ 8] 0 libmpi.40.dylib 0x0000000108bf81aa ompi_comm_destruct + 32
[MBP-di-Simone:85691] [ 9] 0 libmpi.40.dylib 0x0000000108bf9120 ompi_comm_finalize + 232
[MBP-di-Simone:85691] [10] 0 libmpi.40.dylib 0x0000000108c13620 ompi_mpi_finalize + 923
[MBP-di-Simone:85691] [11] 0 main 0x0000000108a4671b main + 5163
[MBP-di-Simone:85691] [12] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:85691] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: MBP-di-Simone.homenet.telecomitalia.it
System call: unlink(2) /var/folders/d6/st4lmq9x55bd2cw12j85xz8m0000gn/T//ompi.MBP-di-Simone.501/pid.85690/1/vader_segment.MBP-di-Simone.501.ae5e0001.1
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[MBP-di-Simone.homenet.telecomitalia.it:85690] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 6 (Abort trap: 6).
alternating with this:
[MBP-di-Simone:86533] *** Process received signal ***
[MBP-di-Simone:86533] Signal: Segmentation fault: 11 (11)
[MBP-di-Simone:86533] Signal code: (0)
[MBP-di-Simone:86533] Failing at address: 0x0
[MBP-di-Simone:86533] [ 0] 0 libsystem_platform.dylib 0x00007ff80f917c1d _sigtramp + 29
[MBP-di-Simone:86533] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
[MBP-di-Simone:86533] [ 2] 0 libmpi.40.dylib 0x000000010f02d1aa ompi_comm_destruct + 32
[MBP-di-Simone:86533] [ 3] 0 libmpi.40.dylib 0x000000010f02e069 ompi_comm_finalize + 49
[MBP-di-Simone:86533] [ 4] 0 libmpi.40.dylib 0x000000010f048620 ompi_mpi_finalize + 923
[MBP-di-Simone:86533] [ 5] 0 main 0x000000010ee7b71b main + 5163
[MBP-di-Simone:86533] [ 6] 0 dyld 0x00007ff80f5bc310 start + 2432
[MBP-di-Simone:86533] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: MBP-di-Simone.homenet.telecomitalia.it
System call: unlink(2) /var/folders/d6/st4lmq9x55bd2cw12j85xz8m0000gn/T//ompi.MBP-di-Simone.501/pid.86532/1/vader_segment.MBP-di-Simone.501.b2e00001.1
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[MBP-di-Simone.homenet.telecomitalia.it:86532] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node MBP-di-Simone exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
The strange thing is that the initial conditions are chosen so that the execution is always the same, yet the error isn't thrown on every run, only sometimes. Reading the error message, it seems to happen in the MPI_Finalize() call, but I can't understand why.
Moreover, on rare occasions it prints this message:
[MBP-di-Simone:86488] *** Process received signal ***
but it doesn't terminate; it just hangs, and I don't think it continues the computation.
Answer 1

Score: 1
The stack trace shows that `MPI_Recv()` failed for an unknown reason, invoked the error handler (which is the default behavior), and the error handler then crashed inside `malloc()`. The names of the functions invoked by `malloc()` strongly suggest some kind of memory corruption rather than the system running out of memory.
Note that the default behavior of Open MPI is to check the parameters of MPI subroutines (e.g. valid communicator, committed datatype, non-negative count, ...), so the `MPI_Recv()` failure could be caused either by invalid parameters or by some internal failure. I would guess the latter, but who knows.
My best bet would be to use `valgrind` and try to spot some memory corruption. You might also try forcing communications over TCP/IP in order to rule out interconnect-specific code paths that could confuse `valgrind`:

mpirun --mca pml ob1 --mca btl tcp,self ... valgrind ...