MPI_Scatterv from Intel MPI (mpiifort) using MPI derived datatypes is much slower (23 times) than flattening the array and scattering. Why could that be?

Question

ifort version: (IFORT) 2021.8.0 20221119
OS: WSL, Ubuntu 20.04 LTS

1. I have a (1000x1000x1000) 3D array to distribute among processes. In the first approach, I flatten the array and then scatter it among the processes; this takes about 7.86 s.

2. In the second approach I use an MPI derived datatype to Scatterv the 3D array, and this takes about 165.34 s. However, a Gatherv of the same data with the same datatypes took only about 14.24 s.

3. What could be the reason for this inconsistency? I expected Scatterv to take about the same time as Gatherv.

Here is the code:

```fortran
program ex_scatterv
  use mpi
  use iso_fortran_env, only : real64
  implicit none

  !allocate arrays
  real(real64), allocatable, dimension(:,:,:) :: array, array_local
  real(real64), allocatable, dimension(:) :: array_flat, array_local_flat
  integer :: rank, num_procs, i, j, k
  integer :: nx, ny, nz, str_idx, end_idx, local_size, local_size_flat
  integer, dimension(:), allocatable :: sendcounts, displacements
  integer :: sizes(3), sub_sizes(3), starts(3), recv_starts(3), recv_sizes(3), &
    send_type, resize_send_type, recv_type, resize_recv_type
  integer(kind=8) :: lb, extent, lb_resize
  real(real64) :: start_time
  integer :: mpierr

  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world, num_procs, mpierr)
  call mpi_comm_rank(mpi_comm_world, rank, mpierr)

  !size of array
  nx=1000
  ny=1000
  nz=1000

  if(rank==0) then
    if(num_procs>nx) then
      print*, "Number of procs should be less than or equal to first dimension of the array"
      call MPI_Abort(mpi_comm_world, 1, mpierr)
    endif
  endif

  start_time=MPI_Wtime()
  !allocate in the root rank
  if(rank==0) then
    allocate(array(nx,ny,nz))
    allocate(array_flat(nx*ny*nz))
  else !for other procs allocate with zero size
    allocate(array(0,0,0))
  endif

  !assign values to the array
  if(rank==0) then
    do k=1,nz
      do j=1,ny
        do i=1,nx
          array(i,j,k) = (i-1)+(j-1)*nx+(k-1)*nx*ny
        end do
      end do
    end do
    !print*, "Before scattering..."
    !print*, array
    !flatten the 3D array
    forall(k=1:nz, j=1:ny, i=1:nx) array_flat(k+(j-1)*nz+(i-1)*ny*nz)=array(i,j,k)
  endif

  !distribute the 3d array among different procs
  call distribute_points(nx, rank, num_procs, str_idx, end_idx)
  local_size = end_idx - str_idx + 1
  local_size_flat = local_size*ny*nz

  !allocate local (for each rank) arrays
  allocate(array_local_flat(local_size_flat))
  allocate(array_local(local_size, ny, nz))

  !allocate sendcounts and displacements arrays for broadcasting
  allocate(sendcounts(num_procs), displacements(num_procs))

  !gather displacements and sendcounts for all ranks
  call MPI_Allgather(str_idx, 1, MPI_INTEGER, displacements, 1, MPI_INTEGER, &
    MPI_COMM_WORLD, mpierr)
  call MPI_Allgather(local_size, 1, MPI_INTEGER, sendcounts, 1, &
    MPI_INTEGER, MPI_COMM_WORLD, mpierr)

  !total sendcounts and displacements
  sendcounts = sendcounts*ny*nz
  displacements = displacements - 1 !Array index starts with 0 in MPI (C)
  displacements = displacements*ny*nz

  !scatter the flattened array among procs
  call MPI_Scatterv(array_flat, sendcounts, displacements, MPI_DOUBLE_PRECISION, &
    array_local_flat, local_size*ny*nz, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, &
    mpierr)

  !form 3D array from flattened local array
  forall(k=1:nz, j=1:ny, i=1:local_size) array_local(i,j,k) = &
    array_local_flat(k+(j-1)*nz+(i-1)*ny*nz)

  !print*, "Scattered array: ", rank
  !print*, array_local
  if(rank==0) then
    print*, "Time taken by flatten and scatter: ", MPI_Wtime()-start_time
  endif

  call MPI_Barrier(mpi_comm_world, mpierr)
  !deallocate(array_flat, array_local_flat)

  start_time=MPI_Wtime()
  !Scattering using subarray type
  sizes = [nx, ny, nz]
  recv_sizes=[local_size, ny, nz]
  sub_sizes = [1, ny, nz]
  starts = [0, 0, 0]
  recv_starts = [0, 0, 0]

  !to get extent of MPI_DOUBLE_PRECISION
  call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent, mpierr)

  !create an mpi subarray data type for sending data
  call MPI_Type_create_subarray(3, sizes, sub_sizes, starts, &
    MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, send_type, mpierr)
  lb_resize=0
  !resize the send subarray for starting at correct location for next send
  call MPI_Type_create_resized(send_type, lb_resize, extent, &
    resize_send_type, mpierr)
  call MPI_Type_commit(resize_send_type, mpierr)

  !create an mpi subarray data type for receiving data
  call MPI_Type_create_subarray(3, recv_sizes, sub_sizes, recv_starts, &
    MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, recv_type, mpierr)

  !resize the receive subarray for starting at correct location for next receive
  call MPI_Type_create_resized(recv_type, lb_resize, extent, &
    resize_recv_type, mpierr)
  call MPI_Type_commit(resize_recv_type, mpierr)

  !sendcounts and displacements for sending and receiving subarrays
  sendcounts=sendcounts/(ny*nz)
  displacements = displacements/(ny*nz)

  if(rank==0) then
    print*, "Time taken for creating MPI type subarrays: ", MPI_Wtime()-start_time
  endif

  call MPI_Barrier(mpi_comm_world, mpierr)
  start_time=MPI_Wtime()
  !scatter the subarrays
  call MPI_Scatterv(array, sendcounts, displacements, resize_send_type, &
    array_local, sendcounts, resize_recv_type, 0, MPI_COMM_WORLD, mpierr)

  if(rank==0) then
    print*, "Time taken for scattering using MPI type subarrays: ", MPI_Wtime()-start_time
  endif
  call MPI_Barrier(mpi_comm_world, mpierr)
  !print the scattered array
  !print*, "Scattered array with subarray: ", rank
  !print*, array_local

  !do some computations on the scattered local arrays
  array_local = array_local+1

  call MPI_Barrier(mpi_comm_world, mpierr)
  start_time=MPI_Wtime()
  !Gather the local arrays to global (array) using the same subarrays
  call MPI_Gatherv(array_local, local_size, resize_recv_type, array, &
    sendcounts, displacements, resize_send_type, 0, MPI_COMM_WORLD, mpierr)

  if(rank==0) then
    print*, "Time taken by MPI_Type_create_subarray Gathering: ", MPI_Wtime()-start_time
  endif

  !if(rank==0) then
  !  print*, "Gathered array: ------------------"
  !  print*, array
  !endif
  call MPI_Finalize(mpierr)

contains

  subroutine distribute_points(npts, rank, size, start_idx, end_idx)
    implicit none

    integer, intent(in) :: npts, size, rank
    integer, intent(out) :: start_idx, end_idx
    integer :: pts_per_proc

    pts_per_proc = npts/size

    if(rank < mod(npts, size)) then
      pts_per_proc = pts_per_proc + 1
    end if

    if(rank < mod(npts, size)) then
      start_idx = rank * pts_per_proc + 1
      end_idx = (rank + 1) * pts_per_proc
    else
      start_idx = mod(npts, size) + rank*pts_per_proc + 1
      end_idx = mod(npts, size) + (rank + 1) * pts_per_proc
    end if

  end subroutine distribute_points

end program ex_scatterv
```

Answer 1

Score: 1

There are many reasons why MPI datatypes can be slower than a user-level pack-and-send operation.

I have explored this in https://arxiv.org/abs/1809.10778

  • Data types, unlike plain buffers, are not streamed straight from memory: they are read, written to a compact buffer, and then read again for sending (see the packing sketch after this list). (If you have expensive network cards they may actually stream straight from memory, but don't count on that.) So there can be a bandwidth penalty on using derived types.
  • MPI may not send the whole message at once, for instance because it is reluctant to create really big internal buffers.
  • If you do your sends in a timing loop, you run into the fact that MPI is stateless, so it will repeatedly allocate and free its internal buffers.
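
To get a feel for the cost of that intermediate copy, the packing step can be timed in isolation with MPI_Pack. The sketch below is my own construction, not part of the question or answer: it rebuilds the question's one-x-plane resized subarray type at an assumed smaller size (200^3 instead of 1000^3, with 50 planes taken as one rank's share) and times packing those strided planes into a contiguous buffer, which roughly corresponds to the copy an implementation may perform internally before sending a non-contiguous derived type.

```fortran
! Sketch only: time MPI_Pack of strided x-planes through the same kind of
! resized subarray type as in the question (sizes here are assumptions).
! Run with a single rank, e.g. mpirun -np 1 ./pack_cost_sketch
program pack_cost_sketch
  use mpi
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nx = 200, ny = 200, nz = 200, nplanes = 50
  real(real64), allocatable :: array(:,:,:), packed(:)
  integer :: plane_type, resized_plane, packsize, position, mpierr
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent
  real(real64) :: t0

  call MPI_Init(mpierr)
  allocate(array(nx, ny, nz))
  array = 0.0_real64

  ! One x-plane of the global array, resized to the extent of a single
  ! double, mirroring the send type constructed in the question.
  call MPI_Type_create_subarray(3, [nx, ny, nz], [1, ny, nz], [0, 0, 0], &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, plane_type, mpierr)
  call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent, mpierr)
  call MPI_Type_create_resized(plane_type, 0_MPI_ADDRESS_KIND, extent, &
       resized_plane, mpierr)
  call MPI_Type_commit(resized_plane, mpierr)

  ! Ask how many bytes the packed representation needs, then pack and time it.
  call MPI_Pack_size(nplanes, resized_plane, MPI_COMM_WORLD, packsize, mpierr)
  allocate(packed(packsize/8 + 1))
  position = 0
  t0 = MPI_Wtime()
  call MPI_Pack(array, nplanes, resized_plane, packed, packsize, position, &
       MPI_COMM_WORLD, mpierr)
  print *, "packing ", nplanes, " strided planes took ", MPI_Wtime() - t0, " s"

  call MPI_Type_free(resized_plane, mpierr)
  call MPI_Finalize(mpierr)
end program pack_cost_sketch
```

Comparing this time against a plain copy of the same number of doubles gives a rough idea of how much of the Scatterv slowdown is attributable to the strided pack alone.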

In your specific case, the irregularity of your data, as pointed out by several commenters, may also make your Scatterv inefficient.
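
As a point of comparison, the strided access disappears entirely if the decomposition can be taken along the last array dimension instead of the first, because array(:,:,k1:k2) is contiguous in Fortran's column-major layout. The sketch below is a hypothetical rearrangement (smaller assumed sizes, my own variable names), not the decomposition used in the question: it block-distributes the nz planes and scatters plain MPI_DOUBLE_PRECISION, with no flattening copy and no derived datatype.

```fortran
! Sketch only: scatter contiguous z-blocks of a 3D array (sizes assumed).
program scatter_contiguous_sketch
  use mpi
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nx = 200, ny = 200, nz = 200
  real(real64), allocatable :: array(:,:,:), array_local(:,:,:)
  integer, allocatable :: sendcounts(:), displs(:)
  integer :: rank, nprocs, base, rem, nz_local, p, mpierr
  real(real64) :: t0

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)

  ! Block-distribute the nz planes; the first `rem` ranks get one extra plane.
  base = nz / nprocs
  rem  = mod(nz, nprocs)
  nz_local = base
  if (rank < rem) nz_local = nz_local + 1

  allocate(sendcounts(nprocs), displs(nprocs))
  do p = 0, nprocs - 1
     sendcounts(p+1) = (base + merge(1, 0, p < rem)) * nx * ny
  end do
  displs(1) = 0
  do p = 2, nprocs
     displs(p) = displs(p-1) + sendcounts(p-1)
  end do

  if (rank == 0) then
     allocate(array(nx, ny, nz))
     array = 1.0_real64
  else
     allocate(array(0, 0, 0))
  end if
  allocate(array_local(nx, ny, nz_local))

  t0 = MPI_Wtime()
  ! Each rank's block array(:,:,k1:k2) is a contiguous run of doubles on
  ! the root, so no packing (manual or internal) is required.
  call MPI_Scatterv(array, sendcounts, displs, MPI_DOUBLE_PRECISION, &
       array_local, nx*ny*nz_local, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, mpierr)
  if (rank == 0) print *, "contiguous Scatterv: ", MPI_Wtime() - t0, " s"

  call MPI_Finalize(mpierr)
end program scatter_contiguous_sketch
```

Whether such a split is usable depends, of course, on which dimension the rest of the application needs to decompose along.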
