MPI_Scatterv from Intel MPI (mpiifort) using MPI derived data types is much slower (23 times) than flattening the array and scattering. Why could that be?
Question
ifort version: (IFORT) 2021.8.0 20221119
OS: WSL, Ubuntu 20.04 LTS
1. I have a (1000x1000x1000) 3D array to distribute among processes. In the first approach, I flatten the array and then scatter it among the processes; this takes about 7.86 s.
2. In the second approach, I use MPI derived data types to Scatterv the 3D array, and I noticed that it takes about 165.34 s. However, Gatherv of the same data took only about 14.24 s.
3. What could be the reason for this discrepancy? I expected Scatterv to take a similar time to Gatherv.
Here is the code:
```fortran
program ex_scatterv
use mpi
use iso_fortran_env, only : real64
implicit none
!allocate arrays
real(real64), allocatable,dimension(:,:,:) :: array, array_local
real(real64), allocatable,dimension(:) :: array_flat, array_local_flat
integer :: rank, num_procs, i, j, k
integer :: nx, ny, nz, str_idx, end_idx, local_size, local_size_flat
integer, dimension(:), allocatable :: sendcounts, displacements
integer :: sizes(3), sub_sizes(3), starts(3), recv_starts(3), recv_sizes(3), &
send_type, resize_send_type, recv_type, resize_recv_type
integer(kind=8) :: lb, extent, lb_resize
real(real64) :: start_time
integer :: mpierr
call mpi_init(mpierr)
call mpi_comm_size(mpi_comm_world, num_procs, mpierr)
call mpi_comm_rank(mpi_comm_world, rank, mpierr)
!size of array
nx=1000
ny=1000
nz=1000
if(rank==0) then
if(num_procs>nx) then
print*, "Number of procs should be less than or equal to first dimension of the array"
call MPI_Abort(mpi_comm_world, 1, mpierr)
endif
endif
start_time=MPI_Wtime()
!allocate in the root rank
if(rank==0) then
allocate(array(nx,ny,nz))
allocate(array_flat(nx*ny*nz))
else !for other procs allocate with zero size
allocate(array(0,0,0))
endif
!assign values to the array
if(rank==0) then
do k=1,nz
do j=1,ny
do i=1,nx
array(i,j,k) = (i-1)+(j-1)*nx+(k-1)*nx*ny
end do
end do
end do
!print*, "Before scattering..."
!print*, array
!flatten the 3D array
forall(k=1:nz, j=1:ny, i=1:nx) array_flat(k+(j-1)*nz+(i-1)*ny*nz)=array(i,j,k)
endif
!distribute the 3d array among different procs
call distribute_points(nx, rank, num_procs, str_idx, end_idx)
local_size = end_idx - str_idx + 1
local_size_flat = local_size*ny*nz
!allocate local (per-rank) arrays
allocate(array_local_flat(local_size_flat))
allocate(array_local(local_size, ny, nz))
!allocate sendcounts and displacements arrays for the scatter
allocate(sendcounts(num_procs), displacements(num_procs))
!gather displacements and sendcounts for all ranks
call MPI_Allgather(str_idx, 1, MPI_INTEGER, displacements, 1, MPI_INTEGER, &
MPI_COMM_WORLD, mpierr)
call MPI_Allgather(local_size, 1, MPI_INTEGER, sendcounts, 1, &
MPI_INTEGER, MPI_COMM_WORLD, mpierr)
!total sendcounts and displacements
sendcounts = sendcounts*ny*nz
displacements = displacements - 1 !Array index starts with 0 in MPI (C)
displacements = displacements*ny*nz
!scatter the flattened array among procs
call MPI_Scatterv(array_flat, sendcounts, displacements, MPI_DOUBLE_PRECISION, &
array_local_flat, local_size*ny*nz, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, &
mpierr)
!form 3D array from flattened local array
forall(k=1:nz, j=1:ny, i=1:local_size) array_local(i,j,k) = &
array_local_flat(k+(j-1)*nz+(i-1)*ny*nz)
!print*, "Scattered array: ", rank
!print*, array_local
if(rank==0) then
print*, "Time taken by flatten and scatter: ", MPI_Wtime()-start_time
endif
call MPI_Barrier(mpi_comm_world, mpierr)
!deallocate(array_flat, array_local_flat)
start_time=MPI_Wtime()
!Scattering using subarray types
sizes = [nx, ny, nz]
recv_sizes=[local_size, ny, nz]
sub_sizes = [1, ny, nz]
starts = [0, 0, 0]
recv_starts = [0, 0, 0]
!to get extent of MPI_DOUBLE_PRECISION
call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent, mpierr)
!create a mpi subarray data type for sending data
call MPI_Type_create_subarray(3, sizes, sub_sizes, starts, &
MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, send_type, mpierr)
lb_resize=0
!resize the send subarray so that successive sends start at the correct location
call MPI_Type_create_resized(send_type, lb_resize, extent, &
resize_send_type, mpierr)
call MPI_Type_commit(resize_send_type, mpierr)
!create a mpi subarray data type for receiving data
call MPI_Type_create_subarray(3, recv_sizes, sub_sizes, recv_starts, &
MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, recv_type, mpierr)
!resize the receive subarray so that successive receives start at the correct location
call MPI_Type_create_resized(recv_type, lb_resize, extent, &
resize_recv_type, mpierr)
call MPI_Type_commit(resize_recv_type, mpierr)
!sendcounts and displacements for sending and receiving subarrays
sendcounts=sendcounts/(ny*nz)
displacements = displacements/(ny*nz)
if(rank==0) then
print*, "Time taken for creating MPI type subarrays: ", MPI_Wtime()-start_time
endif
call MPI_Barrier(mpi_comm_world, mpierr)
start_time=MPI_Wtime()
!scatter the subarrays
call MPI_Scatterv(array, sendcounts, displacements, resize_send_type, &
array_local, sendcounts, resize_recv_type, 0, MPI_COMM_WORLD, mpierr)
if(rank==0) then
print*, "Time taken for scattering using MPI type subarrays: ", MPI_Wtime()-start_time
endif
call MPI_Barrier(mpi_comm_world, mpierr)
!print the scattered array
!print*, "Scattered array with subarray: ", rank
!print*, array_local
!do some computations on the scattered local arrays
array_local = array_local+1
call MPI_Barrier(mpi_comm_world, mpierr)
start_time=MPI_Wtime()
!Gather the local arrays to global (array) using the same subarrays
call MPI_Gatherv(array_local, local_size, resize_recv_type, array, &
sendcounts, displacements, resize_send_type, 0, MPI_COMM_WORLD, mpierr)
if(rank==0) then
print*, "Time taken by MPI_Type_create_subarray Gathering: ", MPI_Wtime()-start_time
endif
!if(rank==0) then
! print*, "Gathered array: ------------------"
! print*, array
!endif
call MPI_Finalize(mpierr)
contains
subroutine distribute_points(npts, rank, size, start_idx, end_idx)
implicit none
integer, intent(in) :: npts, size, rank
integer, intent(out) :: start_idx, end_idx
integer :: pts_per_proc
pts_per_proc = npts/size
if(rank < mod(npts, size)) then
pts_per_proc=pts_per_proc + 1
end if
if(rank < mod(npts, size)) then
start_idx = rank * pts_per_proc + 1
end_idx = (rank + 1) * pts_per_proc
else
start_idx = mod(npts, size) + rank*pts_per_proc + 1
end_idx = mod(npts, size) + (rank + 1) * pts_per_proc
end if
end subroutine distribute_points
end program ex_scatterv
```
Answer 1
Score: 1
There are many reasons why MPI datatypes can be slower than a user-level pack-and-send operation.
I have explored this in https://arxiv.org/abs/1809.10778
- Data types, unlike plain buffers, are not streamed straight from memory: they are read, written to a compact buffer, and then again read for sending. (If you have expensive network cards they may actually stream straight from memory, but don't count on that.) So there can be a bandwidth penalty on using derived types.
- MPI may not send the whole message at once, for instance because it is reluctant to create really big internal buffers.
- If you do your sends in a timing loop, you run into the fact that MPI is stateless, so it will repeatedly allocate and free its internal buffers.
In your specific case, the irregularity of your data, as pointed out by several commenters, may also make your Scatterv inefficient.
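For illustration, here is a minimal sketch, not taken from the post, of the user-level pack-and-send alternative referred to above: the root copies each rank's strided slab of the first dimension into one contiguous buffer and then calls a plain MPI_Scatterv with MPI_DOUBLE_PRECISION, so the library never has to pack a strided derived type itself. The array sizes, the helper names (packed, displs), and the simple block decomposition are illustrative assumptions; otherwise the variable naming follows the question's code.

```fortran
program pack_then_scatter
  use mpi
  use iso_fortran_env, only: real64
  implicit none
  real(real64), allocatable :: array(:,:,:), packed(:), array_local(:,:,:)
  integer, allocatable :: sendcounts(:), displs(:)
  integer :: rank, num_procs, mpierr, nx, ny, nz, p, base, rem, n, i0, local_size

  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world, num_procs, mpierr)
  call mpi_comm_rank(mpi_comm_world, rank, mpierr)

  nx = 100; ny = 100; nz = 100          ! small sizes for the sketch
  allocate(sendcounts(num_procs), displs(num_procs))

  ! block decomposition of the first dimension (remainder goes to the low ranks)
  base = nx/num_procs
  rem  = mod(nx, num_procs)
  do p = 1, num_procs
     n = base
     if (p <= rem) n = n + 1
     sendcounts(p) = n*ny*nz
  end do
  displs(1) = 0
  do p = 2, num_procs
     displs(p) = displs(p-1) + sendcounts(p-1)
  end do
  local_size = sendcounts(rank+1)/(ny*nz)
  allocate(array_local(local_size, ny, nz))

  if (rank == 0) then
     allocate(array(nx, ny, nz), packed(nx*ny*nz))
     array = 1.0_real64
     ! user-level packing: each rank's strided slab of the first dimension
     ! is copied into a contiguous region of the send buffer
     i0 = 1
     do p = 1, num_procs
        n = sendcounts(p)/(ny*nz)
        packed(displs(p)+1 : displs(p)+sendcounts(p)) = &
             reshape(array(i0:i0+n-1, :, :), [sendcounts(p)])
        i0 = i0 + n
     end do
  else
     allocate(packed(0))                ! non-root ranks only need a dummy send buffer
  end if

  ! one contiguous Scatterv with the basic datatype; the order of the packed
  ! buffer matches the column-major layout of array_local(local_size,ny,nz)
  call MPI_Scatterv(packed, sendcounts, displs, MPI_DOUBLE_PRECISION, &
       array_local, local_size*ny*nz, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, mpierr)

  if (rank == 0) print *, "scatter of packed buffer finished"
  call mpi_finalize(mpierr)
end program pack_then_scatter
```

This is essentially the questioner's first (flattening) approach: the explicit copy on the root performs the packing that the MPI library would otherwise have to do internally for the resized subarray type.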