A Summary of MPI Programming Techniques




Jinan Youquan Software Co., Ltd.

This article collects the basic concepts, program structure, and related topics of MPI programming.

Given the limits of the author's knowledge and depth of study, inaccuracies are unavoidable; criticism and corrections are welcome.

Note 1: The content of this article is updated from time to time.

Note 2: Due to space constraints, some material is not individually attributed; please bear with this.

In high-performance computing, the most widely used message-passing programming models are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). Thanks to its efficient communication, good portability, and rich functionality, MPI has become the de facto industry standard for the message-passing programming model.

MPI is a programming model and standard for message-passing-based distributed parallel computing. Vendors and organizations implement their own MPI packages against this standard; typical implementations include the open-source MPICH, Open MPI, MS-MPI, and DeinoMPI, as well as the closed-source Intel MPI.

0. A Primer on Parallel Programming

Parallel computer: a computer that can execute multiple instructions at the same time.

Parallel programming model: an abstraction of the architecture of a parallel computer.

Parallel computing: solving a single problem cooperatively with multiple processors, i.e., decomposing the problem into parts, each of which is computed in parallel by a separate processor.

Parallel algorithm: the method and steps by which multiple processors cooperate to solve a problem.

1. Getting Started with MPI

1.1 MPI Basics

The MPI programming model has two defining features: (a) process-level parallelism, where the same code is executed by multiple processes, each with its own independent address space, and each process is identified by its rank within a communicator; (b) message passing, where processes communicate by explicitly sending and receiving messages.

Process group: an ordered collection of processes.

A group is an ordered set of process identifiers (henceforth processes); processes are implementation-dependent objects. Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero.

Communication context: describes a communication space; messages in different contexts are independent and do not interfere with one another.

A context is a property of communicators (defined next) that allows partitioning of the communication space. A message sent in one context cannot be received in another context. Contexts are not explicit MPI objects; they appear only as part of the realization of communicators.

Communicator: consists of a process group, a communication context, and related attributes; it describes the communication relationship between processes, i.e., which processes may communicate with one another.

Synchronization: after issuing a synchronizing call, the caller must wait until all callers have reached the synchronization point, which guarantees that everything the participants issued before that point has completed.

Blocking: after issuing a blocking call, the caller must wait until the operation associated with that particular call has completed.

Ordered reception: both blocking and non-blocking communication obey the ordered-reception semantics; for a given sender-receiver pair, a message sent earlier is matched by an earlier matching receive, even if a message sent later arrives first.
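The ordered-reception rule can be seen in a minimal sketch (assuming at least two processes; both messages use the same tag and communicator): the two receives on rank 1 always match the sends in the order they were posted on rank 0.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int first = 1, second = 2;
    if (rank == 0)
    {
        // Two sends to the same destination with the same tag:
        // MPI guarantees they are matched in this order on the receiver.
        MPI_Send(&first,  1, MPI_INT, 1, 99, MPI_COMM_WORLD);
        MPI_Send(&second, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        int a = 0, b = 0;
        MPI_Recv(&a, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d then %d\n", a, b);   // always "1 then 2"
    }
    MPI_Finalize();
    return 0;
}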

1.2 The MPI Network Communication Model

Because MPI abstracts away the details of the underlying network, in most cases there is no need to understand its low-level communication model. Public material on this layer is in fact scarce; one useful reference on the architecture of Open MPI is:

The Architecture of Open Source Applications (Volume 2): Open MPI

Currently Open MPI, MPICH, and Intel MPI support OpenFabrics Interfaces (OFI) for communication over different kinds of underlying network hardware. OFI is a set of network-communication programming interfaces whose main components include libfabric and the OFI providers.

Ref. from Libfabric:

OFI is a framework focused on exporting communication services to applications. OFI is specifically designed to meet the performance and scalability requirements of high-performance computing (HPC) applications running in a tightly coupled network environment. The key components of OFI are application interfaces, provider libraries, kernel services, daemons, and test applications.

Libfabric is a library that defines and exports the user-space API of OFI, and is typically the only software that applications deal with directly. The libfabric's API does not depend on the underlying networking protocols, as well as on the implementation of the particular networking devices, over which it may be implemented. OFI is based on the notion of application centric I/O, meaning that the libfabric library is designed to align fabric services with application needs, providing a tight semantic fit between applications and the underlying fabric hardware. This reduces overall software overhead and improves application efficiency when transmitting or receiving data over a fabric.

1.3 MPI Datatypes

The machines in a compute cluster may differ in hardware and software (different vendors, processors, operating systems, and so on) and may therefore use different data representations. To support such heterogeneous computing, MPI provides predefined datatypes that solve this interoperability problem.

In addition, MPI introduces derived datatypes, which allow a message to consist of data items of different datatypes stored at non-contiguous addresses.


Tab. 1 MPI predefined datatypes

| MPI datatype | FORTRAN type | C type |
|---|---|---|
| MPI_INTEGER | INTEGER | |
| MPI_INTEGER1 | INTEGER*1 | |
| MPI_INTEGER2 | INTEGER*2 | |
| MPI_INTEGER4 | INTEGER*4 | |
| MPI_REAL | REAL | |
| MPI_REAL2 | REAL*2 | |
| MPI_REAL4 | REAL*4 | |
| MPI_REAL8 | REAL*8 | |
| MPI_DOUBLE_PRECISION | DOUBLE PRECISION | |
| MPI_COMPLEX | COMPLEX | |
| MPI_DOUBLE_COMPLEX | DOUBLE COMPLEX | |
| MPI_LOGICAL | LOGICAL | |
| MPI_CHARACTER | CHARACTER(1) | |
| MPI_CHAR | | signed char |
| MPI_UNSIGNED_CHAR | | unsigned char |
| MPI_SHORT | | signed short int |
| MPI_UNSIGNED_SHORT | | unsigned short int |
| MPI_UNSIGNED | | unsigned int |
| MPI_INT | | signed int |
| MPI_UNSIGNED_LONG | | unsigned long int |
| MPI_LONG | | signed long int |
| MPI_LONG_LONG_INT | | long long int |
| MPI_FLOAT | | float |
| MPI_DOUBLE | | double |
| MPI_LONG_DOUBLE | | long double |
| MPI_BYTE | | |
| MPI_PACKED | | |

1.4 Commonly Used API

Although MPI defines a large number of interfaces, only a handful of them are used routinely in practice. The table below lists some of the most commonly used MPI calls.



Tab. 2 Commonly used MPI interfaces

| Interface | Parameters | Description |
|---|---|---|
| MPI_INIT | (none) | Initialize the MPI execution environment |
| MPI_FINALIZE | (none) | Terminates MPI execution environment |
| MPI_ABORT(comm, errorcode) | comm [in] communicator of tasks to abort; errorcode [in] error code to return to invoking environment | Terminates all MPI processes associated with the communicator comm |
| MPI_WTIME | (none) | Returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past |
| MPI_BARRIER(comm) | comm [in] communicator (handle) | Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call |
| MPI_GROUP_INCL(group, n, ranks, newgroup) | group [in] group (handle); n [in] number of elements in array ranks (and size of newgroup) (integer); ranks [in] ranks of processes in group to appear in newgroup (array of integers); newgroup [out] new group derived from above, in the order defined by ranks (handle) | Produces a group by reordering an existing group and taking only listed members |
| MPI_COMM_RANK(comm, rank) | comm [in] communicator (handle); rank [out] rank of the calling process in the group of comm (integer) | Determines the rank of the calling process in the communicator |
| MPI_COMM_SIZE(comm, size) | comm [in] communicator (handle); size [out] number of processes in the group of comm (integer) | Determines the size of the group associated with a communicator |
| MPI_COMM_GROUP(comm, group) | comm [in] communicator (handle); group [out] group in communicator (handle) | Accesses the group associated with given communicator |
| MPI_COMM_CREATE(comm, group, newcomm) | comm [in] communicator (handle); group [in] group, which is a subset of the group of comm (handle); newcomm [out] new communicator (handle) | Creates a new communicator |
| MPI_COMM_SPLIT(comm, color, key, newcomm) | comm [in] communicator (handle); color [in] control of subset assignment (nonnegative integer), processes with the same color are in the same new communicator; key [in] control of rank assignment (integer); newcomm [out] new communicator (handle) | Creates new communicators based on colors and keys |
| MPI_SEND(buf, count, datatype, dest, tag, comm) | buf [in] initial address of send buffer (choice); count [in] number of elements in send buffer (nonnegative integer); datatype [in] datatype of each send buffer element (handle); dest [in] rank of destination (integer); tag [in] message tag (integer); comm [in] communicator (handle) | Performs a blocking send |
| MPI_RECV(buf, count, datatype, source, tag, comm, status) | buf [out] initial address of receive buffer (choice); count [in] maximum number of elements in receive buffer (integer); datatype [in] datatype of each receive buffer element (handle); source [in] rank of source (integer); tag [in] message tag (integer); comm [in] communicator (handle); status [out] status object (Status) | Blocking receive for a message |
| MPI_BCAST(buffer, count, datatype, root, comm) | buffer [in/out] starting address of buffer (choice); count [in] number of entries in buffer (integer); datatype [in] data type of buffer (handle); root [in] rank of broadcast root (integer); comm [in] communicator (handle) | Broadcasts a message from the process with rank "root" to all other processes of the communicator |
| MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) | sendbuf [in] starting address of send buffer (choice); sendcount [in] number of elements in send buffer (integer); sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice, significant only at root); recvcount [in] number of elements for any single receive (integer, significant only at root); recvtype [in] data type of recv buffer elements (significant only at root) (handle); root [in] rank of receiving process (integer); comm [in] communicator (handle) | Gathers together values from a group of processes |
| MPI_GATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm) | sendbuf [in] starting address of send buffer (choice); sendcount [in] number of elements in send buffer (integer); sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice, significant only at root); recvcounts [in] integer array (of length group size) containing the number of elements that are received from each process (significant only at root); displs [in] integer array (of length group size), entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i (significant only at root); recvtype [in] data type of recv buffer elements (significant only at root) (handle); root [in] rank of receiving process (integer); comm [in] communicator (handle) | Gathers into specified locations from all processes in a group |
| MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) | sendbuf [in] address of send buffer (choice, significant only at root); sendcount [in] number of elements sent to each process (integer, significant only at root); sendtype [in] data type of send buffer elements (significant only at root) (handle); recvbuf [out] address of receive buffer (choice); recvcount [in] number of elements in receive buffer (integer); recvtype [in] data type of receive buffer elements (handle); root [in] rank of sending process (integer); comm [in] communicator (handle) | Sends data from one process to all other processes in a communicator |
| MPI_SCATTERV(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm) | sendbuf [in] address of send buffer (choice, significant only at root); sendcounts [in] integer array (of length group size) specifying the number of elements to send to each processor; displs [in] integer array (of length group size), entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i; sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice); recvcount [in] number of elements in receive buffer (integer); recvtype [in] data type of receive buffer elements (handle); root [in] rank of sending process (integer); comm [in] communicator (handle) | Scatters a buffer in parts to all processes in a communicator |
| MPI_TYPE_CREATE_STRUCT(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype) | count [in] number of blocks (integer), also number of entries in arrays array_of_types, array_of_displacements and array_of_blocklengths; array_of_blocklengths [in] number of elements in each block (array of integer); array_of_displacements [in] byte displacement of each block (array of integer); array_of_types [in] type of elements in each block (array of handles to datatype objects); newtype [out] new datatype (handle) | Create an MPI datatype from a general set of datatypes, displacements, and block sizes |
| MPI_GET_ADDRESS(location, address) | location [in] location in caller memory (choice); address [out] address of location (address) | Get the address of a location in memory |
| MPI_TYPE_COMMIT(datatype) | datatype [in] datatype (handle) | Commits the datatype |
| MPI_TYPE_FREE(datatype) | datatype [in] datatype that is freed (handle) | Frees the datatype |

Among these, MPI_INIT, MPI_COMM_RANK, MPI_COMM_SIZE, MPI_SEND, MPI_RECV, and MPI_FINALIZE form the minimal programming subset of MPI.

1.5 An Example MPI Program

This section uses a simple MPI program to illustrate the basic structure of an MPI program.

Note: in both C and Fortran, MPI process ranks within a communicator are numbered starting from 0.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char* argv[])
{
	int rank, size, i;
	int buffer[10];
	MPI_Status status;

	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	if (size < 2)
	{
		printf("Please run with two processes.\n"); fflush(stdout);
		MPI_Finalize();
		return 0;
	}
	if (rank == 0)
	{
		for (i = 0; i < 10; i++)
			buffer[i] = i;
		MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
	}
	if (rank == 1)
	{
		for (i = 0; i < 10; i++)
			buffer[i] = -1;
		MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
		for (i = 0; i < 10; i++)
		{
			if (buffer[i] != i)
				printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
		}
		fflush(stdout);
	}
	MPI_Finalize();
	return 0;
}
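To build and run the example, use the compiler wrapper and launcher supplied by the MPI implementation; with MPICH or Open MPI, for instance, one would typically compile with mpicc and start two processes with "mpiexec -n 2 ./a.out", while MS-MPI users build the project in Visual Studio and launch it with mpiexec.exe (see Appendix A).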

2. Datatypes

To send non-contiguous data (a block composed of elements of different datatypes, or a block whose elements are not adjacent in memory), MPI builds derived datatypes from type maps.

2.1 Type Maps

A type map is a sequence of pairs <basic type, displacement>:

typemap = \{ (t_{0}, d_{0}), (t_{1}, d_{1}), \cdots, (t_{n-1}, d_{n-1}) \}

The basic type describes the datatype of an element, and the displacement gives that element's starting position within the type map; displacements may be negative.

Lower bound of a type map:

lb(typemap) = \min \{ d_{j} \}, \quad 0 \leqslant j \leqslant n-1

Upper bound of a type map:

ub(typemap) = \max \{ d_{j} + sizeof(t_{j}) \}, \quad 0 \leqslant j \leqslant n-1

Extent of a type map:

extent(typemap) = ub(typemap) - lb(typemap) + \epsilon

where the alignment correction \epsilon is the smallest non-negative integer that makes the extent of the new datatype divisible by its alignment requirement.
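As a quick check of these definitions (an illustrative type map, under the common assumption of an 8-byte double with 8-byte alignment): for typemap = {(MPI_CHAR, 0), (MPI_DOUBLE, 8)} we get lb = 0 and ub = 16, \epsilon = 0, and hence extent = 16; for typemap = {(MPI_DOUBLE, 0), (MPI_CHAR, 8)} we get lb = 0 and ub = 9, and \epsilon = 7 is needed to round the extent up to the 8-byte alignment, again giving extent = 16.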

2.2 Pseudo Datatypes

MPI provides two pseudo datatypes, MPI_LB and MPI_UB. They occupy no space and are mainly used to mark the lower and upper bounds of a newly constructed datatype. (Later versions of the standard deprecate these markers in favor of MPI_Type_create_resized.)

2.3 Related API

To simplify the construction of derived datatypes for different needs, MPI provides a rich set of APIs for creating and managing them.


| API | Parameters | Description |
|---|---|---|
| MPI_Type_commit(datatype) | datatype [in] datatype (handle) | Commits the datatype |
| MPI_Type_free(datatype) | datatype [in] datatype that is freed (handle) | Frees the datatype |
| MPI_Type_contiguous(count, oldtype, newtype) | count [in] replication count (nonnegative integer); oldtype [in] old datatype (handle); newtype [out] new datatype (handle) | Creates a contiguous datatype |
| MPI_Type_vector(count, blocklength, stride, oldtype, newtype) | count [in] number of blocks (nonnegative integer); blocklength [in] number of elements in each block (nonnegative integer); stride [in] number of elements between start of each block (integer); oldtype [in] old datatype (handle); newtype [out] new datatype (handle) | Creates a vector (strided) datatype |
| MPI_Type_create_hvector(count, blocklength, stride, oldtype, newtype) | count [in] number of blocks (nonnegative integer); blocklength [in] number of elements in each block (nonnegative integer); stride [in] number of bytes between start of each block (integer); oldtype [in] old datatype (handle); newtype [out] new datatype (handle) | Create a datatype with a constant stride given in bytes |
| MPI_Type_indexed(count, blocklens, indices, old_type, newtype) | count [in] number of blocks, also number of entries in indices and blocklens; blocklens [in] number of elements in each block (array of nonnegative integers); indices [in] displacement of each block in multiples of old_type (array of integers); old_type [in] old datatype (handle); newtype [out] new datatype (handle) | Creates an indexed datatype |
| MPI_Type_create_hindexed(count, blocklengths, displacements, oldtype, newtype) | count [in] number of blocks, also number of entries in displacements and blocklengths (integer); blocklengths [in] number of elements in each block (array of nonnegative integers); displacements [in] byte displacement of each block (array of integer); oldtype [in] old datatype (handle); newtype [out] new datatype (handle) | Create a datatype for an indexed datatype with displacements in bytes |
| MPI_Type_create_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype) | count [in] number of blocks (integer), also number of entries in arrays array_of_types, array_of_displacements and array_of_blocklengths; array_of_blocklengths [in] number of elements in each block (array of integer); array_of_displacements [in] byte displacement of each block (array of integer); array_of_types [in] type of elements in each block (array of handles to datatype objects); newtype [out] new datatype (handle) | Create an MPI datatype from a general set of datatypes, displacements, and block sizes |
| MPI_Type_extent(datatype, extent) | datatype [in] datatype (handle); extent [out] datatype extent (integer) | Returns the extent of a datatype |
| MPI_Type_size(datatype, size) | datatype [in] datatype (handle); size [out] datatype size (integer) | Return the number of bytes occupied by entries in the datatype |
| MPI_Type_create_resized(oldtype, lb, extent, newtype) | oldtype [in] input datatype (handle); lb [in] new lower bound of datatype (integer); extent [in] new extent of datatype (integer); newtype [out] output datatype (handle) | Create a datatype with a new lower bound and extent from an existing datatype |
| MPI_Type_lb(datatype, displacement) | datatype [in] datatype (handle); displacement [out] displacement of lower bound from origin, in bytes (integer) | Returns the lower-bound of a datatype |
| MPI_Type_ub(datatype, displacement) | datatype [in] datatype (handle); displacement [out] displacement of upper bound from origin, in bytes (integer) | Returns the upper bound of a datatype |
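As a minimal sketch of how these calls fit together (the struct Particle and the helper function are made up for illustration), the following builds, commits, and returns a derived datatype describing a C struct, resizing it so that arrays of the struct can be sent with count > 1:

#include <mpi.h>
#include <cstddef>   // offsetof is not needed; addresses are taken with MPI_Get_address

struct Particle { int id; double pos[3]; };

// Build an MPI datatype matching struct Particle.
MPI_Datatype make_particle_type()
{
    Particle p;
    int          blocklens[2] = { 1, 3 };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Aint     base, addr_id, addr_pos, displs[2];

    MPI_Get_address(&p, &base);
    MPI_Get_address(&p.id, &addr_id);
    MPI_Get_address(&p.pos, &addr_pos);
    displs[0] = addr_id - base;
    displs[1] = addr_pos - base;

    MPI_Datatype tmp, particle_type;
    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    // Resize so the extent equals sizeof(Particle); this keeps arrays of
    // Particle correct even when the compiler inserts padding.
    MPI_Type_create_resized(tmp, 0, sizeof(Particle), &particle_type);
    MPI_Type_free(&tmp);
    MPI_Type_commit(&particle_type);
    return particle_type;
}

// Usage: MPI_Send(particles, n, particle_type, dest, tag, MPI_COMM_WORLD);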

3. Point-to-Point Communication

In MPI, point-to-point communication is communication between a pair of processes: one process sends a message and the other receives it. Point-to-point communication is the foundation of all MPI communication.

(Figure: how MPI point-to-point communication works)

Ref. from MPI: A Message-Passing Interface Standard:

Sending and receiving of messages by processes is the basic MPI communication mechanism. The basic point-to-point communication operations are send and receive.

3.1 MPI Messages

An MPI message consists of two parts: an envelope and the data. The envelope identifies the sending and receiving processes and related information; the data part is the payload to be transferred.

(Figure: structure of an MPI message)

Ref. from MPI: A Message-Passing Interface Standard:

The data part of the message consists of a sequence of count values, each of the type indicated by datatype. count may be zero, in which case the data part of the message is empty.

In addition to the data part, messages carry information that can be used to distinguish messages and selectively receive them. This information consists of a fixed number of fields, which we collectively call the message envelope. These fields are source, destination, tag and communicator.

The message source is implicitly determined by the identity of the message sender. The other fields are specified by arguments in the send procedure.

The message destination is specified by the dest argument.

The integer-valued message tag is specified by the tag argument. This integer can be used by the program to distinguish different types of messages.

A communicator specifies the communication context for a communication operation. Each communication context provides a separate "communication universe": messages are always received within the context they were sent, and messages sent in different contexts do not interfere. The communicator also specifies the set of processes that share this communication context. This process group is ordered and processes are identified by their rank within this group.

3.2 Communication Modes

MPI provides four communication modes: standard mode, buffered mode, synchronous mode, and ready mode.

| Communication mode | Send | Receive |
|---|---|---|
| Standard | MPI_SEND | MPI_RECV |
| Buffered | MPI_BSEND | MPI_RECV |
| Synchronous | MPI_SSEND | MPI_RECV |
| Ready | MPI_RSEND | MPI_RECV |

3.3 Related API

MPI point-to-point communication comes in two flavours: blocking and non-blocking.

| Interface | Parameters | Description |
|---|---|---|
| MPI_Send(buf, count, datatype, dest, tag, comm) | buf [in] initial address of send buffer (choice); count [in] number of elements in send buffer (nonnegative integer); datatype [in] datatype of each send buffer element (handle); dest [in] rank of destination (integer); tag [in] message tag (integer); comm [in] communicator (handle) | Performs a blocking send |
| MPI_Recv(buf, count, datatype, source, tag, comm, status) | buf [out] initial address of receive buffer (choice); count [in] maximum number of elements in receive buffer (integer); datatype [in] datatype of each receive buffer element (handle); source [in] rank of source (integer); tag [in] message tag (integer); comm [in] communicator (handle); status [out] status object (Status) | Blocking receive for a message |
| MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status) | sendbuf [in] initial address of send buffer (choice); sendcount [in] number of elements in send buffer (integer); sendtype [in] type of elements in send buffer (handle); dest [in] rank of destination (integer); sendtag [in] send tag (integer); recvbuf [out] initial address of receive buffer (choice); recvcount [in] number of elements in receive buffer (integer); recvtype [in] type of elements in receive buffer (handle); source [in] rank of source (integer); recvtag [in] receive tag (integer); comm [in] communicator (handle); status [out] status object (Status), referring to the receive operation | Sends and receives a message |
| MPI_Bsend(buf, count, datatype, dest, tag, comm) | same parameters as MPI_Send | Basic send with user-provided buffering |
| MPI_Ssend(buf, count, datatype, dest, tag, comm) | same parameters as MPI_Send | Blocking synchronous send |
| MPI_Rsend(buf, count, datatype, dest, tag, comm) | same parameters as MPI_Send | Blocking ready send |
| MPI_Buffer_attach(buffer, size) | buffer [in] initial buffer address (choice); size [in] buffer size, in bytes (integer) | Attaches a user-provided buffer for sending |
| MPI_Buffer_detach(buffer, size) | buffer [out] initial buffer address (choice); size [out] buffer size, in bytes (integer) | Removes an existing buffer (for use with MPI_Bsend etc.); this call blocks until all messages currently in the buffer have been transmitted, after which the user may reuse or deallocate the buffer space |
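A minimal sketch of the buffered mode (the payload and buffer size are illustrative; the function is meant to be called from an initialized MPI program with at least two processes): the sender attaches a user buffer, sends with MPI_Bsend, and detaches the buffer, which waits until the buffered message has actually been transmitted.

#include <mpi.h>
#include <cstdlib>

void buffered_send_example(int rank)
{
    const int payload = 42;
    if (rank == 0)
    {
        // The buffer must hold the data plus MPI's bookkeeping overhead.
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void* buf = std::malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        MPI_Bsend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        // Detach blocks until the buffered message has been transmitted.
        void* oldbuf; int oldsize;
        MPI_Buffer_detach(&oldbuf, &oldsize);
        std::free(oldbuf);
    }
    else if (rank == 1)
    {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}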

4. Collective Communication

Collective communication requires all processes in a process group to participate in the communication, and every process in the group makes the call in exactly the same form.

Ref. from MPI: A Message-Passing Interface Standard:

Collective communication is defined as communication that involves a group or groups of processes. One of the key arguments in a call to a collective routine is a communicator that defines the group or groups of participating processes and provides a context for the operation.


4.1 Overview

Collective-communication APIs can be analysed from three angles: communication, synchronization, and computation.

  • Communication

A collective call performs communication among all processes of the group specified by its communicator argument, and MPI guarantees that collective messages are never confused with point-to-point messages.

One of the key arguments in a call to a collective routine is a communicator that defines the group or groups of participating processes and provides a context for the operation.

Several collective routines such as broadcast and gather have a single originating or receiving process. Such a process is called the root. Some arguments in the collective functions are specified as "significant only at root," and are ignored for all participants except the root.

Collective communication calls may use the same communicators as point-to-point communication; MPI guarantees that messages generated on behalf of collective communication calls will not be confused with messages generated by point-to-point communication. The collective operations do not have a message tag argument.

  • Synchronization

A collective call does not promise that, when it returns, the other processes in the group have completed (or even started) the operation; it only guarantees that the caller's own part of the operation is finished, so the caller may release or reuse its buffers. Collective calls should therefore not be relied upon for synchronization.

Collective operations can (but are not required to) complete as soon as the caller's participation in the collective communication is finished. A blocking operation is complete as soon as the call returns. A nonblocking (immediate) call requires a separate completion call.

The completion of a collective operation indicates that the caller is free to modify locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise implied by the description of the operation). Thus, a collective communication operation may, or may not, have the effect of synchronizing all participating MPI processes. It is dangerous to rely on synchronization side-effects of the collective operations for program correctness. For example, even though a particular implementation may provide a broadcast routine with a side-effect of synchronization, the standard does not require this, and a program that relies on this will not be portable. On the other hand, a correct, portable program must allow for the fact that a collective call may be synchronizing. Though one cannot rely on any synchronization side-effect, one must program so as to allow it.

  • Computation

Collective communication also provides computational operations such as reduction and scan.

Global reduction operations such as sum, max, min, or user-defined functions, where the result is returned to all members of a group, and a variation where the result is returned to only one member.


4.2 Commonly Used API

MPI provides a large number of collective-communication APIs, classified by communication pattern and by the data operation performed.

| Interface | Parameters | Description |
|---|---|---|
| MPI_Barrier(comm) | comm [in] communicator (handle) | Blocks until all processes in the communicator have reached this routine |
| MPI_Bcast(buffer, count, datatype, root, comm) | parameters as listed in Tab. 2 | Broadcasts a message from the process with rank "root" to all other processes of the communicator |
| MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) | parameters as listed in Tab. 2 | Sends data from one process to all other processes in a communicator |
| MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm) | parameters as listed in Tab. 2 | Scatters a buffer in parts to all processes in a communicator |
| MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) | parameters as listed in Tab. 2 | Gathers together values from a group of processes |
| MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm) | parameters as listed in Tab. 2 | Gathers into specified locations from all processes in a group |
| MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm) | sendbuf [in] starting address of send buffer (choice); sendcount [in] number of elements in send buffer (integer); sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice); recvcount [in] number of elements received from any process (integer); recvtype [in] data type of receive buffer elements (handle); comm [in] communicator (handle) | Gathers data from all tasks and distributes the combined data to all tasks |
| MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm) | sendbuf [in] starting address of send buffer (choice); sendcount [in] number of elements in send buffer (integer); sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice); recvcounts [in] integer array (of length group size) containing the number of elements that are received from each process; displs [in] integer array (of length group size), entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i; recvtype [in] data type of receive buffer elements (handle); comm [in] communicator (handle) | Gathers data from all tasks and delivers the combined data to all tasks |
| MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm) | sendbuf [in] starting address of send buffer (choice); sendcount [in] number of elements to send to each process (integer); sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice); recvcount [in] number of elements received from any process (integer); recvtype [in] data type of receive buffer elements (handle); comm [in] communicator (handle) | Sends data from all to all processes |
| MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm) | sendbuf [in] starting address of send buffer (choice); sendcounts [in] integer array equal to the group size specifying the number of elements to send to each processor; sdispls [in] integer array (of length group size), entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j; sendtype [in] data type of send buffer elements (handle); recvbuf [out] address of receive buffer (choice); recvcounts [in] integer array equal to the group size specifying the maximum number of elements that can be received from each processor; rdispls [in] integer array (of length group size), entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i; recvtype [in] data type of receive buffer elements (handle); comm [in] communicator (handle) | Sends data from all to all processes; each process may send a different amount of data and provide displacements for the input and output data |
| MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm) | sendbuf [in] address of send buffer (choice); recvbuf [out] address of receive buffer (choice, significant only at root); count [in] number of elements in send buffer (integer); datatype [in] data type of elements of send buffer (handle); op [in] reduce operation (handle); root [in] rank of root process (integer); comm [in] communicator (handle) | Reduces values on all processes to a single value |
| MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm) | sendbuf [in] starting address of send buffer (choice); recvbuf [out] starting address of receive buffer (choice); count [in] number of elements in send buffer (integer); datatype [in] data type of elements of send buffer (handle); op [in] operation (handle); comm [in] communicator (handle) | Combines values from all processes and distributes the result back to all processes |
| MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm) | sendbuf [in] starting address of send buffer (choice); recvbuf [out] starting address of receive buffer (choice); count [in] number of elements in input buffer (integer); datatype [in] data type of elements of input buffer (handle); op [in] operation (handle); comm [in] communicator (handle) | Computes the scan (partial reductions) of data on a collection of processes |
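A minimal scatter/gather sketch (the chunk size is an illustrative value, and the total element count is assumed to divide evenly among the processes): the root scatters one chunk to every rank, each rank processes its chunk, and the root gathers the results back.

#include <mpi.h>
#include <vector>

void scatter_gather_example()
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 4;                       // elements per process
    std::vector<double> all;                   // significant only at root
    if (rank == 0) all.assign(chunk * size, 1.0);

    std::vector<double> local(chunk);
    MPI_Scatter(all.data(), chunk, MPI_DOUBLE,
                local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (double& v : local) v *= 2.0;          // each rank works on its own chunk

    MPI_Gather(local.data(), chunk, MPI_DOUBLE,
               all.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    // After the gather, 'all' on rank 0 holds every processed chunk in rank order.
}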

4.3 Example: Computing \pi

Let f(x) = \frac{4}{1+x^{2}}. From the definite integral \int_{0}^{1} f(x)\,dx = \int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi, the midpoint rule gives

\pi \approx \frac{1}{N}\sum_{i=1}^{N} f\left( \frac{i-0.5}{N} \right) = \sum_{i=1}^{N} \frac{4N}{N^{2}+\left( i-0.5 \right)^{2}}

The sum parallelizes well, so it is a natural candidate for an MPI implementation.
// pi.cpp : This file contains the 'main' function. Program execution begins and ends there.
//

#include <iostream>
#include <cmath>     // atan, ceil, fabs
#include <mpi.h>

const double PI = 4.0 * atan(1.0);

int main(int argc, char *argv[])
{
    ::MPI_Init(&argc, &argv);
    
    int size;
    ::MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    ::MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int N = 100; 

    // the lb and up for local integral
    int nx_begin = (rank + 0) * (int)std::ceil(1.0 * N / size);
    int nx_end = (rank + 1) * (int)std::ceil(1.0 * N / size);

    if (rank == 0)
    {
        std::cout << "nx_begin = " << nx_begin << ", nx_end = " << nx_end << std::endl;
    }

    double dx = 1.0 / N;
    double sum = 0.0;

    // numerical integral for process 'rank' 
    for (int i = nx_begin; i < nx_end; ++i)
    {
        if (i < N)
        {
            double x = dx * (i + 0.5);
            sum = sum + 4.0 / (1.0 + x * x);
        }
    }

    sum = sum / N;

    ::MPI_Barrier(MPI_COMM_WORLD);

    // sum all integrals
    double pi = 0.0;

    ::MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    //::MPI_Allreduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    // output the result and error; with MPI_Reduce only the root holds the
    // reduced value (use the commented MPI_Allreduce above if every rank
    // should have and print it)
    if (rank == 0)
    {
        std::cout << "Process " << rank << ", pi estimated is " << pi << ", error " << fabs(pi - PI) / PI << std::endl;
    }

    ::MPI_Finalize();
}

5. Process Groups and Communicators

Communicators and process groups are among the most important concepts in MPI and form the core components of MPI communication.

5.1 Communicators

A communicator defines the scope and state of MPI communication. An MPI communicator consists of a communication context, a process group, an optional process topology, and attributes.

After MPI_INIT has been called, the predefined communicators MPI_COMM_WORLD and MPI_COMM_SELF are available, together with the null communicator handle MPI_COMM_NULL.

5.2 Process Groups

A process group is an ordered set of processes. In both Fortran and C/C++, every process in a group has a rank numbered from 0 (MPI_COMM_RANK returns the rank within a communicator's group; MPI_GROUP_RANK returns the rank within a group).

MPI predefines the empty group MPI_GROUP_EMPTY and the null group handle MPI_GROUP_NULL.

5.3 Commonly Used API

| Interface | Parameters | Description |
|---|---|---|
| MPI_Comm_size(comm, size) | comm [in] communicator (handle); size [out] number of processes in the group of comm (integer) | Determines the size of the group associated with a communicator |
| MPI_Comm_rank(comm, rank) | comm [in] communicator (handle); rank [out] rank of the calling process in the group of comm (integer) | Determines the rank of the calling process in the communicator |
| MPI_Comm_create(comm, group, newcomm) | comm [in] communicator (handle); group [in] group, which is a subset of the group of comm (handle); newcomm [out] new communicator (handle) | Creates a new communicator |
| MPI_Comm_split(comm, color, key, newcomm) | comm [in] communicator (handle); color [in] control of subset assignment (nonnegative integer), processes with the same color are in the same new communicator; key [in] control of rank assignment (integer); newcomm [out] new communicator (handle) | Creates new communicators based on colors and keys |
| MPI_Comm_free(comm) | comm [in] communicator to be destroyed (handle) | Marks the communicator object for deallocation |
| MPI_Comm_group(comm, group) | comm [in] communicator (handle); group [out] group in communicator (handle) | Accesses the group associated with given communicator |
| MPI_Group_size(group, size) | group [in] group (handle); size [out] number of processes in the group (integer) | Returns the size of a group |
| MPI_Group_rank(group, rank) | group [in] group (handle); rank [out] rank of the calling process in group, or MPI_UNDEFINED if the process is not a member (integer) | Returns the rank of this process in the given group |
| MPI_Group_incl(group, n, ranks, newgroup) | group [in] group (handle); n [in] number of elements in array ranks (and size of newgroup) (integer); ranks [in] ranks of processes in group to appear in newgroup (array of integers); newgroup [out] new group derived from above, in the order defined by ranks (handle) | Produces a group by reordering an existing group and taking only listed members |
| MPI_Group_excl(group, n, ranks, newgroup) | group [in] group (handle); n [in] number of elements in array ranks (integer); ranks [in] array of integer ranks in group not to appear in newgroup; newgroup [out] new group derived from above, preserving the order defined by group (handle) | Produces a group by reordering an existing group and taking only unlisted members |
| MPI_Group_free(group) | group [in] group to free (handle) | Frees a group |
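A minimal sketch of building a sub-communicator with MPI_Comm_split (the even/odd split is purely illustrative); the group-based route via MPI_Comm_group, MPI_Group_incl, and MPI_Comm_create achieves a similar result.

#include <mpi.h>
#include <cstdio>

void split_example()
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Processes with the same color end up in the same new communicator;
    // 'key' (here the world rank) determines their ordering inside it.
    int color = world_rank % 2;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);
    std::printf("world rank %d -> color %d, sub rank %d of %d\n",
                world_rank, color, sub_rank, sub_size);

    MPI_Comm_free(&sub);
}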

6. Non-blocking Communication

In MPI, a non-blocking communication call can return before the communication operation has completed; the processor can perform other work in the meantime, which allows computation and communication to overlap and can greatly improve program efficiency.

6.1 Non-blocking Communication API

| Communication mode | Send | Receive |
|---|---|---|
| Standard | MPI_ISEND | MPI_IRECV |
| Buffered | MPI_IBSEND | MPI_IRECV |
| Synchronous | MPI_ISSEND | MPI_IRECV |
| Ready | MPI_IRSEND | MPI_IRECV |
| Persistent (repeated non-blocking), standard | MPI_SEND_INIT | MPI_RECV_INIT |
| Persistent, buffered | MPI_BSEND_INIT | MPI_RECV_INIT |
| Persistent, synchronous | MPI_SSEND_INIT | MPI_RECV_INIT |
| Persistent, ready | MPI_RSEND_INIT | MPI_RECV_INIT |

6.2 Non-blocking Request Objects

A non-blocking MPI call returns a request object; through this request the status and completion of the non-blocking communication can be queried.

| Interface | Parameters | Description |
|---|---|---|
| MPI_WAIT(request, status) | request [in] request (handle); status [out] status object (Status), may be MPI_STATUS_IGNORE | Waits for an MPI request to complete |
| MPI_WAITANY(count, array_of_requests, index, status) | count [in] list length (integer); array_of_requests [in/out] array of requests (array of handles); index [out] index of handle for operation that completed (integer), in the range 0 to count-1 (in Fortran, 1 to count); status [out] status object (Status), may be MPI_STATUS_IGNORE | Waits for any specified MPI request to complete |
| MPI_WAITALL(count, array_of_requests, array_of_statuses) | count [in] list length (integer); array_of_requests [in] array of request handles (array of handles); array_of_statuses [out] array of status objects (array of Statuses), may be MPI_STATUSES_IGNORE | Waits for all given MPI requests to complete |
| MPI_WAITSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses) | incount [in] length of array_of_requests (integer); array_of_requests [in] array of requests (array of handles); outcount [out] number of completed requests (integer); array_of_indices [out] array of indices of operations that completed (array of integers); array_of_statuses [out] array of status objects for operations that completed (array of Status), may be MPI_STATUSES_IGNORE | Waits for some given MPI requests to complete |
| MPI_TEST(request, flag, status) | request [in] MPI request (handle); flag [out] true if operation completed (logical); status [out] status object (Status), may be MPI_STATUS_IGNORE | Tests for the completion of a request |
| MPI_TESTANY(count, array_of_requests, index, flag, status) | count [in] list length (integer); array_of_requests [in] array of requests (array of handles); index [out] index of operation that completed, or MPI_UNDEFINED if none completed (integer); flag [out] true if one of the operations is complete (logical); status [out] status object (Status), may be MPI_STATUS_IGNORE | Tests for completion of any previously initiated requests |
| MPI_TESTALL(count, array_of_requests, flag, array_of_statuses) | count [in] lists length (integer); array_of_requests [in] array of requests (array of handles); flag [out] true if all requests have completed, false otherwise (logical); array_of_statuses [out] array of status objects (array of Status), may be MPI_STATUSES_IGNORE | Tests for the completion of all previously initiated requests |
| MPI_TESTSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses) | incount [in] length of array_of_requests (integer); array_of_requests [in] array of requests (array of handles); outcount [out] number of completed requests (integer); array_of_indices [out] array of indices of operations that completed (array of integers); array_of_statuses [out] array of status objects for operations that completed (array of Status), may be MPI_STATUSES_IGNORE | Tests for some given requests to complete |
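A minimal sketch of how request objects are used (the even/odd neighbour pairing is an illustrative assumption and requires an even number of processes): both transfers are started, other work may proceed, and MPI_Waitall completes them.

#include <mpi.h>

// Exchange one integer with a neighbour while other work could run in between.
void nonblocking_exchange(int rank)
{
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    int send_val = rank, recv_val = -1;

    MPI_Request reqs[2];
    MPI_Irecv(&recv_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... computation that does not touch send_val/recv_val can run here ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // recv_val now holds the partner's rank.
}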

7. Process Topologies

A process topology is an optional, additional attribute of an intra-communicator that describes the logical connection pattern of the processes in it. A topology provides a convenient naming mechanism for processes and can assist in mapping processes onto hardware.

Ref. from MPI: A Message-Passing Interface Standard:

A process group in MPI is a collection of n processes. Each process in the group is assigned a rank between 0 and n-1. In many parallel applications a linear ranking of processes does not adequately reflect the logical communication pattern of the processes (which is usually determined by the underlying problem geometry and the numerical algorithm used). Often the processes are arranged in topological patterns such as two- or three-dimensional grids. More generally, the logical process arrangement is described by a graph. In this chapter we will refer to this logical process arrangement as the "virtual topology."

A topology is an extra, optional attribute that one can give to an intra-communicator; topologies cannot be added to intercommunicators. A topology can provide a convenient naming mechanism for the processes of a group (within a communicator), and additionally, may assist the runtime system in mapping the processes onto hardware.

MPI supports three topology types: Cartesian, graph, and distributed graph.

7.1 Cartesian Topologies

A Cartesian topology arranges the processes of a communicator into a regular grid.

7.2 Graph Topologies

The communication pattern of a set of processes can be represented by a graph. The nodes represent processes, and the edges connect processes that communicate with each other.

7.3 Related API

| Interface | Parameters | Description |
|---|---|---|
| MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart) | comm_old [in] input communicator (handle); ndims [in] number of dimensions of cartesian grid (integer); dims [in] integer array of size ndims specifying the number of processes in each dimension; periods [in] logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each dimension; reorder [in] ranking may be reordered (true) or not (false) (logical); comm_cart [out] communicator with new cartesian topology (handle) | Makes a new communicator to which topology information has been attached |
| MPI_Graph_create(comm_old, nnodes, index, edges, reorder, comm_graph) | comm_old [in] input communicator without topology (handle); nnodes [in] number of nodes in graph (integer); index [in] array of integers describing node degrees; edges [in] array of integers describing graph edges; reorder [in] ranking may be reordered (true) or not (false) (logical); comm_graph [out] communicator with graph topology added (handle) | Makes a new communicator to which topology information has been attached |
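A minimal Cartesian-topology sketch (the 2D periodic grid is illustrative; MPI_Dims_create, MPI_Cart_coords, and MPI_Cart_shift are additional standard routines not listed in the table above):

#include <mpi.h>
#include <cstdio>

void cart_example()
{
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);            // pick a balanced 2D grid

    int periods[2] = {1, 1};                   // periodic in both dimensions
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/1, &cart);

    int cart_rank, coords[2];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);   // my grid coordinates

    int left, right;
    MPI_Cart_shift(cart, 1, 1, &left, &right);     // neighbours along dimension 1
    std::printf("cart rank %d at (%d,%d), left %d, right %d\n",
                cart_rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&cart);
}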

8. MPI-2

MPI-2 is the revised version of MPI announced by the MPI Forum in 1997; the original MPI was renamed MPI-1. MPI-2 adds new features such as parallel I/O, remote memory access, and dynamic process management.


8.1 Basic Concepts

Serial I/O: file reads and writes are performed by a single process. The process on the root node issues the I/O request and reads the whole file; other processes obtain the regions they need from the root via message passing. Writing is symmetric: when a process has finished with a data region that must be written back to disk, it first sends it to the root node, which performs the write.

Parallel I/O: each process independently issues its own I/O requests for the regions it needs, with no forwarding through a root; multiple processors can access a shared file, or different files, at the same time.

Blocking I/O: once an I/O operation is started, the call does not return until the request has completed.

Non-blocking I/O: once an I/O operation is started, the call returns without waiting for the request to complete.

Collective I/O: all processes in the communicator must make the same I/O call, although the arguments passed to it may differ.

Non-collective (independent) I/O: a single process in the communicator can issue the I/O call on its own, without the participation of other processes.

Displacement: the starting position of a file view, given as an absolute byte position relative to the beginning of the file.

A file displacement is an absolute byte position relative to the beginning of a file. The displacement defines the location where a view begins. Note that a "file displacement" is distinct from a "typemap displacement."

Elementary datatype (etype): the MPI datatype that defines the smallest unit of file access; it can be any predefined datatype or any committed user-constructed datatype.

An etype (elementary datatype) is the unit of data access and positioning. It can be any MPI predefined or derived datatype. Derived etypes can be constructed using any of the MPI datatype constructor routines, provided all resulting typemap displacements are non-negative and monotonically nondecreasing. Data access is performed in etype units, reading or writing whole data items of type etype. Offsets are expressed as a count of etypes; file pointers point to the beginning of etypes. Depending on context, the term "etype" is used to describe one of three aspects of an elementary datatype: a particular MPI type, a data item of that type, or the extent of that type.

Filetype: defines the distribution of the data to be accessed in the file; a filetype can equal the etype, or be any committed MPI datatype constructed from the etype.

A filetype is the basis for partitioning a file among processes and defines a template for accessing the file. A filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype. In addition, the extent of any hole in the filetype must be a multiple of the etype's extent. The displacements in the typemap of the filetype are not required to be distinct, but they must be non-negative and monotonically nondecreasing.

File view: the set of data in a file that is currently visible and accessible; a view can be thought of as a triple <displacement, etype, filetype>.

A view defines the current set of data visible and accessible from an open file as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by a filetype is repeated, beginning at the displacement, to define the view. The pattern of repetition is defined to be the same pattern that MPI_TYPE_CONTIGUOUS would produce if it were passed the filetype and an arbitrarily large count. Views can be changed by the user during program execution. The default view is a linear byte stream (displacement is zero, etype and filetype equal to MPI_BYTE).


8.2 MPI-2 Parallel I/O

Building on derived datatypes, MPI-2 uses buffering, data filtering, and the merging of accesses to let many processes access a data file concurrently. Concretely, MPI-2 parallel I/O optimizes non-collective I/O with the "data sieving" technique and collective I/O with the "two-phase I/O" technique.

Ref. from Walfredo Cirne:

The most effective approach to ameliorate this problem is parallel I/O. The basic idea is quite simple: If one disk cannot keep the pace of the processor, why not scatter I/O requests on many disks and have them process the requests in parallel?

Three orthogonal aspects characterize the file access operations: positioning, synchronism, and coordination.

Positioning determines which part of the file is read from (or written to) disk and can be explicitly specified on the MPI call, or it can implicitly use the file pointer. In the latter case, there are still two alternatives: each process has its own individual file pointer, but a unique (per file) shared pointer is also available. The effect of multiple calls to shared file pointer routines is defined to behave as if the calls were serialized.

Synchronism defines whether the requested operation is guaranteed to be finished after the call returns.

Coordination settles whether the request is an individual or a collective operation.

The MPI-2 parallel I/O API is large. By positioning, the routines fall into three families: explicit offsets, individual file pointers, and shared file pointers. By synchronism they are either blocking or non-blocking, and by coordination they are either collective or non-collective.

Collective MPI I/O merges the requests of different processes; after merging, the file can be accessed in large, contiguous, regular chunks, which greatly improves I/O performance.



| Positioning | Synchronism | Coordination: individual | Coordination: collective |
|---|---|---|---|
| explicit offsets | blocking | MPI_FILE_READ_AT, MPI_FILE_WRITE_AT | MPI_FILE_READ_AT_ALL, MPI_FILE_WRITE_AT_ALL |
| explicit offsets | nonblocking | MPI_FILE_IREAD_AT, MPI_FILE_IWRITE_AT | MPI_FILE_READ_AT_ALL_BEGIN, MPI_FILE_READ_AT_ALL_END, MPI_FILE_WRITE_AT_ALL_BEGIN, MPI_FILE_WRITE_AT_ALL_END |
| individual file pointers | blocking | MPI_FILE_READ, MPI_FILE_WRITE | MPI_FILE_READ_ALL, MPI_FILE_WRITE_ALL |
| individual file pointers | nonblocking | MPI_FILE_IREAD, MPI_FILE_IWRITE | MPI_FILE_READ_ALL_BEGIN, MPI_FILE_READ_ALL_END, MPI_FILE_WRITE_ALL_BEGIN, MPI_FILE_WRITE_ALL_END |
| shared file pointer | blocking | MPI_FILE_READ_SHARED, MPI_FILE_WRITE_SHARED | MPI_FILE_READ_ORDERED, MPI_FILE_WRITE_ORDERED |
| shared file pointer | nonblocking | MPI_FILE_IREAD_SHARED, MPI_FILE_IWRITE_SHARED | MPI_FILE_READ_ORDERED_BEGIN, MPI_FILE_READ_ORDERED_END, MPI_FILE_WRITE_ORDERED_BEGIN, MPI_FILE_WRITE_ORDERED_END |
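A minimal sketch of independent parallel file access with explicit offsets (the file name and payload are illustrative): every process writes its own rank into a distinct, non-overlapping region of a shared file.

#include <mpi.h>

void parallel_io_example()
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes one int at its own byte offset: no two regions overlap.
    MPI_Offset offset = (MPI_Offset)rank * sizeof(int);
    MPI_File_write_at(fh, offset, &rank, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}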

9. Hardware Considerations

10. An MPI Application: Jacobi Iteration

The Jacobi method is one of the earliest and simplest iterative methods. Its update formula is simple, each iteration requires only one matrix-vector product, and the original matrix A remains unchanged throughout the computation, which makes it comparatively easy to parallelize.

This section uses Jacobi iteration to solve the following linear system and thereby illustrates the relevant points of MPI parallel program development:

\begin{bmatrix} 1 & 2 & -2\\ 1 & 1 & 1\\ 2 & 2 & 1 \end{bmatrix} x = \begin{Bmatrix} 1\\ 1\\ 2 \end{Bmatrix}

10.1 Jacobi Iteration

For a nonsingular linear system Ax = b, split A = D - L - U, where D is the diagonal of A and L and U are its (sign-adjusted) strictly lower and strictly upper triangular parts. Then x = D^{-1}(L+U)x + D^{-1}b, which yields the Jacobi iteration

x^{(k+1)} = B x^{(k)} + g, \qquad B = D^{-1}(L+U), \; g = D^{-1}b

Componentwise,

x_{i}^{(k+1)} = \frac{1}{a_{ii}}\left( b_{i} - \sum_{j=1}^{i-1} a_{ij} x_{j}^{(k)} - \sum_{j=i+1}^{n} a_{ij} x_{j}^{(k)} \right)
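For the 3x3 system above, D is the identity, so the componentwise update reduces to

x_{1}^{(k+1)} = 1 - 2x_{2}^{(k)} + 2x_{3}^{(k)}, \quad x_{2}^{(k+1)} = 1 - x_{1}^{(k)} - x_{3}^{(k)}, \quad x_{3}^{(k+1)} = 2 - 2x_{1}^{(k)} - 2x_{2}^{(k)}

A short calculation shows that the iteration matrix B = I - A satisfies B^{3} = 0 for this example, so starting from x^{(0)} = 0 the iteration reaches the exact solution x = (1, 0, 0)^{T} after three sweeps; this is a convenient sanity check for the program below.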

10.2 Implementation

// jacobi.cpp : This file contains the 'main' function. Program execution begins and ends there.
//

#include <iostream>
#include <cassert>
#include <cstdlib>    // malloc/free
#include <vector>     // counts/displacements for MPI_Allgatherv
#include <algorithm>  // std::min
#include <mpi.h>

struct Matrix
{
public:
    //-Constructor
    Matrix(int m, int n)
        : nRows(m)
        , nCols(n)
    {
        data = (double*)malloc(nRows * nCols * sizeof(double));
    }

    ~Matrix()
    {
        if (data) free(data);
    }

    
    //-Access

    double operator()(const int i, const int j) const
    {
        //check(i, j);
        return data[nCols * i + j];
    }

    double& operator()(const int i, const int j)
    {
        //check(i, j);
        return data[nCols * i + j];
    }

    void check(const int i, const int j)
    {
        assert(i < nRows && j < nCols);
    }

    //-verbose
    void print(const char* desc)
    {
        std::cout << std::endl << "=====================================================" << std::endl;
        std::cout << desc << std::endl;

        std::cout << "data = " << std::endl;
        std::cout << "[" << std::endl;

        for (int i = 0; i < nRows; ++i)
        {
            for (int j = 0; j < nCols; ++j)
            {
                std::cout << "\t" << data[i * nCols + j];
            }
            std::cout << std::endl;
        }

        std::cout << "]"  << std::endl << std::endl;
    }

public:
    double* data;
    int nRows, nCols;
};


int main(int argc, char **argv)
{
    ::MPI_Init(&argc, &argv);

    int size;
    ::MPI_Comm_size(MPI_COMM_WORLD, &size);

    int myid;
    ::MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    int n = 3;

    Matrix A(n, n), b(n, 1), x(n, 1);

    if (myid == 0)
    {
        // Matrix A
        A(0, 0) = 1.0; A(0, 1) = 2.0; A(0, 2) = -2.0;
        A(1, 0) = 1.0; A(1, 1) = 1.0; A(1, 2) = 1.0;
        A(2, 0) = 2.0; A(2, 1) = 2.0; A(2, 2) = 1.0;

        // R.H.S
        b(0, 0) = 1.0;
        b(1, 0) = 1.0;
        b(2, 0) = 2.0;

        // Solution
        x(0, 0) = 0.0;
        x(1, 0) = 0.0;
        x(2, 0) = 0.0;
    }

    ::MPI_Bcast(A.data, A.nRows * A.nCols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    ::MPI_Bcast(b.data, b.nRows * b.nCols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    ::MPI_Bcast(x.data, x.nRows * x.nCols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    //if (myid == 1)
    //{
    //    A.print("Process 1 prints matrix A");
    //    b.print("Process 1 prints matrix b");
    //}

    // Block decomposition: ceil(n/size) rows per process; the last block may be
    // shorter (or empty) when n is not a multiple of size.
    int rows_per = (n + size - 1) / size;
    int row_begin = std::min(n, myid * rows_per);
    int row_end = std::min(n, row_begin + rows_per);

    // Row counts and offsets of every process's block, needed by MPI_Allgatherv.
    std::vector<int> counts(size), displs(size);
    for (int p = 0; p < size; ++p)
    {
        int rb = std::min(n, p * rows_per);
        counts[p] = std::min(n, rb + rows_per) - rb;
        displs[p] = rb;
    }

    int iter = 0;

    //std::cout << "Process " << myid << ": row_begin = " << row_begin << ", row_end = " << row_end << std::endl;

    Matrix x_new(n, 1);   // holds x^{k+1}; x keeps x^{k} during each sweep

    do
    {
        // compute the local rows of x^{k+1} using the Jacobi update
        for (int i = row_begin; i < row_end; ++i)
        {
            x_new(i, 0) = b(i, 0);

            for (int j = 0; j < n; ++j)
            {
                if (j != i)
                    x_new(i, 0) -= A(i, j) * x(j, 0);
            }

            x_new(i, 0) /= A(i, i);
        }

        // Collect every process's block of x^{k+1} and distribute the full vector
        // to all processes (replaces the original MPI_Gather + MPI_Bcast pair and
        // also handles blocks of unequal length).
        ::MPI_Allgatherv(x_new.data + row_begin, row_end - row_begin, MPI_DOUBLE,
                         x.data, counts.data(), displs.data(), MPI_DOUBLE, MPI_COMM_WORLD);
    }
    while (++iter < 10);

    ::MPI_Barrier(MPI_COMM_WORLD);

    if (myid == 0)
    {
        x.print("The solution x is ");
    }

    ::MPI_Finalize();
}

11. Optimizing MPI Programs

Use non-blocking communication to overlap computation with communication, as in the sketch below.
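A minimal sketch of the overlap pattern (the buffer sizes and the even/odd neighbour pairing are illustrative assumptions): post the non-blocking receive and send first, do the work that does not depend on the incoming data, then wait and finish the dependent part.

#include <mpi.h>
#include <cstdio>
#include <vector>

void overlap_example()
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int neighbour = (rank % 2 == 0) ? rank + 1 : rank - 1;

    std::vector<double> halo_out(100, 1.0), halo_in(100, 0.0), interior(10000, 1.0);

    MPI_Request reqs[2];
    MPI_Irecv(halo_in.data(), 100, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out.data(), 100, MPI_DOUBLE, neighbour, 0, MPI_COMM_WORLD, &reqs[1]);

    double sum = 0.0;
    for (double v : interior) sum += v;          // work that needs no remote data

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (double v : halo_in) sum += v;           // work that needs the received halo

    std::printf("rank %d: sum = %f\n", rank, sum);
}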

12. Discussion

Q1. Consider the use cases of MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_BYTE, and MPI_PACKED.

Q2. Analyse the purpose of the lower-bound marker type MPI_LB and the upper-bound marker type MPI_UB.

Q3. Explain how MPI-2 parallel I/O is implemented.

Q4. By file positioning, the MPI-2 parallel I/O API falls into three families: explicit offset, individual file pointer, and shared file pointer. Analyse the scenarios to which each family is suited.

Q5. How do MPI_FILE_READ_AT / MPI_FILE_WRITE_AT in MPI-2 differ from the POSIX C read?

Appendix A: Installing MS-MPI

A.1 Download and Installation

Download msmpisetup.exe and msmpisdk.msi from the Microsoft MPI page and install them in turn.

| Package | Default installation path |
|---|---|
| msmpisetup.exe | C:\Program Files\Microsoft MPI |
| msmpisdk.msi | C:\Program Files (x86)\Microsoft SDKs\MPI |

A.2 Environment Configuration

Create a new C++ console project in Visual Studio 2019 and adjust the project properties accordingly (typically the MS-MPI SDK include and library directories, and linking against msmpi.lib; the original screenshots of the property pages are omitted here).

A.3 Testing

Write the following code:

#include <iostream>

#include "mpi.h"

int main(int argc, char **argv)
{
    ::MPI_Init(&argc, &argv);

    int myid;
    ::MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    std::cout << "Hello World form " << myid << std::endl;

    //std::cout << "Hello World!\n";
    
    ::MPI_Finalize();

    return 0;
}

Build and run the program to verify that the installation works.
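For instance, the built executable can be started from a command prompt with the mpiexec tool installed by msmpisetup.exe, e.g. "mpiexec -n 4 <YourProject>.exe" (the executable name depends on the project); each process should then print a line with its own rank.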

References

Peter Pacheco. An Introduction to Parallel Programming. Elsevier, 2011.

张林波. MPI并行编程讲稿.  中国科学院数学与系统科学研究院, 2006.

都志辉. 高性能计算并行编程技术-MPI并行程序设计. 清华大学出版社, 2001.

张武生. MPI并行程序设计实例教程. 清华大学出版社, 2009.

Walfredo Cirne. On Interfaces to Parallel I/O Subsystems.

徐树方. 数值线性代数. 北京大学出版社, 2013.

Online Resources

  • MPICH
  • MPICH on GitHub: https://github.com/pmodels/mpich
  • OpenMPI: https://www.open-mpi.org/
  • Intel MPI Library: https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html?wapkw=mpi#gs.delzk3
  • DeinoMPI: http://mpi.deino.net/index.htm
  • Microsoft MPI: https://docs.microsoft.com/en-us/message-passing-interface/microsoft-mpi?redirectedfrom=MSDN
  • Libfabric: https://ofiwg.github.io/libfabric/
  • MPI系列: 并行IO性能优化究竟是怎么玩的呢? https://blog.csdn.net/BtB5e6Nsu1g511Eg5XEg/article/details/88084041
  • MPI如何对Lustre/GPFS文件系统优化? https://cloud.tencent.com/developer/news/399820