Introduction to OpenMPI

Fortran and OpenMPI

Introduction

This post is a very short tutorial on OpenMPI[1]. The main goal is to become familiar with the general concepts and terminology used in MPI parallel programming. The code examples in this tutorial are written in Fortran 90.

Framework: First, we have to realize that there are three important elements in parallel programming on cluster systems: the hardware, the software and the topology of the network. This tutorial focuses mainly on clusters built of nodes, where each node does NOT have direct access to the memory of the other nodes (we have distributed memory rather than shared memory). The consequence is that the nodes need to communicate: we have to send and collect information/data from different machines, and that is the reason to use MPI (Message Passing Interface) directives. Note, however, that code written with MPI can also run on a single processor or on a multicore machine.

Definitions

Hardware: Flynn’s Taxonomy

  • SISD: single instruction single data (serial computer)
  • SIMD: single instruction multiple data (vector computer)
  • MISD: multiple instruction single data (uncommon)
  • MIMD: multiple instruction multiple data (parallel computer).

We will use the term SPMD to mean single program multiple data. The idea is a common code run on several computers, where each node processes different data (the data for each node is distributed by that single code). MPI is a realization of the SPMD model implemented as a set of directives. In practical Fortran 90 programming it is a module, loaded with use mpi, that provides the subroutines that manage the SPMD idea. The typical workflow is to write the source code, compile it (in the case of Fortran 90 or C/C++, etc.), and run it on a machine indicating the number of tasks (or processes) to be used (for instance mpirun -np 123 mycode.exe).

Example: A typical example in Atmospheric Physics is a climate model where the atmosphere is divided into cells. A set of physical processes, known as parameterization schemes (at the grid-cell scale), is treated independently for each grid cell, so each node can calculate them independently. Once all these calculations are performed, the results can be shared for those processes that cannot be resolved at the grid-cell scale. As we can see, there are portions of the code that are parallel while other portions are serial, and the computer also has to spend time sharing data/results between the different nodes. More formal definitions are given in the next sections.

  • FLOPS: It is the number of floating point operations per second. It is used to describe the computation capability.
  • Serial time: time to run the code in 1 machine (serial code): \(t_{s}\)
  • Parallel time: time to process the same code with p processors: \(t_{p}\)
  • Speedup factor: \(S(p)=t_{s}/t_{p}\)
  • Work cost: \(W(p)=pt_{p}\)
  • Efficiency: \(E(p)=t_{s}/(pt_{p})=W(1)/W(p)\)
  • Serial fraction: \(f\), the fraction of the code that is always serial (it is always run by only 1 processor).
  • Overhead: \((W(p)-W(1))/W(1)\)
  • Computation time: fraction of time spent running the code (\(t_{c}\))
  • Communication time: fraction of time spent on transmission or communication between nodes and memory (\(t_{t}\))

Note: it is important to maximize the ratio \(t_{c}/t_{t}\), as the worked example below illustrates.
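As a worked example (with made-up numbers, purely for illustration): suppose a code takes \(t_{s}=100\) s on one processor and \(t_{p}=30\) s on \(p=4\) processors. Then \[S(4)=\frac{100}{30}\approx 3.3,\qquad W(4)=4\times 30=120,\qquad E(4)=\frac{100}{120}\approx 0.83,\] and the overhead is \((120-100)/100=0.2\), i.e. 20% of the serial work is spent on the costs of parallelization, such as communication.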

Amdahl’s law

It predicts the parallel performance from the serial performance, for a fixed problem size and serial fraction \(f\). \[t_{p}=ft_{s}+(1-f)\frac{t_{s}}{p}\] \[S(p)=\frac{p}{f(p-1)+1}\]
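As an illustration (with an assumed, not measured, serial fraction): if \(f=0.1\) and \(p=8\), then \[S(8)=\frac{8}{0.1\times 7+1}=\frac{8}{1.7}\approx 4.7,\] so even with 8 processors the speedup stays well below 8, and as \(p\to\infty\) it saturates at \(1/f=10\).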

Gustafson’s law

It estimates the serial performance from the parallel performance, assuming the problem size scales with the number of processors. \[t_{s}=ft_{p}+(1-f)pt_{p}\] \[S(p)=f+(1-f)p\]
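With the same assumed values (\(f=0.1\), \(p=8\)), Gustafson's law gives \[S(8)=0.1+0.9\times 8=7.3,\] much closer to 8, because here the problem size is allowed to grow with the number of processors.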

Network Latency and Load-Balance

This is another important concept: in a parallel program running on a cluster with several nodes connected by a network, the time spent in communication has to be estimated. In general the time needed to send a message depends on the size of the message plus the time to establish the communication (e.g. opening the channel), called latency. Therefore, \(t_{t}=L+m/B\), where \(L\) is the latency, \(B\) is the bandwidth and \(m\) is the size of the message.
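For example (with assumed, order-of-magnitude values): for a latency \(L=1\,\mu\mathrm{s}\), a bandwidth \(B=1\) GB/s and a message of \(m=1\) MB, \[t_{t}=10^{-6}+\frac{10^{6}}{10^{9}}\;\mathrm{s}\approx 1\;\mathrm{ms},\] so for large messages the bandwidth dominates, while for many small messages the latency term dominates.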

In the same sense, it is important to achieve a work balance between the nodes; otherwise the total computation time is constrained by the performance of the slowest node.

Waterfall Model in Scientific Computing

In general, software is developed following the so-called Waterfall Model[2]. In the case of scientific programming we could modify the initial stages of the waterfall model, so that the design stage is replaced by:

Abstraction of the problem >–(1)–> Physical/Mathematical Model

and

Physical/Mathematical Model >–(2)–> Formulation as an algorithm

In transition (2) it is better to know in advance the properties of the hardware and the topology in order to consolidate the design of the software. For this reason, a good prior knowledge of parallel programming methodologies is needed. Here OpenMPI is commented on very briefly.

MPI and OpenMPI

MPI is a standard set of directives that several tools and packages follow (another example is PVM, the Parallel Virtual Machine). Here OpenMPI is used (but there are others, such as MPICH). I will explain the directives by examples; for that I recommend the reader a UNIX/Linux machine, an editor like geany or vim, an installed version of OpenMPI and a Fortran compiler. A standard Linux distribution gives the user an easy way to install all these tools.
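For instance, on a Debian/Ubuntu-style distribution the compiler and OpenMPI can usually be installed with something like the following (package names may differ between distributions):

sudo apt install gfortran openmpi-bin libopenmpi-dev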

The codes here included can be compiled and executed as:

mpif90 fortran90mpi.f90 -o myprogram.exe

mpirun -np 4 myprogram.exe

Fortran Examples

Hello from each node

program main
    use mpi
    implicit none
    integer :: ierr, nproc, iproc
    ! Start the MPI environment
    call MPI_INIT(ierr)
    ! Get the rank (ID) of this process and the total number of processes
    call MPI_COMM_RANK(MPI_COMM_WORLD, iproc, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    print *, 'Hello World! by ', iproc, ' from ', nproc
    ! Shut down the MPI environment
    call MPI_FINALIZE(ierr)
end program main
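When executed with mpirun -np 4, each of the four processes prints its own line. The ordering is not deterministic and the exact spacing depends on the compiler, but the output looks roughly like:

Hello World! by 2 from 4
Hello World! by 0 from 4
Hello World! by 3 from 4
Hello World! by 1 from 4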

Different code for each node

program main
  use mpi
  implicit none
  integer :: num_id, num_procs, ierr
  call MPI_INIT (ierr)
  call MPI_COMM_RANK (MPI_COMM_WORLD, num_id, ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, num_procs, ierr)
  ! Each process branches on its own rank, so different ranks
  ! execute different portions of the same program.
  if ( num_id == 0 ) then
      print *,'Machine ',num_id, 'I am doing task 0'
  elseif ( num_id == 1 ) then
      print *,'Machine ',num_id, 'I am doing task 1'
  elseif ( num_id == 2 ) then
      print *,'Machine ',num_id, 'I am doing task 2'
  else
      print *,'Machine ',num_id, 'I am doing a common task 3'
  endif
  call MPI_FINALIZE(ierr)
end program main

Example with sending and receiving information

Here the design of a code is explained. It shows a possible strategy for programming a specific code by writing, in natural language, the logical tasks that should be done. This example was taken from

http://condor.cc.ku.edu/~grobe/docs/intro-MPI.shtml

program sumvector
 
!! The code has to sum the elements of a vector. The root process acts
!! here as a master that sends parts of the vector to each child process
!! (often called a slave). Master and child processes calculate the
!! partial sum of the portion of the vector assigned to them.
!! The child processes send their partial sums to the master, who
!! calculates the total sum.
      
use mpi
implicit none
integer :: my_id, root_process, ierr, status(MPI_STATUS_SIZE)
integer :: num_procs
root_process = 0
! Now replicate this process to create parallel processes.
  call MPI_INIT (ierr)
! Find out MY process ID, and how many processes were started.
  call MPI_COMM_RANK (MPI_COMM_WORLD, my_id, ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, num_procs, ierr)
  if (my_id .eq. root_process) then
!   I must be the root process, so I will query the user
!   to determine how many numbers to sum. Then I will:
!   1. Initialize a vector,
!   2. Distribute a portion of the vector to each child process,
!   3. Calculate the sum of the values in the segment assigned to me,
!   4. Collect the partial sums from the child processes,
!   5. Print them, add them to the grand sum, and print it.
  else
!   I must be a child process, so I must:
!   1. Receive my vector segment,
!   2. Calculate the sum of my portion of the vector,
!   3. and, finally, send my partial sum to the root process.
  endif
! Stop this process.
  call MPI_FINALIZE(ierr)
end program sumvector
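As a complement, here is a minimal, runnable sketch that fills in the outline above with explicit MPI_SEND/MPI_RECV calls. It is not the exact code from the linked tutorial: the vector length (100), the message tags and the even splitting of the vector among processes are my own assumptions, and the vector length is assumed to be divisible by the number of processes.

program sumvector_mpi
  use mpi
  implicit none
  integer, parameter :: root_process = 0
  integer, parameter :: ntotal = 100      ! assumed total vector length
  integer :: my_id, num_procs, ierr, an_id, i, istart, nseg
  integer :: status(MPI_STATUS_SIZE)
  ! The full vector is only needed on the root process; it is declared
  ! everywhere here just to keep the sketch short.
  real :: vector(ntotal)
  real, allocatable :: segment(:)
  real :: partial_sum, total_sum, recv_sum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)

  ! For simplicity, ntotal is assumed to be divisible by num_procs
  nseg = ntotal / num_procs
  allocate(segment(nseg))

  if (my_id == root_process) then
     ! 1. Initialize the vector
     do i = 1, ntotal
        vector(i) = real(i)
     end do
     ! 2. Distribute one segment to each child process (tag 0)
     do an_id = 1, num_procs - 1
        istart = an_id*nseg + 1
        call MPI_SEND(vector(istart:istart+nseg-1), nseg, MPI_REAL, &
                      an_id, 0, MPI_COMM_WORLD, ierr)
     end do
     ! 3. Sum the segment kept by the root
     total_sum = sum(vector(1:nseg))
     ! 4. Collect the partial sums from the child processes (tag 1)
     do an_id = 1, num_procs - 1
        call MPI_RECV(recv_sum, 1, MPI_REAL, an_id, 1, &
                      MPI_COMM_WORLD, status, ierr)
        total_sum = total_sum + recv_sum
     end do
     ! 5. Print the grand sum (1+2+...+100 = 5050)
     print *, 'Total sum = ', total_sum
  else
     ! 1. Receive my vector segment from the root process
     call MPI_RECV(segment, nseg, MPI_REAL, root_process, 0, &
                   MPI_COMM_WORLD, status, ierr)
     ! 2. Calculate the sum of my portion of the vector
     partial_sum = sum(segment)
     ! 3. Send my partial sum back to the root process
     call MPI_SEND(partial_sum, 1, MPI_REAL, root_process, 1, &
                   MPI_COMM_WORLD, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program sumvector_mpi

Compiled and run as before (mpirun -np 4 myprogram.exe), it should print a total sum of 5050. The same communication pattern could also be written more compactly with collective operations such as MPI_SCATTER and MPI_REDUCE.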
  • [1]: [OpenMPI web](https://www.open-mpi.org/)
  • [2]: [Waterfall Model](http://codeloop.site88.net/se/images/waterfall.jpg)