Introduction to OpenMPI
Introduction
This post is a very short tutorial on OpenMPI. The main goal is to become familiar with the general concepts and terminology used in MPI parallel programming. The code examples in this tutorial are written in Fortran 90.
Framework: First, we have to realize that there are three important elements in parallel programming on cluster systems: the hardware, the software and the topology of the network. This tutorial focuses mainly on clusters built of nodes, where a node does NOT have direct access to the memory of the other nodes (we have distributed memory, not shared memory). The consequence is that the nodes need to communicate: we have to send and collect information/data from the different machines, and that is the reason to use MPI (Message Passing Interface) directives. Note, however, that code written with MPI can also run on a single processor or on a multicore machine.
Definitions
Hardware: Flynn’s Taxonomy
- SISD: single instruction, single data (serial computer)
- SIMD: single instruction, multiple data (vector computer)
- MISD: multiple instruction, single data (uncommon)
- MIMD: multiple instruction, multiple data (parallel computer)
We will use the term SPMD to mean single program, multiple data. The idea is a common code run on several computers, where each node processes different data (the distribution of the data among the nodes must be handled by that single code). MPI is a realization of the SPMD idea through a set of directives. In practical Fortran 90 programming it is a module loaded with use mpi, which implements the subroutines that manage the SPMD scheme. The typical workflow is to write the source code, compile it (in the case of Fortran 90 or C/C++, etc.) and run it indicating the number of tasks (or nodes) to be used (for instance mpirun -np 123 mycode.exe). Complete examples are given in the Fortran Examples section below.
Example: A typical example in atmospheric physics is a climate model where the atmosphere is divided into cells. A set of physical processes, known as parameterization schemes (at grid-cell scale), is treated independently for each grid cell, so each node can calculate them independently. Once all these calculations are performed, the results can be shared for those processes that cannot be resolved at grid-cell scale. As we can see, some portions of the code are parallel while others are serial, and we also realize that the computer has to spend time sharing data/results between the different nodes. More formal definitions are given in the next sections.
Software: Performance related concepts
- FLOPS: the number of floating-point operations per second. It is used to describe computation capability.
- Serial time: time to run the code on 1 machine (serial code): \(t_{s}\)
- Parallel time: time to run the same code with p processors: \(t_{p}\)
- Speedup factor: \(S(p)=t_{s}/t_{p}\)
- Work cost: \(W(p)=pt_{p}\)
- Efficiency: \(E(p)=t_{s}/(pt_{p})=W(1)/W(p)\)
- Serial fraction: \(f\), the fraction of the code that is always serial (it is always run by only 1 processor).
- Overhead: \((W(p)-W(1))/W(1)\)
- Computation time: fraction of time spent running the code (\(t_{c}\))
- Communication time: fraction of time spent in transmission or communication between nodes and memory (\(t_{t}\))
Note: it is important to maximize \(t_{c}/t_{t}\).
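As a small worked example (with made-up timings): if \(t_{s}=100\) s and \(t_{p}=25\) s with \(p=5\) processors, then \[S(5)=\frac{100}{25}=4,\qquad E(5)=\frac{100}{5\cdot 25}=0.8,\] i.e. on average each processor does useful work 80% of the time.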
Amdahl’s law
It measures the parallel performance given the serial performance. \[t_{p}=ft_{s}+(1-f)\frac{t_{s}}{p}\] \[S(p)=\frac{p}{f(p-1)+1}\]
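Note that \[\lim_{p\to\infty}S(p)=\frac{1}{f},\] so the serial fraction bounds the achievable speedup: for example, with \(f=0.1\), even \(p=100\) processors give only \(S(100)=100/(0.1\cdot 99+1)\approx 9.2\).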
Gustafson’s law
It measures the serial performance given the parallel performance. \[t_{s}=ft_{p}+(1-f)pt_{p}\] \[S(p)=f+(1-f)p\]
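With the same numbers (\(f=0.1\), \(p=100\)), Gustafson’s law gives \(S(100)=0.1+0.9\cdot 100=90.1\), much larger than Amdahl’s \(\approx 9.2\). The difference lies in the assumption: here the problem size grows with \(p\), so the serial part is a fixed fraction of the parallel time \(t_{p}\) rather than of the serial time \(t_{s}\).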
Network Latency and Load-Balance
This is another important concept: in a parallel program over a cluster with several nodes communicating through a network, the time spent in communication has to be estimated. In general, the time needed to send a message depends on the size of the message plus the time needed to establish the communication (such as opening the channel), called latency. Therefore \(t_{t}=L+m/B\), where \(L\) is the latency, \(B\) is the bandwidth and \(m\) is the size of the message (note that this is the communication time \(t_{t}\) defined above, not the computation time \(t_{c}\)).
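As an illustration (with assumed values): for \(L=10\ \mu\mathrm{s}\), \(B=1\) GB/s and a message of \(m=1\) MB, \[t_{t}=10\ \mu\mathrm{s}+\frac{1\ \mathrm{MB}}{1\ \mathrm{GB/s}}\approx 1\ \mathrm{ms},\] so the latency only dominates for small messages; this is why it usually pays to group many small messages into fewer large ones.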
In the same sense, it is important to achieve load balance between the nodes; otherwise the computation time is constrained by the performance of the slowest node.
Waterfall Model in Scientific Computing
In general, software is developed following the so-called Waterfall Model. In the case of scientific programming we could modify the initial stages of the waterfall model, so the design stage is replaced by:
Abstraction of the problem >–(1)–> Physical/Mathematical Model
and
Physical/Mathematical Model >–(2)–> Formulation as an algorthim
In transition (2) it is better to know in advance the properties of the hardware and the topology in order to consolidate the design of the software. For these reasons, a good prior knowledge of parallel programming methodologies is needed; here OpenMPI is commented on very briefly.
MPI and OpenMPI
MPI is a standard set of directives that several tools and packages follow (another such system is, for instance, PVM, the Parallel Virtual Machine). Here OpenMPI is explained (but there are others, like MPICH). I will explain the directives by examples; for that I recommend a UNIX/Linux machine, an editor like geany or vim, an installed version of OpenMPI and a Fortran compiler. A standard Linux distribution gives the user an easy way to install all these tools.
The codes included here can be compiled and executed as follows.
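For instance, assuming OpenMPI provides the mpif90 wrapper compiler (newer versions also call it mpifort; mycode.f90 is a placeholder name):

```
mpif90 mycode.f90 -o mycode.exe
mpirun -np 4 ./mycode.exe
```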
Fortran Examples
Hello from each node
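A minimal sketch of this classic example: each task initializes MPI, asks for its rank (its id within the communicator) and the total number of tasks, and prints a greeting. Running it with mpirun -np 4 should print four lines, one per task, in no guaranteed order.

```fortran
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nprocs

  call MPI_Init(ierr)                              ! start the MPI environment
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! rank: id of this task, 0..nprocs-1
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! nprocs: total number of tasks

  print *, 'Hello from node', rank, 'of', nprocs

  call MPI_Finalize(ierr)                          ! shut down MPI
end program hello
```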
Different code for each node
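Since every task runs the same program (SPMD), different behaviour per node is obtained by branching on the rank. A minimal sketch (the master/worker roles here are just an illustration):

```fortran
program branching
  use mpi
  implicit none
  integer :: ierr, rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! same binary everywhere, but each rank takes a different branch
  if (rank == 0) then
     print *, 'rank 0: I coordinate the work'
  else
     print *, 'rank', rank, ': I am a worker'
  end if

  call MPI_Finalize(ierr)
end program branching
```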
Example sending and receiving information
Here the design of a code is explained: it shows a possible strategy to program a specific code by writing down in natural language the logical tasks that should be done. This example was taken from:
http://condor.cc.ku.edu/~grobe/docs/intro-MPI.shtml
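A minimal sketch of such a point-to-point exchange with MPI_Send and MPI_Recv (the value 3.14 and the tag 17 are arbitrary choices; run it with at least two tasks, e.g. mpirun -np 2):

```fortran
program exchange
  use mpi
  implicit none
  integer :: ierr, rank, status(MPI_STATUS_SIZE)
  real    :: x

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     x = 3.14
     ! send 1 element of type MPI_REAL to rank 1 with tag 17
     call MPI_Send(x, 1, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     ! receive 1 real from rank 0, matching the same tag
     call MPI_Recv(x, 1, MPI_REAL, 0, 17, MPI_COMM_WORLD, status, ierr)
     print *, 'rank 1 received', x
  end if

  call MPI_Finalize(ierr)
end program exchange
```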