OpenMPI with Fortran


Introduction to OpenMPI

Fortran and OpenMPI

scientific-computing MPI fortran


This post is a short tutorial on OpenMPI. The main goal is to be familiar with the general concepts and terminology used on MPI parallel programming. The examples of code shown in this tutorial are Fortran 90.

Framework: First, we have to realize that there are three important elements on parallel programing in cluster systems: hardware, software and the topology of the network. This tutorial is focused mainly on clusters that are build of nodes. This means that each node does NOT have direct access to the other node memory (we have a distributed memory and not common shared memory). The consequence is that we will need to communicate between nodes, therefore send and collect information/data from different machines, that is why it is used MPI directives. However note that the code produced with MPI can be used on a single processor or in a multicore machine.


Flynn’s Taxonomy

  • SISD: single instruction single data (serial computer)
  • SIMD: single instruction multiple data (vector computer)
  • MISD: multiple instruction single data (uncommon)
  • MIMD: multiple instruction multiple data (parallel computer).

We will use the terminology SPMD to mean single program multiple data. The idea is a commom code runned on several computers but each node process different data (the data of each node should be distribuited by the single code). MPI is a realization of the SPMD by implementing several directives. In the practical Fortran 90 programming it is a module called with USE MPI and it implement several subroutines that manage the SPMD idea. The typical workflow is build a source code, compile it (in the case of Fortran 90 or C/C++ etc), and run in a machine indicating the number of tasks (or nodes) that will be used (for instance mpirun -np 123 mycode.exe).

  • FLOPS: It is the number of floating point operations per second. It is used to describe the computation capability.
  • Serial time: time to run the code in 1 machine (serial code): $t_{s}$
  • Parallel time: time to process the same code with p processors: $t_{p}$
  • Speedup factor: $S(p)=t_{s}/t_{p}$
  • Work cost: $W(p)=pt_{p}$
  • Efficiency: $E(p)=t_{s}/(pt_{p})=W(1)/W(p)$
  • Serial fraction: f, is the fraction of the code that is allways serial (it is allways runned by only 1 processor).
  • Overhead: $(Wp-W1)/W1$
  • Computation time: fraction of time running the code ($t_{c}$)
  • Comunication time: fraction of the time performing transmission or comunication between nodes and memory ($t_{t}$)

Note: it is important maximize $t_{c}/t_{t}$

Amdahl’s law

It measures the parallel performance given the serial performance. \[t_{p}=ft_{s}+(1-f)\frac{t_{s}}{p}\] \[S(p)=\frac{p}{f(p-1)+1}\]

Guftafson’s law

It measures the serial performance given the parallel performance. \[t_{s}=ft_{p}+(1-f)pt_{p}\] \[S(p)=f+(1-f)p\]

Network Latency and Load-Balance

This is another important concept: in a parallel program over a cluster with several nodes communicated by a network the time spent in communication have to be estimated. In general the time needed to send a message depends on the size of the message plus the time to establish the communication (like open the channel) called latency. Therefore, $t_{c}=L+m/B$, where L is the latency time, B is the bandwidth and m is the size of the message.

In the same size it is important achieve a work balance between the nodes, otherwise the computation time is constrained by the performance of one node.

Waterfall Model in Scientific Computing

In general the software is developed following the so called Waterfall Model

Waterfall Model

In the case of scientific programming we could modificate the initial stages of the waterfall model then the design is changed by:

Abstraction of the problem >(1)> Physical/Mathematical Model


Physical/Mathematical Model >(2)> Formulation as an algorthim

However in the transition (2) is better to know in advance properties of the hardware and tolopogy to consolidate the design of the software. For these reasons a previous good knownledge of the parallel programming methodologies. Here it is explained OpenMPI

MPI and OpenMPI

MPI is a standard set of directives that several tools and packages can follow (another is for instance PVM, parallel virtual machine). Here it is explained OpenMPI (but there are others like MPICH). I will explain the directives by examples, for that I recommend to the reader a UNIX/LINUX machine, an editor like geany or vim, an installed version of OpenMPI and the fortran compiler. A standard linux distribution gives to the user an easy way to install all this tools.

The codes here included can be compiled and executed as:

mpif90 fortran90mpi.f90 -o myprogram.exe

mpirun -np 4 myprogram.exe

Fortran Examples

Hello from each node

program main
    use mpi
    integer :: ierr, nproc, iproc
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, iproc, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
    print *, 'Hello World! by ',iproc,'from ',nproc
    call MPI_FINALIZE(ierr)
end program main

Different code for each node

program main
  use mpi
  implicit none
  integer num_id, min_process, ierr, status(MPI_STATUS_SIZE)
  integer num_procs
  call MPI_INIT (ierr)
  call MPI_COMM_RANK (MPI_COMM_WORLD, num_id, ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, num_procs, ierr)
  if ( num_id .eq. 0 ) then   
      print *,'Machine ',num_id, 'I am doing task 0'
  elseif (num_id .eq. 1 ) then
      print *,'Machine ',num_id, 'I am doing task 1'
  elseif (num_id .eq. 2 ) then
      print *,'Machine ',num_id, 'I am doing task 2'
      print *,'Machine ',num_id, 'I am doing a common task 3'
  call MPI_FINALIZE(ierr)
end program main

Example with send and recive information

Here is explained the design of a code. It shows a possible stragegy to program a specific code by writting in natural language the logical task that should be done. This example was taken from

program sumvector
!! The code have to sum the rows in a vector. The root process will be
!! here a master that send parts of the vector to each child process
!! (often called slave). Master and child processes calculate
!! the partial sum of the portion of the vector assigned.
!! The child processes send their partial sums to the master, who calculates 
!! the total sum
use mpi
integer num_id, main_process, ierr, status(MPI_STATUS_SIZE)
integer num_procs, an_id
main_process = 0
c Now replicate this process to create parallel processes.  
  call MPI_INIT (ierr)
c find out MY process ID, and how many processes were started.
  call MPI_COMM_RANK (MPI_COMM_WORLD, my_id, ierr)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, num_procs, ierr)
  if (my_id .eq. root_process) then   
c   I must be the root process, so I will query the user
c   to determine how many numbers to sum.
c   1. Initialize a vector,
c   2. Distribute a portion of the vector to each child process,
c   3. Calculate the sum of the values in the segment assigned,
c   4. Collect the partial sums from slave processes,
c   5. Print them, and add them to the grand sum, and print it.
c   1. I must be slave process, so I must receive my vector segment,
c   2. Calculate the sum of my portion of the vector,
c   3. and, finally, send my portion of the sum to the root process.
c   Stop this process.
  call MPI_FINALIZE(ierr)