Introduction to Parallel Computing

Overview

Teaching: 25 min
Exercises: 20 min
Questions
  • How does MPI work?

Objectives
  • Introduce parallel computing.

  • Use mpicc/mpifort and mpirun.

Parallel Computing

In many fields of science we need more computational power than a single core on a single processor can provide. In essence, parallel computing means using more than one computer (or more than one core) to solve a problem faster.

Naively, using more CPUs (or cores) means that one can solve a problem much faster, in time scales that make sense for research projects or study programs. However, how efficiently a problem can be solved in parallel depends on how the problem is divided among available CPUs (or cores), and the data and memory needs of the algorithm. This determines how much the cores need to communicate with each other while working together. The number of cores we can efficiently use also depends on the problem. You probably have two or four cores on your laptop, and many problems can be run very efficiently on all of those cores. As a researcher, you probably have access to a High-Performance Computing (HPC) system with thousands or hundreds of thousands of cores. To use them all efficiently would be challenging in almost any field.

Also, if we are not careful, running in parallel can sometimes give a wrong result. Consider the sum 1 - 1 + 1 - 1 + 1 … Depending on how the terms are grouped and summed across multiple CPUs (or cores), the final answer can differ. In practice, numerical calculations always involve round-off errors, and because the order of operations in a parallel program can differ from that in a serial program, the result of running a serial program on one CPU can differ from the result of running a parallel program on multiple CPUs.
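
As a quick illustration (not part of the MPI exercises), the following serial C sketch sums the same series of floating-point terms in two different orders. Because floating-point addition is not associative, the two totals will usually differ slightly, which is exactly the kind of effect a different ordering across ranks can produce:

#include <stdio.h>

int main(void) {
  // Sum 1/i for i = 1..1000000, first in increasing and then in decreasing order.
  // Rounding happens differently in the two orders, so the totals can disagree
  // in the last digits.
  float forward = 0.0f, backward = 0.0f;
  int n = 1000000;

  for (int i = 1; i <= n; i++)
    forward += 1.0f / i;
  for (int i = n; i >= 1; i--)
    backward += 1.0f / i;

  printf("forward  = %.7f\n", forward);
  printf("backward = %.7f\n", backward);
  return 0;
}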

During this course you will learn to design parallel algorithms and write parallel programs using the MPI library. MPI stands for Message Passing Interface, and is a low-level, minimal and extremely flexible set of commands for communicating between copies of a program.

Using MPI

Running with mpirun

Let’s get our hands dirty from the start and make sure MPI is installed. Open a bash command line and run the command:

echo Hello World!

Now use mpirun to run the same command:

mpirun -n 4 echo Hello World!

What did mpirun do?

If mpirun is not found, try mpiexec instead. This is another common name for the command.

MPI on HPC

HPC clusters typically have more than one version of MPI available, so you may need to tell the system which one you want to use before you can access it.

Typically, a command like module load mpi will give you access to an MPI library, but ask a helper or consult your HPC facility’s documentation if you’re not sure how to use MPI on a particular cluster.

Note for Cygwin users

If you use Cygwin, you may notice an error message saying “There are not enough slots available in the system”. By default, you will be prevented from running more copies of the program than you have cores to run them on. Add --oversubscribe after mpirun to run more copies than you have cores.
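
For example, with Open MPI's mpirun the earlier command becomes:

mpirun --oversubscribe -n 4 echo Hello World!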

Just running a program with mpirun starts several copies of it. The number of copies is decided by the -n parameter. In the example above, the program does not know it was started by mpirun, and each copy just works as if it were the only one.
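
With -n 4, the greeting is printed once per copy, so you should see something like:

Hello World!
Hello World!
Hello World!
Hello World!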

For the copies to work together, they need to know about their role in the computation. This usually also requires knowing the total number of tasks running at the same time.

To achieve this, the program needs to call the MPI_Init function. This will set up the environment for MPI, and assign a number (called the rank) to each process. At the end, each process should also clean up by calling MPI_Finalize.

int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);

Both MPI_Init and MPI_Finalize return an integer, which describes any error that occurred in the call. Usually we will return the value of MPI_Finalize from the main function.

integer ierr
call MPI_Init(ierr)
call MPI_Finalize(ierr)

Both functions take the parameter ierr, which is overwritten with the error code of any error encountered, or with 0 if there is no error.

In MPI for Python (mpi4py), the initialization and finalization of MPI are handled by the library, and the user can perform MPI calls after from mpi4py import MPI.

from mpi4py import MPI

After MPI is initialized, you can find out the rank of the copy using the MPI_Comm_rank function in C and Fortran, or the Get_rank method in Python:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
integer rank
call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
rank = MPI.COMM_WORLD.Get_rank()

Here’s a more complete example:

 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char** argv) {
   int rank;

   // First call MPI_Init
   MPI_Init(&argc, &argv);
   // Get my rank
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   printf("My rank number is %d\n", rank);

   // Call finalize at the end
   return MPI_Finalize();
 }
program hello

    implicit none
    include "mpif.h"

    integer rank, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
    write(6,*) "My rank number is ", rank
    call MPI_Finalize(ierr)

end
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print("My rank number is", rank)

Fortran Standard

Fortran examples, exercises and solutions in this workshop will conform to the Fortran 2008 standard. The standard can be specified to the compiler in two ways:

  • using the .F08 file extension.
  • adding -std=f2008 on the command line.

Fortran 2008 should be readable to those familiar with an earlier standard.

Fortran provides the use statement (available since Fortran 90), which is a more native way of referring to modules and can replace the include statement borrowed from C. You can try replacing include "mpif.h" with use mpi, which should come before the implicit none statement. However, as some versions of the MPI library do not provide the Fortran module, we will stick with the include statement in the examples.

If you try to compile these examples with your usual C or Fortran compiler, then you will receive an error, since these do not link to the MPI libraries that provide the new functions that we have introduced. Since the set of compiler flags that would be necessary for this tends to be quite long, MPI libraries such as OpenMPI provide compiler wrappers, which set up the correct environment before invoking your compiler of choice. The wrapper for C compilers is usually called mpicc, while the wrapper for Fortran is usually called mpifort. These can be called in exactly the same way as your usual compilers, e.g. gcc and gfortran respectively. Any options not recognised by the MPI wrapper are passed through to the non-MPI compiler.
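
For example, assuming the C program above is saved as hello.c and the Fortran version as hello.F08 (the file names here are just placeholders), a typical session with the wrappers might look like:

mpicc -o hello hello.c
mpifort -o hello hello.F08
mpirun -n 4 ./hello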

Compile and Run

Compile the above C code with mpicc, or the Fortran code with mpifort. Python code does not need compilation. Run the code with mpirun.

Compiling C++

While the examples and exercises in these lessons are written in C, the MPI functions will work with the same syntax in C++. The Boost library also provides a C++ interface to MPI.

In order to compile C++ code with MPI, use mpicxx instead of mpicc.

Communicators

The MPI_COMM_WORLD parameter is an MPI communicator, which was created by MPI_Init. It labels the set of processes we are working with, called ranks, and provides a context for communication between them. You can also create your own communicators, which contain a subset of the MPI ranks. Because of this, most MPI functions require you to specify the communicator you want them to operate on; for this lesson, this will always be MPI_COMM_WORLD.

Notice that we are still only coding for a single copy of the program. This is always true when using MPI. Each copy runs the same code and only differs by its rank number. The best way to think about writing MPI code is to focus on what a single rank needs to be doing. When all ranks are doing their job, the algorithm will work correctly.
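
As an illustrative sketch (not one of the lesson's exercises), the following program shows this single-rank point of view: every copy runs the same code, and only the value of rank decides which branch it takes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Every copy executes this same if statement;
  // only the value of rank differs between copies
  if (rank == 0) {
    printf("I'm rank 0, I could coordinate the others\n");
  } else {
    printf("I'm rank %d, I do my share of the work\n", rank);
  }

  return MPI_Finalize();
}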

Usually the rank will need to know how many other ranks there are. You can find this out using the MPI_Comm_size function in C and Fortran, or the Get_size method in Python.

int n_ranks;
MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
integer n_ranks
call MPI_Comm_size(MPI_COMM_WORLD,n_ranks,ierr)
n_ranks = MPI.COMM_WORLD.Get_size()

Hello World!

Use the MPI library to write a “Hello World” program. Each copy of the program, or rank, should print “Hello World!” followed by its rank number. They should also print the total number of ranks.

Solution

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
  int rank, n_ranks;

  // First call MPI_Init
  MPI_Init(&argc, &argv);
  // Get my rank and the number of ranks
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

  printf("Hello World! I'm rank %d\n", rank);
  printf("Total no. of ranks = %d\n", n_ranks);

  // Call finalize at the end
  return MPI_Finalize();
}

Solution

program hello

    implicit none
    include "mpif.h"
    
    integer rank, n_ranks, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
    call MPI_Comm_size(MPI_COMM_WORLD,n_ranks,ierr)
    write(6,*) "Hello World! I'm rank ", rank
    write(6,*) "Total no. of ranks = ", n_ranks
    call MPI_Finalize(ierr)
end

Solution

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
n_ranks = MPI.COMM_WORLD.Get_size()

print("Hello World! I'm rank", rank);
print("Total no. of ranks =", n_ranks);

Parallel loop

The following serial code contains a loop, and in each iteration some work is performed (in this case, simply printing the iteration variable).

#include <stdio.h>

int main(int argc, char** argv) {
  int numbers = 10;

  for( int i=0; i<numbers; i++ ) {
    printf("I'm printing the number %d.\n", i);
  }

}
program printnumbers
   implicit none
   
   integer numbers, number
   numbers = 10

   do number = 0, numbers - 1
        write(6,*) "I'm printing the number", number
   end do
end
numbers = 10

for number in range(numbers):
    print("I'm printing the number", number)

In the parallelized version below, the iterations of the loop are split among parallel processes, but some parts of the code are missing. Study the code and fill in the blanks so that each iteration is run by only one rank, and each rank has more or less the same amount of work.

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char** argv) {
 int rank, n_ranks, numbers_per_rank;
 int my_first, my_last;
 int numbers = 10;

 // First call MPI_Init
 MPI_Init(&argc, &argv);
 // Get my rank and the number of ranks
 MPI_Comm_rank(MPI_COMM_WORLD,&rank);
 MPI_Comm_size(MPI_COMM_WORLD,&n_ranks);

 // Calculate the number of iterations for each rank
 numbers_per_rank = floor(____/____);
 if( numbers%n_ranks > 0 ){
    // Add 1 in case the number of ranks doesn't divide the number of numbers
    ____ += 1;
 }

 // Figure out the first and the last iteration for this rank
 my_first = ____ * ____;
 my_last = ____ + ____;

 // Run only the part of the loop this rank needs to run
 // The if statement makes sure we don't go over
 for( int i=my_first; i<my_last; i++ ) {
   if( i < numbers ) {
     printf("I'm rank %d and I'm printing the number %d.\n", rank, i);
   }
 }

 // Call finalize at the end
 return MPI_Finalize();
}
program print_numbers

   implicit none
   include "mpif.h"

   integer numbers, number
   integer rank, n_ranks, ierr
   integer numbers_per_rank, my_first, my_last
   numbers = 10

   ! Call MPI_Init
   call MPI_Init(ierr)

   ! Get my rank and the total number of ranks
   call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
   call MPI_Comm_size(MPI_COMM_WORLD,n_ranks,ierr)

   ! Calculate the number of iterations for each rank
   numbers_per_rank = ____/____
   if (MOD(numbers, n_ranks) > 0) then
        ! add 1 in case the number of ranks doesn't divide the number of numbers
        ____ = ____ + 1
   end if

   ! Figure out the first and the last iteration for this rank
   my_first = ____ * ____
   my_last = ____ + ____

   ! Run only the necessary part of the loop, making sure we don't go over
   do number = my_first, my_last - 1
        if (number < numbers) then
             write(6,*) "I'm rank", rank, " and I'm printing the number", number
        end if
   end do


   call MPI_Finalize(ierr)
end
from mpi4py import MPI

numbers = 10

rank = MPI.COMM_WORLD.Get_rank()
n_ranks = MPI.COMM_WORLD.Get_size()

numbers_per_rank = ____ // ____
if numbers % n_ranks > 0:
    ____ += 1

my_first = ____ * ____
my_last = ____ + ____

for i in range(my_first, my_last):
    if i < numbers:
        print("I'm rank {:d} and I'm printing the number {:d}.".format(rank, i))

Solution

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char** argv) {
 int rank, n_ranks, numbers_per_rank;
 int my_first, my_last;
 int numbers = 10;

 // First call MPI_Init
 MPI_Init(&argc, &argv);
 // Get my rank and the number of ranks
 MPI_Comm_rank(MPI_COMM_WORLD,&rank);
 MPI_Comm_size(MPI_COMM_WORLD,&n_ranks);

 // Calculate the number of iterations for each rank
 numbers_per_rank = floor(numbers/n_ranks);
 if( numbers%n_ranks > 0 ){
    // Add 1 in case the number of ranks doesn't divide the number of numbers
    numbers_per_rank += 1;
 }

 // Figure out the first and the last iteration for this rank
 my_first = rank * numbers_per_rank;
 my_last = my_first + numbers_per_rank;

 // Run only the part of the loop this rank needs to run
 // The if statement makes sure we don't go over
 for( int i=my_first; i<my_last; i++ ) {
   if( i < numbers ) {
     printf("I'm rank %d and I'm printing the number %d.\n", rank, i);
   }
 }

 // Call finalize at the end
 return MPI_Finalize();
}

Solution

program print_numbers

   implicit none
   include "mpif.h"

   integer numbers, number
   integer rank, n_ranks, ierr
   integer numbers_per_rank, my_first, my_last
   numbers = 10

   ! Call MPI_Init
   call MPI_Init(ierr)

   ! Get my rank and the total number of ranks
   call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
   call MPI_Comm_size(MPI_COMM_WORLD,n_ranks,ierr)

   ! Calculate the number of iterations for each rank
   numbers_per_rank = numbers/n_ranks
   if (MOD(numbers, n_ranks) > 0) then
        ! add 1 in case the number of ranks doesn't divide the number of numbers
        numbers_per_rank = numbers_per_rank + 1
   end if

   ! Figure out the first and the last iteration for this rank
   my_first = rank * numbers_per_rank
   my_last = my_first + numbers_per_rank

   ! Run only the necessary part of the loop, making sure we don't go over
   do number = my_first, my_last - 1
        if (number < numbers) then
             write(6,*) "I'm rank", rank, " and I'm printing the number", number
        end if
   end do


   call MPI_Finalize(ierr)
end

Solution

from mpi4py import MPI

numbers = 10

rank = MPI.COMM_WORLD.Get_rank()
n_ranks = MPI.COMM_WORLD.Get_size()

numbers_per_rank = numbers // n_ranks
if numbers % n_ranks > 0:
    numbers_per_rank += 1

my_first = rank * numbers_per_rank
my_last = my_first + numbers_per_rank

for i in range(my_first, my_last):
    if i < numbers:
        print("I'm rank {:d} and I'm printing the number {:d}.".format(rank, i))

In lesson 3 you will learn how to communicate between the ranks. From there, you can in principle write general parallel programs. The trick is in designing a working, fast and efficient parallel algorithm for your problem.

Key Points

  • Many problems can be distributed across several processors and solved faster.

  • mpirun runs copies of the program.

  • Each copy of the program is identified by its MPI rank.