Recipe: Obtaining peak VM size in pure Fortran

Often in High Performance Computing one needs to know the various memory metrics of a given program, with the peak memory usage probably being the most important one. While the getrusage(2) syscall provides some of that information, its use in Fortran programs is far from optimal and many metrics are not exposed by it at all.

On Linux one could simply parse the /proc/PID/status file. Being a plain text file, it can easily be processed entirely with built-in Fortran machinery, as shown in the following recipe:

vmpeak.f90

program test
  implicit none
  integer :: vmpeak

  call get_vmpeak(vmpeak)
  print *, 'Peak VM size: ', vmpeak, ' kB'
end program test

!---------------------------------------------------------------!
! Returns current process' peak virtual memory size             !
! Requires Linux procfs mounted at /proc                        !
!---------------------------------------------------------------!
! Output: peak - peak VM size in kB                             !
!---------------------------------------------------------------!
subroutine get_vmpeak(peak)
  implicit none
  integer, intent(out) :: peak
  character(len=80) :: stat_key, stat_value
  !
  peak = 0
  open(unit=1000, file='/proc/self/status', status='old', err=99)
  do
    read(unit=1000, fmt=*, end=88, err=88) stat_key, stat_value
    if (stat_key == 'VmPeak:') then
      read(stat_value, fmt=*) peak   ! internal read of the numeric value
      exit
    end if
  end do
88 close(unit=1000)
  if (peak == 0) goto 99
  return
  !
99 print *, 'ERROR: procfs not mounted or not compatible'
  peak = -1
end subroutine get_vmpeak

The code accesses the status file of the calling process, /proc/self/status. The unit number is hard-coded, which could present problems in some cases. Compilers supporting Fortran 2008 provide the NEWUNIT specifier and the following code could be used instead:

integer :: unitno

open(newunit=unitno, file='/proc/self/status', status='old', err=99)
! ...
close(unit=unitno)

With older compilers the same functionality can be simulated by probing unit numbers with INQUIRE until one that is not connected is found. The following sketch illustrates the idea; the range of probed unit numbers is an arbitrary choice:
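
integer :: unitno
logical :: in_use

! probe unit numbers until an unconnected one is found; the probed
! range (100 to 1000) is an arbitrary choice for this sketch
do unitno = 100, 1000
  inquire(unit=unitno, opened=in_use)
  if (.not. in_use) exit
end do

open(unit=unitno, file='/proc/self/status', status='old', err=99)
! ...
close(unit=unitno)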

MPI programming basics

Embracing the current developments in educational technology, the IT Center of RWTH Aachen University (formerly the Center for Computing and Communication) has made available online the audio recordings of most tutorials delivered during this year's PPCES seminar. Participation in PPCES is free of charge and course materials are available online, but this is the first time that proper audio recordings were taken.

All videos (presentation slides + audio) are available on the PPCES YouTube channel under a Creative Commons Attribution license. Course materials are available in the PPCES 2014 archive under an unclear (read: do not steal blatantly) license.

My own contribution to PPCES - as usual - consists of:

  • Message passing with MPI, part 1: Basic concepts and point-to-point communication

  • Message passing with MPI, part 2: Collective operations and often-used patterns

  • Tracing and profiling MPI applications with VampirTrace and Vampir

Big thanks to all the people who made recording and publishing the sessions possible.

Linear congruency considered harmful

Recently I stumbled upon this Stack Overflow question. The question author was puzzled as to why he did not see any improvement in the resultant value of \(\pi\), approximated using a parallel implementation of the well-known Monte Carlo method, when he increased the number of OpenMP threads. His expectation was that, since the number of Monte Carlo trials that each thread performs was kept constant, adding more threads would linearly increase the sample size and therefore improve the precision of the approximation. He did not observe such an improvement and blamed it on possible data races, although all proper locks were in place. The question seems to be related to an assignment that he got at his university. What strikes me is the part of the assignment which requires that he use a specific linear congruential pseudo-random number generator (LCPRNG for short). In his case, a terrible LCPRNG.

An inherent problem with all algorithmic pseudo-random number generators is that they are deterministic and only mimic randomness, since each new output is a well-defined function of the previous output(s) (thus the pseudo- prefix). The more previous outputs the generator function takes into account, the better the "randomness" of the output sequence can be made. Since the internal state can only be of finite length, every now and then the generator function maps the current state to one of the previous ones. At that point the generator starts repeating the same output sequence again and again. The length of the unique part of the sequence is called the cycle length of the generator. The longer the cycle length, the better the PRNG.

Linear congruency is the worst method for generating pseudo-random numbers. The only reason it is still used is that it is extremely easy to implement, takes a very small amount of memory, and works acceptably well in some cases if the parameters are chosen wisely. It's just that Monte Carlo simulations are rarely among those cases. So what is the problem with LCPRNGs? The problem is that each output depends solely on the previous one, as the congruential relation is

\begin{equation*} p_{i+1} \equiv (A \cdot p_i + B)\,(mod\,C), \end{equation*}

where \(A\), \(B\) and \(C\) are constants. If the initial state (the seed of the generator) is \(p_0\), then the i-th output is the result of \(i\) applications of the generator function \(f\) to the initial state, \(p_i = f^i(p_0)\). When it happens that an output repeats the initial state, i.e. \(p_N = p_0\) for some \(N > 0\), the generator loops since

\begin{equation*} p_{N+i} = f^{N+i}(p_0) = f^i(f^N(p_0)) = f^i(p_N) = f^i(p_0) = p_i. \end{equation*}

As is also true of human society, a short memory leads to history repeating itself in (relatively short) cycles.

The generator from the question uses \(C = 741025\) and therefore produces pseudo-random numbers in the range \([0, 741024]\). For each test point two numbers are sampled consecutively from the output sequence, therefore a total of \(C^2\), or about 550 billion, points are possible. Right? Wrong! The choice of parameters results in this particular LCPRNG having a cycle length of 49400, which is orders of magnitude shorter than that of the ANSI C rand() generator, which is itself considered bad. Since the cycle length is even, once the sequence folds over, the same set of 24700 points is repeated over and over again. The unique sequence covers \(49400/C\), or about 6.7%, of the output range (which is already quite small).
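
The cycle length of a given parameter set is easy to check empirically by recording the step at which each generator state is first visited. The sketch below takes only the modulus 741025 from the question; the multiplier, the increment and the seed are arbitrary placeholders, so the reported period will not be the 49400 discussed above:

program lcg_cycle
  implicit none
  integer, parameter :: ik = selected_int_kind(12)   ! 64-bit to avoid overflow in a*p
  integer(ik), parameter :: c = 741025_ik            ! modulus from the question
  ! arbitrary placeholder constants, NOT the ones from the question
  integer(ik), parameter :: a = 2531_ik, b = 7_ik
  integer(ik), allocatable :: first_seen(:)
  integer(ik) :: p, step

  allocate(first_seen(0:c-1))
  first_seen = -1_ik

  p = 12345_ik                          ! arbitrary seed
  step = 0_ik
  do
    if (first_seen(p) >= 0_ik) exit     ! state revisited: the cycle has closed
    first_seen(p) = step
    p = mod(a*p + b, c)
    step = step + 1_ik
  end do

  print *, 'lead-in (tail) length:', first_seen(p)
  print *, 'cycle length         :', step - first_seen(p)
end program lcg_cycle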

A central problem in Monte Carlo simulations is the so-called ergodicity, or the ability of the simulated system to pass through all possible states. Because of the looping character of the LCPRNG and its very short cycle length, there are many states that remain unvisited and therefore the simulation exhibits really bad ergodicity. Not only this, but the output space is partitioned into 16 (\(\lceil C/49400\rceil\)) disjoint sets and only 16 essentially distinct initial values (seeds) are possible. Therefore only 32 different sets of points can be drawn from that generator (why 32 and not 16 is left as an exercise to the reader).

How does this relate to the bad approximation of \(\pi\)? The method used in the question is a geometric approximation based on the idea that if a set of points \(\{ P_i \}\) is drawn randomly and uniformly from \([0, 1) \times [0, 1)\), the probability that such a point lies inside a unit circle centred at the origin of the coordinate system is \(\frac{\pi}{4}\). Therefore:

\begin{equation*} \pi \approx 4\frac{\sum_{i=1}^N \theta{}(P_i)}{N}, \end{equation*}

where \(\theta{}(P_i)\) is an indicator function that has a value of 1 for all points \(\{ P(x,y): x^2+y^2 \leq 1\}\) and 0 for all other points, and \(N\) is the number of trials. It is well known that the error of the approximation is proportional to \(1/\sqrt{N}\) and therefore more trials give better results. The problem in this case is that, due to the looping nature of the LCPRNG, the sum in the numerator is simply \(m \times S_0\), where \(S_0 = \sum_{i=1}^{24700} \theta(P_i)\). For large \(N\) we have \(m \approx N/24700\) and therefore the approximation is stuck at the value of:

\begin{equation*} \tilde{\pi} = 4 \frac{\sum_{i=1}^{24700} \theta(P_i)}{24700}. \end{equation*}

It doesn't matter if one samples 24700 points or 247000000 points. The result is going to be the same, and the precision in the latter case is not going to be 100 times better but rather exactly the same as in the former case, with 9999 times the computational resources of the former case effectively wasted.
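
For reference, the estimator itself takes only a few lines with a decent generator. Here is a minimal serial sketch built around the random_number intrinsic rather than the LCPRNG from the question; the number of trials is an arbitrary choice:

program pi_mc
  implicit none
  integer, parameter :: n = 10000000   ! arbitrary number of trials
  integer :: i, hits
  double precision :: x, y

  call random_seed()
  hits = 0
  do i = 1, n
    call random_number(x)              ! x, y uniform in [0, 1)
    call random_number(y)
    if (x*x + y*y <= 1.0d0) hits = hits + 1
  end do

  print *, 'pi ~', 4.0d0 * hits / n
end program pi_mc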

Adding more threads could improve the precision if:

  • each thread has its own PRNG, i.e. the generator state is thread-private and not globally shared, and
  • the seed in each thread is chosen carefully so as not to reproduce some other thread's generator output.

It was already shown that there are at most 32 unique sets of points, therefore it only makes sense to use up to 32 threads, with an expected 5.7-fold increase in the precision of the approximation (less than one decimal digit).

This leaves me scratching my head: was his docent grossly incompetent, or did he deliberately give him an exercise with such a bad PRNG so that he could learn how easily beautiful Monte Carlo methods are spoiled by bad pseudo-random generators?

It should be noted that having a cyclic PRNG is not necessarily a bad thing. Even if two different seed values result in the same unique sequence, they usually start the generator output at different positions in the sequence. And if the sample size is small relative to the cycle length (or, equivalently, the cycle length is huge relative to the sample size), it would appear as if two independent sequences are being sampled. Not in this case though.

Some final words. Never use linear congruential PRNGs for Monte Carlo simulations! Ne-ver! Use something like the Mersenne Twister MT19937 instead. Also don't try to reinvent RANDU, with all its ill consequences for simulation science. Thank you!

MPI Trace Art

The internal workings of most MPI libraries are considered black magic by many. Indeed, the PMPI profiling interface, used by virtually all portable tracing and profiling libraries, treats all collective operations as black boxes: one only sees coloured polygons in the trace visualisation, and all messages sent between the processes in order to implement a given collective operation remain hidden. Fortunately, the most widely used general-purpose MPI implementations come with openly accessible source code, and one can easily reimplement the collective algorithms using regular point-to-point MPI operations in order to trace them and analyse their performance. Some nice graphics could also be produced as a by-product. Surprisingly, it turns out that some of those traces resemble modern art. And thus MPI trace art is born.

Motivated by some unusual behaviour of the MPI broadcast operation in Open MPI on RWTH's compute cluster (unusually long completion time given a certain "magic" number of MPI processes), I reimplemented some of the broadcast algorithms from the tuned module of the coll framework and traced the result with VampirTrace. tuned is currently the collective communications module that gets selected for most cluster jobs unless one intervenes in the module selection process. It implements several different algorithms and selects between them using empirically derived heuristic logic, unless a special file with dynamic rules has been provided. Here is what the default algorithm for broadcasting large messages to a large number of processes looks like in Vampir:

MPI_Bcast with segmented pipelining

Each message is split into many segments of equal size (except the last one, which could be shorter) and then a pipeline is built: the root rank sends to the next rank, which sends to the rank after it, and so on. The change of slope between ranks 11 and 12 is a clear sign of inter-node communication with different latency and/or bandwidth. Since the InfiniBand network has higher latency than the shared memory used for intra-node messaging, a build-up of messages is observed, which leads to the compression of the message lines the further one goes in time. Another such slope change is present between ranks 23 and 24, but it does not lead to another bunching of message lines, as they have already been spread out while crossing between ranks 11 and 12. The narrow polygon on the left side is an MPI_Barrier collective call and is an example of how opaque the MPI collectives are when seen through the PMPI interface.

Note the overall peaceful feeling streaming from the communication structure; it almost looks like laminar fluid flow. No surprise that this is the best-performing broadcast algorithm available in Open MPI when it comes to large messages and a huge number of participating ranks.
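
The core of the pipeline can be sketched in a few lines of Fortran using plain point-to-point calls. This is only a rough illustration, not the actual tuned code: it assumes root rank 0, a double precision buffer and a caller-chosen segment size, and its blocking calls do not overlap receiving one segment with forwarding the previous one, unlike the real implementation.

!-----------------------------------------------------------------!
! Sketch of a segmented pipeline broadcast with root rank 0       !
! (blocking calls only, no overlap of receive and forward)        !
!-----------------------------------------------------------------!
subroutine pipeline_bcast(buf, count, segsize, comm)
  use mpi
  implicit none
  integer, intent(in) :: count, segsize, comm
  double precision, intent(inout) :: buf(count)
  integer :: rank, nprocs, ierr, offset, seg
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  offset = 1
  do while (offset <= count)
    seg = min(segsize, count - offset + 1)
    ! every rank except the root receives the segment from its predecessor
    if (rank > 0) &
      call MPI_Recv(buf(offset), seg, MPI_DOUBLE_PRECISION, rank-1, 0, &
                    comm, status, ierr)
    ! every rank except the last one forwards the segment to its successor
    if (rank < nprocs-1) &
      call MPI_Send(buf(offset), seg, MPI_DOUBLE_PRECISION, rank+1, 0, &
                    comm, ierr)
    offset = offset + seg
  end do
end subroutine pipeline_bcast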

A variation of this algorithm uses several pipelines to transport the segments:

MPI_Bcast with segmented chaining (4 chains)

The root rank simultaneously feeds several separate chains (or short pipelines), four in this specific case. Since the time between sending two consecutive segments down the same chain is longer than the time it takes a segment to traverse the InfiniBand link between ranks 11 and 12, no bunching of message lines is observed, unlike in the case of the full pipeline.

There are also algorithms that communicate messages over a tree structure. They make for less pretty and more “angry” looking pictures.

MPI_Bcast with segmented binary tree distribution

MPI_Bcast with segmented binomial tree distribution

Communication in the top picture follows a binary tree pattern, while that in the bottom one follows a binomial tree. The two trees differ in the distribution of process ranks among the nodes of the tree and in their breadth/depth given the same number of ranks. Although the overall message line density looks nearly the same (especially when viewed from a distance), on close inspection one can see that the "rays" of messages near the bottom actually follow completely different patterns.
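
To make the rank placement concrete, here is a rough sketch of a binomial-tree broadcast with root rank 0, again written with plain point-to-point calls. Segmentation (pipelining the segments down the tree) and support for an arbitrary root, both present in the real implementations, are left out for brevity.

!-----------------------------------------------------------------!
! Sketch of an unsegmented binomial-tree broadcast with root 0    !
!-----------------------------------------------------------------!
subroutine binomial_bcast(buf, count, comm)
  use mpi
  implicit none
  integer, intent(in) :: count, comm
  double precision, intent(inout) :: buf(count)
  integer :: rank, nprocs, mask, src, dst, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  ! phase 1: every non-root rank receives the data from its parent,
  ! obtained by clearing the lowest set bit of the rank number
  mask = 1
  do while (mask < nprocs)
    if (iand(rank, mask) /= 0) then
      src = rank - mask
      call MPI_Recv(buf, count, MPI_DOUBLE_PRECISION, src, 0, comm, status, ierr)
      exit
    end if
    mask = mask * 2
  end do

  ! phase 2: forward the data to the children, i.e. the ranks at
  ! power-of-two distances below the bit examined in phase 1
  mask = mask / 2
  do while (mask > 0)
    dst = rank + mask
    if (dst < nprocs) &
      call MPI_Send(buf, count, MPI_DOUBLE_PRECISION, dst, 0, comm, ierr)
    mask = mask / 2
  end do
end subroutine binomial_bcast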

Besides being pretty, these traces are instructive too. Many times the really performant implementations of MPI collectives are quite complicated and not always obvious. That's why it is best to stick with the vendor-provided collectives and not try to reimplement them in your own code (unless you do so for debugging or research purposes).

More trace art is coming soon.