MPI Trace Art

The internal workings of most MPI libraries are considered black magic by many. Indeed, the PMPI profiling interface, used by virtually all portable tracing and profiling libraries, treats collective operations as black boxes: one only sees coloured polygons in the trace visualisation, and all messages sent between the processes in order to implement a given collective operation remain hidden. Fortunately, the most widely used general-purpose MPI implementations come with openly accessible source code, so one can easily reimplement the collective algorithms using regular point-to-point MPI operations in order to trace them and analyse their performance. Some nice graphics can also be produced as a by-product. Surprisingly, it turns out that some of those traces resemble modern art. And thus MPI trace art is born.
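
To see why the collectives appear as featureless blocks, here is a minimal sketch of the kind of PMPI wrapper such tools are built upon; the timing and the printout are purely illustrative and not what any particular tracer records:

```c
/* Minimal PMPI interposition for MPI_Bcast: the wrapper observes only the
 * entry and exit of the collective call, while any point-to-point traffic
 * the library generates internally stays hidden behind PMPI_Bcast. */
#include <mpi.h>
#include <stdio.h>

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Bcast(buffer, count, datatype, root, comm); /* the black box */
    double t1 = MPI_Wtime();

    int rank;
    PMPI_Comm_rank(comm, &rank);
    /* Illustrative output only; real tracers write event records instead. */
    printf("rank %d: MPI_Bcast of %d elements took %.6f s\n",
           rank, count, t1 - t0);
    return rc;
}
```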

Motivated by some unusual behaviour of the MPI broadcast operation in Open MPI on RWTH’s compute cluster (an unusually long completion time for a certain “magic” number of MPI processes), I reimplemented some of the broadcast algorithms from the tuned module of the coll framework and traced the result with VampirTrace. tuned is currently the collective communications module that gets selected for most cluster jobs unless one intervenes in the module selection process. It implements several different algorithms and chooses among them using empirically derived heuristic logic, unless a special file with dynamic rules has been provided. Here is what the default algorithm for broadcasting large messages to a large number of processes looks like in Vampir:

/images/coll_tuned_bcast_intra_pipeline.thumbnail.png

MPI_Bcast with segmented pipelining

Each message is split into many segments of equal size (except the last one, which could be shorter) and a pipeline is built: the root rank sends to the next rank, which sends to the rank after it, and so on. The change of slope between ranks 11 and 12 is a clear sign of inter-node communication with different latency and/or bandwidth. Since the InfiniBand network has higher latency than the shared memory used for intra-node messaging, a build-up of messages is observed, which compresses the message lines further along in time. Another such slope change is present between ranks 23 and 24, but it does not lead to another bunching of message lines, as they have already been spread out while crossing between ranks 11 and 12. The narrow polygon on the left side is an MPI_Barrier collective call and is an example of how opaque the MPI collectives are when seen through the PMPI interface.
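
For reference, a pipelined broadcast of this kind can be sketched with plain point-to-point calls as below, so that every segment shows up as an individual message in the trace. This is my own simplified illustration, not the actual coll_tuned code: the function name and the segsize parameter are made up, the datatype is assumed contiguous, and blocking sends and receives stand in for the non-blocking calls a real implementation would use.

```c
/* Sketch of a segmented pipelined broadcast built from point-to-point calls. */
#include <mpi.h>

int pipeline_bcast(void *buf, int count, MPI_Datatype dtype,
                   int root, MPI_Comm comm, int segsize /* elements per segment */)
{
    int rank, size, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_size(dtype, &type_size);   /* assumes a contiguous datatype */

    /* Position in the pipeline: the root is virtual rank 0, its successor 1, ... */
    int vrank = (rank - root + size) % size;
    int prev  = (rank - 1 + size) % size;   /* upstream neighbour   */
    int next  = (rank + 1) % size;          /* downstream neighbour */

    int nsegs = (count + segsize - 1) / segsize;
    char *p = (char *)buf;

    for (int s = 0; s < nsegs; s++) {
        int scount = (s == nsegs - 1) ? count - s * segsize : segsize;
        char *sbuf = p + (size_t)s * segsize * type_size;

        if (vrank != 0)             /* everyone but the root receives the segment... */
            MPI_Recv(sbuf, scount, dtype, prev, s, comm, MPI_STATUS_IGNORE);
        if (vrank != size - 1)      /* ...and everyone but the last rank forwards it */
            MPI_Send(sbuf, scount, dtype, next, s, comm);
    }
    return MPI_SUCCESS;
}
```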

Note the overall peaceful feeling streaming from the communication structure; it almost looks like laminar fluid flow. No surprise that this is the best performing broadcast algorithm available in Open MPI when it comes to large messages and a huge number of participating ranks.

A variation of this algorithm uses several pipelines to transport the segments:

/images/coll_tuned_bcast_intra_chain.thumbnail.png

MPI_Bcast with segmented chaining (4 chains)

The root rank simultaneously feeds several separate chains (or short pipelines), four in this specific case. Since the time between two consecutive segments being sent down the same chain is longer than the time it takes a segment to traverse the InfiniBand link between ranks 11 and 12, no bunching of message lines is observed as in the case of the full pipeline.
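
Under the same simplifying assumptions (hypothetical function name, root fixed at rank 0, contiguous datatype, blocking point-to-point calls), the chained variant can be sketched like this:

```c
/* Sketch of a segmented chain broadcast: rank 0 feeds `nchains` independent
 * chains; the non-root ranks are split into contiguous chains of roughly
 * equal length and each member forwards every segment to its successor. */
#include <mpi.h>

int chain_bcast(void *buf, int count, MPI_Datatype dtype,
                MPI_Comm comm, int segsize, int nchains)
{
    int rank, size, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_size(dtype, &type_size);   /* assumes a contiguous datatype */

    int members  = size - 1;                         /* ranks other than the root */
    int chainlen = (members + nchains - 1) / nchains;

    int prev = -1, next = -1;
    if (rank != 0) {
        int idx = rank - 1;                          /* index among non-root ranks */
        int pos = idx % chainlen;                    /* position inside the chain  */
        prev = (pos == 0) ? 0 : rank - 1;            /* chain heads receive from root */
        next = (pos == chainlen - 1 || rank == size - 1) ? -1 : rank + 1;
    }

    int nsegs = (count + segsize - 1) / segsize;
    char *p = (char *)buf;

    for (int s = 0; s < nsegs; s++) {
        int scount = (s == nsegs - 1) ? count - s * segsize : segsize;
        char *sbuf = p + (size_t)s * segsize * type_size;

        if (rank == 0) {
            /* Feed the current segment to the head of every chain. */
            for (int c = 0; c < nchains; c++) {
                int head = 1 + c * chainlen;
                if (head < size)
                    MPI_Send(sbuf, scount, dtype, head, s, comm);
            }
        } else {
            MPI_Recv(sbuf, scount, dtype, prev, s, comm, MPI_STATUS_IGNORE);
            if (next != -1)
                MPI_Send(sbuf, scount, dtype, next, s, comm);
        }
    }
    return MPI_SUCCESS;
}
```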

There are also algorithms that communicate messages over a tree structure. They make for less pretty and more “angry”-looking pictures.

/images/coll_tuned_bcast_intra_bintree.thumbnail.png

MPI_Bcast with segmented binary tree distribution

/images/coll_tuned_bcast_intra_binomial.thumbnail.png

MPI_Bcast with segmented binomial tree distribution

Communication in the top picture follows a binary tree pattern, while that in the bottom one follows a binomial tree. The two trees differ in how the process ranks are distributed among the nodes of the tree and in their breadth and depth for the same number of ranks. Although the overall message line density looks nearly the same (especially when viewed from a distance), on close inspection one can see that the “rays” of messages near the bottom actually follow completely different patterns.
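
To make the binomial pattern concrete, here is a compact sketch of a plain (non-segmented) binomial tree broadcast; the traced coll_tuned version additionally pipelines the segments, and the function name is again made up:

```c
/* Sketch of a binomial tree broadcast: a process receives from the parent
 * that differs from it in the lowest set bit of its virtual rank, then
 * forwards to the children obtained by setting successively lower bits. */
#include <mpi.h>

int binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                   int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;     /* the root becomes virtual rank 0 */

    /* Receive from the parent (everyone except the root has one). */
    int mask = 1;
    while (mask < size) {
        if (vrank & mask) {
            int parent = (vrank - mask + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to the children: virtual ranks vrank + mask for decreasing mask. */
    mask >>= 1;
    while (mask > 0) {
        int child_v = vrank + mask;
        if (child_v < size)
            MPI_Send(buf, count, dtype, (child_v + root) % size, 0, comm);
        mask >>= 1;
    }
    return MPI_SUCCESS;
}
```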

Besides being pretty, these traces are instructive too. The really performant implementations of MPI collectives are often quite complicated and not always obvious. That’s why it is best to stick with the vendor-provided collectives and not to reimplement them in one’s own code (unless one does so for debugging or research purposes).

More trace art will come some day.
