Andre de Cavaignac

Let's blog it out...

Cluster Primitives: MPI, MPI.NET, Large Data, and Passing Classes

The Message Passing Interface (MPI) standard and its .NET binding, MPI.NET, have been cornerstones of development on compute clusters.  The standard supplies a simple, if primitive, way of sending and receiving data between running compute processes.

The big advantage of MPI has been its mix of simplicity and speed.  A call to MPI Send on one node and MPI Receive on another blocks both callers until the operation is complete.  More complex calls, such as MPI Scatter and MPI Gather, allow a single node to distribute data to a set of nodes or collect it from them.  An MPI Barrier makes every node wait until all of them have reached the same agreed-upon place in the code, and only then lets them continue.  Such primitives allow a distributed set of processes to communicate, do some work, and then share with each other the values that each needs to continue.  Because this is all done with low-level, bare-metal socket tricks and/or shared memory, the result is blazingly fast communication.
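To make the scatter, barrier and gather pattern above concrete, here is a toy sketch in Python.  It is only an analogy: it uses threads inside one process rather than MPI ranks on separate nodes, and the name run_workers is made up for illustration.

```python
import threading

def run_workers(data, n_workers):
    """Toy scatter / barrier / gather using threads in a single process."""
    # Scatter: the "root" deals the data out, one slice per worker rank.
    slices = [data[i::n_workers] for i in range(n_workers)]
    partial = [None] * n_workers       # one result slot per rank
    totals_seen = [None] * n_workers   # what each rank sees after the barrier
    barrier = threading.Barrier(n_workers)

    def worker(rank):
        partial[rank] = sum(slices[rank])  # local work on this rank's slice
        barrier.wait()                     # no rank continues until all arrive
        # Past the barrier every slot is filled, so each rank can read them all.
        totals_seen[rank] = sum(partial)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Gather: the root collects the per-rank results.
    return partial, totals_seen
```

Without the barrier, a fast worker could sum the partial results before the slower ones had written theirs.  That ordering guarantee is the whole point of the primitive.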

With this simplicity, however, comes a trade-off.  MPI has been a standard since the mid-1990s and has changed very little since its inception.  The way we program has changed drastically in that time, especially with managed languages such as C#.  We no longer tend to worry about memory allocation or about dealing with raw memory.  Today, most languages offer automatic memory allocation, garbage collection and type safety.  Although the primitives in MPI are unparalleled in simplicity for allowing multiple processors to communicate about a shared set of work, some striking limitations appear once we dig a bit deeper.

When I started with MPI.NET, I found the interface very simple.  An Mpi.Send<T>(obj) would send an object to a waiting client.  Mpi.Receive<T>() would give you back that object.  Nothing could be simpler.  In my case, however, T happened to be a class that contained a byte[] of undetermined size.  Once run on the cluster, the size of the byte[] I was passing increased dramatically, and an unexpected exception occurred:

AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory has been corrupted.

After a lengthy investigation, I found that MPI.NET was attempting to pin some memory in the .NET GC heap and pass that memory location as a buffer to the underlying MSMPI stack.  In doing so, it did not allocate enough memory for my large byte[], so the native write ran past the end of the buffer and into the rest of the GC heap, which threw the exception.  To solve this in my case, I created a large enough buffer and passed it into an overload of Mpi.Receive<byte>(byte[]).  This overload pins the entire array that was passed in on the GC heap, and then passes that to the MSMPI stack.  On the send side, I manually serialized my class, checked the length (to ensure I would not overflow the receive buffer) and sent the byte[] instead of the raw class.  This solution does not take into account messages larger than my expected buffer.  For that, I would have needed to chunk the data.
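The chunking step I skipped can be sketched roughly as follows.  This is a minimal, transport-agnostic illustration in Python, not MPI.NET code: the helper names to_chunks and from_chunks, the 8-byte length header and the use of pickle as the serializer are all assumptions made for the example.  Each yielded message stands in for one send of at most chunk_size bytes.

```python
import pickle
import struct

HEADER = struct.Struct("<Q")  # assumed wire format: 8-byte little-endian length

def to_chunks(obj, chunk_size):
    """Serialize obj and yield messages of at most chunk_size bytes each."""
    if chunk_size < HEADER.size:
        raise ValueError("chunk_size is too small to carry the length header")
    payload = pickle.dumps(obj)
    yield HEADER.pack(len(payload))  # first message announces the total size
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]

def from_chunks(messages, chunk_size):
    """Rebuild the object from the messages produced by to_chunks."""
    it = iter(messages)
    (total,) = HEADER.unpack(next(it))
    buf = bytearray(total)  # receiver allocates once, knowing the exact size
    offset = 0
    for chunk in it:
        # Refuse anything that would write past the end of the buffer.
        if len(chunk) > chunk_size or offset + len(chunk) > total:
            raise ValueError("chunk would overflow the receive buffer")
        buf[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
        if offset == total:
            break
    if offset != total:
        raise ValueError("message stream ended before the payload was complete")
    return pickle.loads(bytes(buf))
```

Sending the length first means the receiver never has to guess a buffer size, and the bounds check turns a would-be memory overrun into an ordinary error.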

The moral of the story is that sending primitives, arrays of primitives or fixed-size structs over MPI.NET (which is the most common scenario) is a great use of a very fast messaging interface.  Once your demands get more complex, the MPI stack becomes less attractive, not because it cannot send more complex messages, but because of the manual labor involved in serializing and chunking the data.
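The reason fixed-size data is the easy case is that both sides know the exact byte count before anything is sent.  A small Python sketch of that idea, using a hypothetical record layout (an int32 id, a float64 value and a 16-byte tag) that is not taken from my actual code:

```python
import struct

# Hypothetical fixed-size record: int32 id, float64 value, 16-byte tag.
# "<" means little-endian with no padding, so the size is 4 + 8 + 16 bytes.
RECORD = struct.Struct("<id16s")

def pack_record(rec_id, value, tag):
    """Encode one record into exactly RECORD.size bytes."""
    return RECORD.pack(rec_id, value, tag)

def unpack_record(buf):
    """Decode a record; fails loudly if buf is not exactly RECORD.size bytes."""
    rec_id, value, tag = RECORD.unpack(buf)
    return rec_id, value, tag.rstrip(b"\x00")  # drop the zero padding on the tag
```

A receiver can allocate RECORD.size bytes up front for every message.  A class holding a byte[] of arbitrary length offers no such guarantee, which is where the trouble described above begins.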

It is no wonder that the HPC community is moving away from the traditional methods of MPI and communication across a set of processors to a Service Oriented (SOA) model.  The benefits of using existing components, such as WCF and its NetTcpBinding, along with the threading, serialization and transport models and other features these frameworks already provide, outweigh the possible performance penalty.  Problems such as the one explained above simply do not happen with frameworks like WCF.  Furthermore, although the underlying concepts of MPI and its messaging model are very simple and appealing, the overall development, maintenance and debugging of an SOA application are much simpler than those of an MPI application.  The amount of custom code, and with it the code complexity, drops when compared to an MPI implementation.

The general industry trend seems to be towards SOA models.  Microsoft Windows HPC Server 2008 is a great example of this.  HPC Server uses WCF to distribute load across the cluster, and can even dynamically scale resources depending on the demand for a particular service.  Platform, another industry competitor, has been building with an SOA model for some time now.

I'm looking forward to playing with HPC Server 2008 and WCF more as time progresses.  I think that the WCF model will solve a whole bunch of the headaches one incurs when trying to communicate over primitives as simple, perhaps overly simple, as those of MPI.  Many models and workloads simply do not require the type of communication MPI provides, and using MPI for them can be like fitting a square peg into a round hole.  This is not to say MPI does not have its place: many complex processes do require constant communication between a set of workers.  However, I believe many of the problems we use HPC for today can be distributed using SOA in a much simpler fashion.

Comments

Christopher Steen said:

Link Listing - May 1, 2008

# May 2, 2008 4:27 AM

Christopher Steen said:

ASP.NET Deserializing JSON into a list of abstract base classes [Via: Kyle Baley ] Sharepoint Microsoft...

# May 2, 2008 4:27 AM