MPI or Not MPI – That is the Question

Head of Products and Services at ELEKS

One of the most popular questions we got at the latest GPU Technology Conference was: “Why wouldn’t you use MPI instead of writing your own middleware?” The short answer is: “Because home-made middleware is a better fit for the projects we did.” If you are not satisfied with that explanation, a longer answer follows.

MPI vs. custom communication library

First of all, one should always keep in mind that MPI is a universal framework built for general-purpose HPC. It is really good for, let’s say, academic HPC, where you have a calculation that you need to run only once, get the results, and forget about your program. But if you have a commercial HPC cluster designed to solve one particular problem many times (say, running simulations with the Monte Carlo method), you should be able to optimize every single component of your system, just to make sure that your hardware utilization rate is high enough to keep the system cost-efficient. With your own code base you can make network communications as fast as possible, without any limitations. And, what is very important, you can keep this code simple and easy to understand, which is not always possible with general-purpose frameworks like MPI.
But what about the complexity of writing your own network library? Well, it is not as complex as you might imagine. Some tasks (like Monte Carlo simulations) are embarrassingly parallel, so you don’t need complex interactions between your nodes. You have a coordinator that sends tasks to workers and then aggregates the results from them (see our GTC presentation for more details about that architecture). It is relatively easy to implement a lightweight messaging library with raw sockets; you just need a good enough software engineer for the task.
And last, but definitely not least: a lightweight solution written to solve one particular problem is much faster and more predictable than a universal tool like MPI.
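To make the coordinator/worker idea above concrete, here is a minimal sketch over raw TCP sockets with a simple length-prefixed framing. All names (`send_msg`, `recv_msg`, `coordinate`, the squaring "simulation") are illustrative inventions for this post, not our actual library; workers run as threads on the loopback interface so the sketch is self-contained.

```python
import socket
import struct
import threading

def send_msg(sock, payload: bytes) -> None:
    # Length-prefixed framing: 4-byte big-endian size, then the payload.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_msg(sock) -> bytes:
    (size,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, size)

def worker(port: int) -> None:
    # Each worker connects to the coordinator, receives a task, replies.
    with socket.create_connection(("127.0.0.1", port)) as sock:
        task = int(recv_msg(sock).decode())
        send_msg(sock, str(task * task).encode())  # toy "simulation": square it

def coordinate(num_workers: int) -> int:
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(num_workers)
    port = server.getsockname()[1]
    threads = [threading.Thread(target=worker, args=(port,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    conns = [server.accept()[0] for _ in range(num_workers)]
    for i, conn in enumerate(conns):
        send_msg(conn, str(i).encode())            # scatter tasks
    total = 0
    for conn in conns:
        total += int(recv_msg(conn).decode())      # gather results
        conn.close()
    for t in threads:
        t.join()
    server.close()
    return total

print(coordinate(4))  # sum of squares of tasks 0..3
```

The whole "middleware" here is two framing helpers and a scatter/gather loop, which is roughly the amount of machinery an embarrassingly parallel workload actually needs.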

Benchmark

Our engineers compared the performance of our network code with Open MPI on Ubuntu and Intel MPI on CentOS (for some reason, Intel MPI refused to work on Ubuntu). They tested multicast performance, because it is critical for the architecture we use in our solutions. There were three benchmarks (described in a kind of pseudocode):

1. MPI point-to-point
if rank == 0:
 #master
 for j in 0..packets_count:
  for i in 1..processes_count:
   MPI_Isend() #async send to slave processes
  for i in 1..processes_count:
   MPI_Irecv() #async recv from slave processes
  for i in 1..processes_count:
   MPI_Wait() #wait for send/recv to complete
else:
 #slave
 for j in 0..packets_count:
  MPI_Recv() #recv from master process
  MPI_Send() #send to master process
2. MPI broadcast
if rank == 0:
 #master
 for j in 0..packets_count:
  MPI_Bcast() #broadcast to all slave processes
  for i in 1..processes_count:
   MPI_Irecv() #async recv from slave processes
  for i in 1..processes_count:
   MPI_Wait() #wait for recv to complete
else:
 #slave
 for j in 0..packets_count:
  MPI_Bcast() #recv broadcast message from master process
  MPI_Send() #send to master process
3. TCP point-to-point
#master
controllers = []
for i in 1..processes_count: #wait for all slaves to connect
 socket = tcp_accept_as_blob_socket()
 controllers.append(controller_t(socket))

for j in 0..packets_count:
 for i in 1..processes_count:
  controllers[i].send() #async send to slave processes
 for i in 1..processes_count:
  controllers[i].recv() #wait for recv from slave processes

#slave
socket = tcp_connect_as_blob_socket() #connect to master
for j in 0..packets_count:
 socket.read() #recv packet from master
 socket.write() #send packet to master
We ran the benchmarks with 10, 20, 40, 50, 100, 150, and 200 processes, sending packets of 8, 32, 64, 256, 1024, and 2048 bytes. Each test included 1,000 packets.
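The pseudocode of benchmark 3 can be turned into a small runnable sketch. The `blob` framing below (a 4-byte length prefix) is an assumption standing in for our `tcp_accept_as_blob_socket` / `tcp_connect_as_blob_socket` primitives; slaves run as threads on loopback, and the master counts successful echoes instead of measuring time.

```python
import socket
import struct
import threading

PACKETS = 100
PAYLOAD = b"x" * 64  # one of the benchmarked packet sizes

def write_blob(sock, data):
    # 4-byte big-endian length prefix, then the payload.
    sock.sendall(struct.pack(">I", len(data)) + data)

def _read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def read_blob(sock):
    (size,) = struct.unpack(">I", _read_exact(sock, 4))
    return _read_exact(sock, size)

def slave(port):
    with socket.create_connection(("127.0.0.1", port)) as sock:
        for _ in range(PACKETS):
            packet = read_blob(sock)   # recv packet from master
            write_blob(sock, packet)   # send packet back to master

def master(num_slaves=3):
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(num_slaves)
    port = server.getsockname()[1]
    threads = [threading.Thread(target=slave, args=(port,))
               for _ in range(num_slaves)]
    for t in threads:
        t.start()
    conns = [server.accept()[0] for _ in range(num_slaves)]  # wait for all slaves
    echoed = 0
    for _ in range(PACKETS):
        for conn in conns:
            write_blob(conn, PAYLOAD)       # send to slave processes
        for conn in conns:
            if read_blob(conn) == PAYLOAD:  # wait for recv from slaves
                echoed += 1
    for conn in conns:
        conn.close()
    for t in threads:
        t.join()
    server.close()
    return echoed

print(master())  # 3 slaves x 100 packets = 300 echoes
```

Wrapping the two `for` loops of the master in a timer gives essentially the numbers we report below, minus real network latency.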

Results

First of all, let’s look at the open-source MPI implementation results:
Open MPI @ Ubuntu, cluster of 3 nodes, 10 workers:
Open MPI @ Ubuntu, cluster of 3 nodes, 50 workers:
Open MPI @ Ubuntu, cluster of 3 nodes, 200 workers:
So, Open MPI is slower than our custom TCP messaging library in all tests. Another interesting observation: Open MPI broadcast is sometimes even slower than iterative point-to-point messaging with Open MPI.
Let’s look at the proprietary MPI implementation from Intel. For some reason it didn’t work on Ubuntu 11.04, which we use on our test cluster, so we ran the benchmark on another cluster with CentOS. Please keep that fact in mind: you can’t directly compare the results of Open MPI and Intel MPI, since we tested them on different hardware. Our main goal was to compare MPI with our TCP messaging library, so these results work for us. Another thing: Intel MPI broadcast didn’t work for us, so we tested only point-to-point communication performance.
Intel MPI @ CentOS, cluster of 2 nodes, 10 workers:
Intel MPI @ CentOS, cluster of 2 nodes, 50 workers:
Intel MPI @ CentOS, cluster of 2 nodes, 200 workers:
Intel MPI is a much more serious opponent for our library than Open MPI. It is 20–40% faster in the 10-worker configuration and shows comparable (sometimes faster) performance with 50 workers. But with 200 workers it is 50% slower than our messaging library.
You can also download an Excel spreadsheet with the complete results.

Conclusions

In general, Open MPI doesn’t meet the middleware requirements of our projects. It is slower than our custom library and, what is even more important, it is quite unstable and unpredictable.
Intel MPI point-to-point messaging looks much more interesting on small clusters, but on large ones it becomes slow in comparison with our custom library. We experienced problems running it on Ubuntu, which may be an issue if you want to use Intel MPI with that Linux distribution. Broadcast was unstable and hung.
So, sometimes the decision to write your own communication library doesn’t look so bad, right?
