The case: RPC vs. Messaging

There’s a classic flamefest about Remote Procedure Calls (RPC) vs. Messaging. People much brighter than me have discussed it elsewhere, but that doesn’t stop me from throwing in my 2 cents. It appears to me that there is a whole crowd of people still refusing to realize why RPC is so bad.

Before I get too deep into this, let’s get the terms RPC and Messaging better defined. I won’t claim that I have the “correct” definitions, but here’s what I mean when I use those terms: RPC is a mechanism that allows you to call methods on remote services as though they were methods on a local object. In pseudo code:

calc = lookup_calculator_service("127.0.0.1", 8080)
four = calc.add(2, 2)
eight = calc.multiply(2, 4)
print ("Result of (2+2)+(2x4) = " + calc.add(four, eight))

For Messaging, consider it like email – not between people, but between different apps on different machines. A message is typically some container-like format with some extra metadata naming the sender and the recipient(s), maybe timestamps and serial numbers. All you can do in a messaging system is basically send a message to a particular address. Whether or when the resident at that address responds is not possible to determine – just like email in that sense. For a large scale example of a messaging system we have the internet itself. The much hyped REST interactions of online services are also an example where messaging is starting to show success.
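To make the container idea concrete, here is a minimal sketch of what such a message envelope could look like in Java. The field names are illustrative only – no particular messaging system is implied:

// A minimal, illustrative message envelope - not any specific system's wire format
public class Message {
    public final String sender;     // address of the sending app
    public final String recipient;  // address of the receiving app
    public final long timestamp;    // when the message was created
    public final long serial;       // serial number, e.g. for ordering or deduplication
    public final byte[] body;       // opaque payload, e.g. a JSON document

    public Message(String sender, String recipient, long serial, byte[] body) {
        this.sender = sender;
        this.recipient = recipient;
        this.timestamp = System.currentTimeMillis();
        this.serial = serial;
        this.body = body;
    }
}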

Back to the RPC example above – it’s very convenient and easy to work with, right? If this example is really all you need to do, then I tend to agree that this kind of RPC is fine. But what happens if you are writing a mission critical system where data integrity is paramount, you have lots of interconnected services, and you need low latency and high throughput? Let’s examine the situation a bit…

The server might be implemented as:

function add (num1, num2) {
    return num1 + num2
}

The RPC system would then wrap the server object and expose some predefined methods as remote methods. It magically parses incoming calls and delegates control to my server’s add() function, giving it the right arguments.

Problems of RPC

What happens in line 2 of the client code above if calc.add(2, 2) causes the calculator service to go out of memory? Some RPC systems, like Java RMI, have the “feature” of sending you the raw exceptions exactly as they happen on the server. In the case of an OutOfMemoryError (OOM) the exception would completely escape the server’s logging and critical error handling and be sent to the caller. Our calculator client then gets an OOM without the slightest chance of figuring out whether it is itself OOM or the server is. And all the while the client thinks it is OOM, and might crash, the server, which really is OOM, happily chugs along down whatever path of complete failure lies ahead of it.

This can be solved partially if the client wraps all remote calls in try/catch blocks catching the most general type of error the runtime has – in Java this would be Throwable. The server also needs to wrap all of its remotely available methods in try/catch in order to shut down nicely (or protect itself in some way) in case of OOM or other critical errors. So our previous example now becomes:

calc = lookup_calculator_service("127.0.0.1", 8080)
try {
    four = calc.add(2, 2)
} catch (Throwable t) {
   log.warn("Error adding numbers!")
   return
}
... Nah... I am pretty sure you don't want to read the rest of the try/catch hell
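The server side needs similar armour. Here is a hedged sketch of what “protecting itself” might look like – the logger, the shutdownGracefully() hook and the translated exceptions are all hypothetical stand-ins:

// Illustrative server-side guard: log locally and translate raw errors
// into deliberate exceptions, so the raw OOM never escapes to the caller
public int add(int num1, int num2) {
    try {
        return num1 + num2;
    } catch (OutOfMemoryError oom) {
        log.error("Out of memory, shutting down");  // keep the evidence on the server
        shutdownGracefully();                       // hypothetical cleanup hook
        throw new RuntimeException("Service unavailable");
    } catch (Throwable t) {
        log.error("Unexpected error in add()", t);  // stack trace stays server side
        throw new RuntimeException("Internal server error");
    }
}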

As you of course realize, this can all be solved by thorough exception handling in both clients and servers. It won’t be fun, but it can be done. Let’s call this problem the Non-Local Exceptions Problem.

The next problem inherent in RPC could be called the Indirect Memory Allocation Problem. This problem arises anywhere you accept a data structure of arbitrary size in your methods’ arguments, e.g. an array. Suppose I change my calculator server’s API to be more flexible, so that the add() method takes an array of numbers to add, like calc.add([1,2,3,4,5,6]) = 21. Now what happens if a client sends me an array with 10^9 numbers to add? If we assume that a number is 4 bytes, then the RPC system on the server will try to allocate 4*10^9 bytes = 4GB for the array before passing control into add(). This will likely cause the server to OOM before the call even reaches my method.

To handle the indirect memory allocation problem I must either be able to ensure that my RPC system will not allow clients to send such huge arguments, or be able to parse the arguments in some streaming manner on the server side – but the latter does not sound a lot like RPC, does it?

– and note that the indirect memory allocation problem is not only on the server side. The server may also return a huge data structure as a method response, so the client needs to guard against this too.
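A sketch of the first option – assuming the wire format declares the array length up front, which most binary encodings do – could be as simple as a length check before any allocation happens. The 1M element limit below is an arbitrary example value:

import java.io.DataInputStream;
import java.io.IOException;

// Illustrative size guard: check the declared length of an incoming
// array *before* allocating memory for it
public class RequestGuard {
    private static final int MAX_ELEMENTS = 1000000;

    public static int[] readNumbers(DataInputStream in) throws IOException {
        int declaredLength = in.readInt();
        if (declaredLength < 0 || declaredLength > MAX_ELEMENTS) {
            throw new IOException("Refusing request: " + declaredLength
                                  + " elements exceeds limit of " + MAX_ELEMENTS);
        }
        int[] numbers = new int[declaredLength];  // now safe to allocate
        for (int i = 0; i < declaredLength; i++) {
            numbers[i] = in.readInt();
        }
        return numbers;
    }
}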

Next up on the list of problems is the Blocking Calls Problem. When the client calls the server it issues a request over the network and really has no way to anticipate when that call returns. While it waits, it blocks the thread from which it is calling (or at least all RPC systems I know do this). So if you want to do concurrent calls you’d need one thread per call in progress. If you’ve never seen an app go belly up because of thread starvation, I bet you’ve never programmed multi-threaded production systems. Blocking calls make your system more fragile and also much more affected by network latency.
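The standard workaround is to push each blocking call onto a worker thread, e.g. via an ExecutorService. A sketch – the Calculator interface here just stands in for whatever stub the RPC system hands you:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Illustrative thread-per-call workaround for blocking RPC. Note that
// every in-flight call still occupies a worker thread - which is
// exactly the thread starvation risk described above
public class AsyncCalcClient {
    interface Calculator { int add(int a, int b); }  // stand-in for the remote stub

    static Future<Integer> addAsync(final Calculator calc, ExecutorService pool) {
        return pool.submit(new Callable<Integer>() {
            public Integer call() {
                return calc.add(2, 2);  // blocks this worker thread, not the caller
            }
        });
    }
}

Note that the caller still blocks the moment it actually needs the result – Future.get() just moves the wait somewhere else.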

Skipping on to the next problem, this one strikes strongly typed programming languages particularly hard (like Java, which we use a lot here at the State and University Library of Denmark). Let’s call it the Static Interface Problem. In a strongly typed language you need to be able to resolve the method signatures at compile time (that, or use varargs signatures everywhere – eek!). In order to do this one frequently hand-writes or auto-generates interface or stub classes. If the remote API changes, your app is likely to crash or simply not run at all – the interface classes need to be regenerated and your code recompiled against these new interfaces. If you are a purist you might say that such public interfaces should never change, and that I must surely be a slacker since I even bring this up, but the sad fact of the matter is that in real life you cannot control the entire world, and interfaces do change.
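With Java RMI, for example, the compile-time coupling materializes as a shared interface like this sketch – change a signature here and every client has to be updated:

import java.rmi.Remote;
import java.rmi.RemoteException;

// The client compiles against this exact interface. Add, remove or
// change a method and every client needs the new interface class and
// a recompile before it can talk to the service again
public interface Calculator extends Remote {
    int add(int num1, int num2) throws RemoteException;
    int multiply(int num1, int num2) throws RemoteException;
}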

Looking back, RPC suffers from:

  • Non-Local Exceptions
  • Indirect Memory Allocation
  • Blocking Calls
  • Static Interfaces (in strongly typed languages)

The way these problems are typically solved in an RPC context is to write a CalcClient class which does the needed client side magic (catching exceptions, delegating work to an async thread, hiding the remote interface declaration, etc.) and then pass a bunch of HashMaps or parameterized value types around with each method, so you can stuff any arguments you need to add to the interface in a backwards compatible way. The only thing that is nearly impossible to tackle is the indirect memory allocation problem.
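Such a “loose” interface might look like this sketch – flexible, but all compile-time type checking is gone:

import java.util.Map;

// Illustrative "loose" interface: one generic method with the real
// arguments stuffed into a map, so new arguments can be added without
// ever changing the compiled method signature
public interface LooseCalculator {
    Map<String, Object> invoke(String operation, Map<String, Object> args);
}

// Usage sketch (assuming looseCalc is a remote proxy of this interface):
//   Map<String, Object> args = new HashMap<String, Object>();
//   args.put("numbers", Arrays.asList(1, 2, 3, 4, 5, 6));
//   Map<String, Object> reply = looseCalc.invoke("add", args);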

Enter Messaging. Messaging solves all of the above problems in one fell swoop, and if you decide to use a standard like HTTP for the connections, then you can even talk to your messaging services via your browser or standard Unix command line tools like wget or curl.
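As a hedged illustration of how approachable that is – the endpoint URL and the JSON message format below are invented for this example – sending our addition as a message boils down to an HTTP POST:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative message send over plain HTTP. There is no stub and no
// generated interface - just a document posted to an address
public class MessagingClient {
    public static void main(String[] args) throws Exception {
        String json = "{\"op\": \"add\", \"numbers\": [2, 2]}";
        URL url = new URL("http://127.0.0.1:8080/calc");  // hypothetical service address
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        OutputStream out = conn.getOutputStream();
        out.write(json.getBytes("UTF-8"));
        out.close();
        System.out.println("Service answered: " + conn.getResponseCode());
    }
}

And the exact same (made up) message can be sent from a shell with curl -d '{"op": "add", "numbers": [2, 2]}' http://127.0.0.1:8080/calc – no client library required.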

Tooting my Own Horn

The above list of problems is not just pulled out of my hat. We have seen, and fought, them all in Summa. To start moving down the messaging road I started the no-nonsense Juglr project on Gitorious. It’s still far from ready, but it’s coming along nicely. In a nutshell it is an Actor model implementation coupled with a JSON/HTTP high performance messaging system. In order not to reinvent the wheel too much I am basing the actor model implementation on Doug Lea’s Fork/Join framework, which is also scheduled for inclusion in Java 7.

Real-life examples

An incomplete list of the RPC systems I’ve crossed paths with:

  • Java RMI
  • SOAP
  • CORBA

A ditto list of Messaging systems:

  • HTTP and Email
  • REST(ful) web services
  • DBus
  • Protocol Buffers

The last two, DBus and Protocol Buffers, deserve an extra note. When you get down to the protocol level these two systems are indeed both messaging systems, but they are most often used as RPC systems! I am honestly not sure why that is, but it’s probably because it is (deceptively) easier to get started with an RPC based approach.


5 Responses to The case: RPC vs. Messaging

  1. Prashanth says:

    Hi there,

    Skimming through the Summa code, it seems like your storage and search web services are RPC-based. Reading through this post, it would seem like you prefer messaging systems, so why/how was the decision made to go with RPC in Summa? Do you regret doing so?

    – Prashanth

  2. kamstrup says:

    Thanks for your interest Prashanth.

    Summa was started in 2005, and at the time we simply didn’t have the necessary experience with scalable distributed systems and RPC (notably, Java RMI was seductively easy to get started with). Furthermore, I don’t think it was expected that Summa would have to scale to the massive data sets we have on the horizon now (1G records).

    We are talking about moving all IPC in Summa to a message based approach for Summa 2.0, but it is a big change and we haven’t fully thought it through. In any case – if I were to write a big system like Summa from scratch again, I would no doubt go for a message passing IPC system.

  3. Prashanth says:

    Haha, I completely understand where you’re coming from 🙂 You mentioned that the team is looking towards a message-based system for Summa 2.0 – any architectures/technologies/software in specific?

    FYI, Juglr seems extremely interesting and I’d love to contribute. I’ve lived in the RPC world (EJBs, SOAP WS, ESBs) for a long time and I’d like to make a change. It’s even better that you’re working with a yet-to-be-standardized spec for Java 7. Are you looking for any help there?

  4. kamstrup says:

    No, we don’t have anything specific (maybe you have a proposal?). We have also been looking into the Actor pattern and scripting extensions to the runtime (e.g. JavaScript), and I think we want a messaging system that is amenable to these goals.

    Also, we have a dislike for huge frameworks where a third of the code is in domain specific XML, another third is autogenerated, and the last third is our own. We really want a simple lib (or set of libs) that enables these features with no fuss.

    As I couldn’t find a messaging system that was geared for this, I started Juglr. For the scripting and extensibility I started Jumpr (hooray for original naming! :-)). See http://sourceforge.net/projects/jumpr/.

    If you want to get involved in Juglr please hold on a week or two. I hope to roll the first release very soon. It will be incomplete, but have the basic functionality ready. At that time I’ll also have a clearer roadmap and we can see how your ideas fit in. Any extra hands would be most welcome!

    I’ll announce Juglr 0.1 on this blog, so you should pick it up.

  5. missingfaktor says:

    How would you categorise Distributed Smalltalk? It was an implementation of CORBA, but like everything else in Smalltalk, was essentially message passing across objects.
