Sunday, September 16, 2007

JMS: So simple yet sometimes so tricky

The Java Message Service (JMS) API, provided as part of Java EE, appears to be a small, limited-scope API. However, appearances can be deceiving; experience has shown that a lot of developers make small but far-reaching mistakes with it. JMS is a useful tool that allows you to easily implement asynchronous communication between components. It also allows for cross-platform integration, with C and C++ implementations of the API available. To some extent, JMS is therefore the architect's golden axe for many problems such as cross-system integration, scalability and reliability. It is not as buzzword compliant as web services, but it is still very useful and widely used, so how does this translate in the hands of the developers using it?

First, let us briefly summarize what JMS is about. It is a connection-based messaging system with two main abstractions for messaging channels, namely topics and queues. Queues and topics have very different semantics: queues are for one-to-one communication, while topics are for one-to-many communication. A queue will buffer a message until a consumer, well, consumes it. A topic will not hang on to a message; it is only delivered to the consumers listening at the time it is generated (the exception being durable subscriptions). To be able to receive or send a message you need four things: a connection factory, a connection, a session and a destination. The connection factory and the destination are acquired through a lookup in the InitialContext (I'll skip the entire setup of resources). A connection is acquired through the connection factory, and the session is acquired through the connection. Once you have all these, you can create a consumer or a producer from the session. As you can see, a lot is involved before you can send a message or even start listening. However, once you know how to do it, it's pretty much the same each time, so it should be tedious but trivial. Right?
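To make the sequence concrete, here is a minimal sketch of sending a text message; the JNDI names "jms/ConnectionFactory" and "jms/OrderQueue" are invented for the example and would come from your own resource setup.

import javax.jms.*;
import javax.naming.InitialContext;

public class SimpleSender {

    public void send(String text) throws Exception {
        InitialContext ctx = new InitialContext();
        // The factory and the destination are administered objects looked up in JNDI.
        ConnectionFactory factory = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Destination destination = (Destination) ctx.lookup("jms/OrderQueue");

        Connection connection = factory.createConnection();
        try {
            // Non-transacted session with automatic acknowledgement.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(destination);
            producer.send(session.createTextMessage(text));
        } finally {
            // Closing the connection also closes its sessions and producers.
            connection.close();
        }
    }
}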

Well, as it turns out, not really, at least not if you take into account the number of bugs generated by developers working with JMS. Don't get me wrong, it is not a dark abyss of software bugs, but compared to the simplicity of the API it is surprising. So what are usually the big sinners? I've mostly encountered three: thread safety, resource management, and fault tolerance.

So let us start with thread safety. If you have read the spec, you know all about it; it is actually explained in there how you are supposed to handle it. That is where most architects say "Great! It's all taken care of, thought through and ready to use by the developers." The big mistake here is that most developers don't read the spec. I have to admit that I don't read the spec for everything I use either, but the fact is that I probably should. Actually, it took my first mistake using JMS for me to read it. In there, it is VERY clearly stated that Session and MessageProducer/MessageConsumer are not thread safe, yet a lot of developers just fire them up and cache them without any regard for the number of threads going through them. On a client you might get away with it most of the time, although you will have some mystery bugs. On the server, you'll probably notice quickly, because your transactions will start misbehaving if you share sessions. So how to fix this? You could force all developers in your organization to read the spec, but forcing people to read 120+ pages rarely does any good. So I would propose adding a simple requirement to the programming guidelines of the project (I'm not talking about the guideline that tells you where to put the braces, I'm talking about the useful one not written by QA): simply require all Sessions and Producers/Consumers to be local variables. Local variables are by definition thread safe, and this requirement is fairly easy to check with static code analysis.
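As an illustration of that rule, here is a rough sketch of a publisher that caches only the Connection (which the spec does allow to be shared between threads) and keeps the Session and MessageProducer as local variables; the class and method names are made up for the example.

import javax.jms.*;

public class ThreadSafePublisher {

    private final Connection connection;   // Connections are thread safe, so caching is fine
    private final Destination destination;

    public ThreadSafePublisher(Connection connection, Destination destination) {
        this.connection = connection;
        this.destination = destination;
    }

    public void publish(String text) throws JMSException {
        // Session and MessageProducer are created, used and closed within one call,
        // so they never leak across threads.
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        try {
            MessageProducer producer = session.createProducer(destination);
            producer.send(session.createTextMessage(text));
        } finally {
            session.close();
        }
    }
}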

Okay, let's move on to resource management. JMS is quite resource heavy: connections, sessions, producers and consumers are all resources, by which I mean you actually have to close them once you are done. If you don't, you have a resource leak on your hands, which can be difficult to track down. There aren't really any good constructs in the Java language to prevent leaks from the framework perspective. You have the try-finally construct, but then you are relying on the clients to prevent the leak. Closures might make an appearance to help with this later on, but it remains to be seen whether they ever make the cut. However, static analysis can once again help detect potential bugs: if your close call is not in a finally block (or not there at all), you have probably found a bug. On an application server, managing resources is easy, since it is up to the application server to manage them; you just have to give it a chance to do so. That means setting up a connection pool on the application server, not holding on to the connection, and remembering to close it. Holding on to the connection is a common mistake; after all, you are supposed to fetch resources in ejbCreate() and release them in ejbRemove(). Well, that would be a mistake, because a JMS Connection is not expensive to get if the pool is up and running, but by holding on to it you might force the application server to create new connections, which are expensive to create and to maintain. Therefore, on the server, the same recommendation as for Sessions in the previous section applies: always make your connections local variables.

On the client, things are a little bit more complicated. Nobody is managing resources but you, so you have to be a little bit more careful. You could just create your connections and release them right away when you are done. This works well if you only send messages from time to time; of course, you can't do that for listeners. If your clients are heavily using JMS, you might want to implement your own pooling mechanism. This is easier than it sounds: basically, all you need to do is put a wrapper around the ConnectionFactory, and since you probably already have a service locator implemented, you could have it return the wrapper instead of the vendor's connection factory (see the sketch below). The advantage of this is that no matter how well the vendor's implementation scales, you can tweak your end of the system.
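A very rough sketch of such a client-side pooling wrapper might look like the following; the class name is invented, and a real version would also wrap the returned Connection so that a client calling close() hands it back to the pool instead of actually closing it.

import javax.jms.*;

public class PooledConnectionFactory implements ConnectionFactory {

    private final ConnectionFactory delegate;
    private Connection shared;

    public PooledConnectionFactory(ConnectionFactory delegate) {
        this.delegate = delegate;
    }

    // Hand out the same started connection every time instead of creating a new one.
    // NOTE: callers must not close the returned connection; a full implementation
    // would wrap it to intercept close().
    public synchronized Connection createConnection() throws JMSException {
        if (shared == null) {
            shared = delegate.createConnection();
            shared.start();
        }
        return shared;
    }

    public Connection createConnection(String user, String password) throws JMSException {
        // Credentialed connections are simply passed through in this sketch.
        return delegate.createConnection(user, password);
    }
}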

Last but not least is fault tolerance. This is actually not part of the JMS spec; its authors did not want to dictate what should happen when services start failing for one reason or another. For example, nowhere does it specify what to do if a queue fills up (i.e. the consumer is not there anymore), so behaviour ranges from throwing an exception, to blocking, to simply throwing messages away. Under these kinds of conditions it is difficult for the developer to know what to do, since it will depend a great deal on the JMS provider's implementation. This is probably the least compelling aspect of the JMS spec: it defines the success scenarios very well, but when it comes to the edge-case failures it leaves everything to the implementation. Flexibility is good, but it makes it a pain to switch from one implementation to another. A tempting solution is to wrap the vendor's implementation with your own implementation that enforces a certain behaviour. This is not very satisfying, because it defeats the out-of-the-box solution that JMS should be. The only recommendation is to read the vendor's documentation very carefully before choosing (or switching to) a provider, to make sure it offers satisfying fault tolerance mechanisms for your project.
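For what it's worth, one shape such a wrapper could take, assuming your provider throws an exception on failure, is a small helper that retries a send a fixed number of times so the application always sees the same behaviour regardless of the provider; everything here is made up for illustration.

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;

public class RetryingSender {

    private final int maxAttempts; // assumed to be at least 1

    public RetryingSender(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Sends the message, retrying on failure, so that every provider behaves
    // the same way from the application's point of view.
    public void send(MessageProducer producer, Message message) throws JMSException {
        JMSException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                producer.send(message);
                return; // success
            } catch (JMSException e) {
                lastFailure = e; // provider-specific failure, try again
            }
        }
        throw lastFailure;
    }
}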

So, all in all, while I still consider the JMS spec to be very well designed, there are a few practical matters that make it difficult to get right in practice. Some are due to developer "laziness", but others, like the complete absence of any specified fault tolerance or reliability mechanism, are embedded into it. However, there is nothing that cannot be solved with a little discipline on both the architect's side and the developer's side, so with this in mind, it should be a tool of choice in your asynchronous communication toolbox.