The Java Code Monkey: August 2010

I've had to look into possible performance optimizations for a product lately, and as part of that I wanted to see whether there was anything to gain on the serialization/de-serialization front. So I did a little research on how object serialization can be customized, and I thought I would share the results of my poking around.

Java Serialization is the basic mechanism for turning your objects into a byte stream that you can store or transmit. The usages are many, but the usual suspects are storage to disk, RMI and object cloning. Making objects serializable in Java is as simple as making your class implement the Serializable interface. That is the theory at least, since every non-transient, non-static field of said class must also be serializable, i.e. it must be a primitive or refer to an object whose class implements Serializable. Should that not be the case, you will quickly discover it at runtime in the form of a NotSerializableException. This is all very simple and quite powerful: a lot of functionality opens up to you just by tagging your class with the Serializable interface. Of course simplicity usually comes with a price, and in this case you pay a performance tax.

There are generally two standard ways of customizing serialization (there is a third variation, which I show later):

Implementing the writeObject/readObject methods.
Implementing the Externalizable interface.

Externalizable gives you full control over the serialization process of an object whereas implementing writeObject/readObject just plugs you into the standard serialization flow. There are differences but most are fairly subtle and not so obvious because they both ask you to implement methods that are almost identical. In the writeObject/readObject case:
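For reference, here is a minimal sketch of the two hook methods with the signatures the serialization runtime looks for. The class name HookSignatures and its single field are hypothetical; the bodies simply delegate to the default behaviour.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class, used only to show the hook signatures.
class HookSignatures implements Serializable {
    private static final long serialVersionUID = 1L;

    private String value = "example";

    String getValue() { return value; }

    // Must be private, return void and take exactly an ObjectOutputStream.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // standard handling of the non-transient fields
    }

    // Must be private, return void and take exactly an ObjectInputStream.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject(); // standard handling of the non-transient fields
    }
}
```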

and in the Externalizable case:
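The two methods of the java.io.Externalizable interface look like this. The class ExternalizableSignatures and its field are hypothetical and only there to make the sketch compile.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical class, used only to show the interface methods.
class ExternalizableSignatures implements Externalizable {
    private String value = "example";

    // Externalizable requires a public no-argument constructor.
    public ExternalizableSignatures() {
    }

    String getValue() { return value; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(value);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        value = in.readUTF();
    }
}
```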

The concept is the same for both approaches. You are given an output stream (an ObjectOutputStream or an ObjectOutput) to write the state of the object to, and at the other end you get the matching input stream to read the state from. Imagine a class with two fields; the implementation may look like this:
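A minimal sketch of such a two-field class, here using Externalizable. The class name TwoFieldBean and its fields are assumptions made for illustration, and the String is assumed to be non-null.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical bean with two fields.
class TwoFieldBean implements Externalizable {
    private long id;
    private String name; // assumed non-null in this sketch

    // Required by Externalizable: called before readExternal on de-serialization.
    public TwoFieldBean() {
    }

    TwoFieldBean(long id, String name) {
        this.id = id;
        this.name = name;
    }

    long getId() { return id; }
    String getName() { return name; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        id = in.readLong();
        name = in.readUTF();
    }
}
```

The read side mirrors the write side call for call; getting the order wrong does not fail at compile time, it just produces garbage or an exception when reading.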

The fact that the writeObject/readObject methods are marked private is not a mistake: they have to be, otherwise it does not work. In fact, a mistake in the signature generates no error during compilation; the method is simply never called at runtime. Although most IDEs will help you nowadays, it is quite easy to get this wrong, whereas Externalizable guarantees a compilation error if the method signature is wrong. Apart from that, the methods look very similar and would in many cases have the same implementation. The implementation shown above could just as well have been the implementation for writeObject/readObject.

I wrote a number of different serialization implementations for a simple bean with 12 fields: 3 Long fields, 3 Double fields, 3 String fields and 3 Date fields. This is fairly representative of the objects transferred in the project of interest to me right now. The raw results are shown below. I have chosen two measurements: the time it takes to serialize/de-serialize and the size of the object when serialized. The test is run on Java 1.6_21 64 bit (server mode) on a standard PC with an Intel i7 920 at 2.67 GHz and 6 GB of memory. The code is available here. You will of course not get the same times from one run to the other, but the proportions should remain the same. Times are averaged over 5000 serialized objects and repeated a number of times. The sizes are also an estimate, because the test beans contain random Strings which, depending on their content, do not serialize to the same size. All in all though, this varies little from one run to the other.

Bean Used	Serialization (ms)	De-Serialization (ms)	Total (ms)	First Object Size (bytes)	Subsequent Object Size (bytes)
Standard Serialization	40	30	70	597	201
Dumb Externalizable	29	19	48	377	198
Standard Serialization with Primitive Fields	17	12	29	410	160
Dumb Externalizable with Primitive Fields	14	12	26	245	168
Efficient Serialization	7	7	14	427	148
Efficient Externalizable	9	5	14	198	145
Efficient Externalizable with no null Handling	8	5	13	194	132

Okay, so what does this mean? To make the results easier to understand, here is a short description of each bean used.

Standard Serialization: Nothing special is done but tag the bean with the Serializable interface.
Dumb Externalizable: The bean implements Externalizable but all it does is call writeObject for each field on the object.
Standard Serialization with Primitive Fields: Nothing special is done but tag the bean with the Serializable interface, the only difference here is that primitive fields are used instead of the object wrapper (e.g. long instead of Long)
Dumb Externalizable with Primitive Fields: Same as the Dumb Externalizable but primitive fields are used instead of the object wrappers, which also means that we do not use writeObject for these fields but, for example, writeLong.
Efficient Serialization: Implements writeObject/readObject and, instead of calling writeObject on each field, transforms the field into a primitive value first. In the case of the Date objects, the time is written as a long (milliseconds) and at the other end the Date object is recreated from that millisecond value.
Efficient Externalizable: Same as the Efficient Serialization case except the Externalizable interface is used.
Efficient Externalizable with no null Handling: Same as Efficient Externalizable but all fields are assumed to be non null.
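To make the "efficient" variants more concrete, here is a rough sketch of what such a bean could look like with three of the field types. The class name EfficientBean and the null handling (one boolean presence flag in front of each value) are my assumptions for illustration; the benchmark code may encode nulls differently.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.Date;

// Hypothetical bean: wrapper and Date fields are written as primitives.
class EfficientBean implements Externalizable {
    private Long count;
    private Date created;
    private String label;

    public EfficientBean() {
    }

    EfficientBean(Long count, Date created, String label) {
        this.count = count;
        this.created = created;
        this.label = label;
    }

    Long getCount() { return count; }
    Date getCreated() { return created; }
    String getLabel() { return label; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Assumed encoding: a presence flag, then the primitive value if present.
        out.writeBoolean(count != null);
        if (count != null) {
            out.writeLong(count.longValue());
        }
        out.writeBoolean(created != null);
        if (created != null) {
            out.writeLong(created.getTime()); // milliseconds since the epoch
        }
        out.writeBoolean(label != null);
        if (label != null) {
            out.writeUTF(label); // writeUTF is limited to 65535 encoded bytes
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        count = in.readBoolean() ? Long.valueOf(in.readLong()) : null;
        created = in.readBoolean() ? new Date(in.readLong()) : null;
        label = in.readBoolean() ? in.readUTF() : null;
    }
}
```

Dropping the presence flags gives the "no null handling" variant: one byte less per field, at the price of a NullPointerException if a field is ever null.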

A few things seem obvious by looking at the result:

Even a dumb implementation of Externalizable does better than the standard serialization.
An optimized implementation can save a significant amount of time and size.
Using primitives in your data gives a boost to serialization.
Serializable always produces a bigger first object than its Externalizable counterpart.

Now the first one is to be taken with a grain of salt. It is faster because standard serialization relies on reflection, and even a dumb implementation does better than that. It seems, however, that the more objects are serialized, the smaller the difference between the dumb and the standard serialization becomes. I suspect this is due to HotSpot doing its job and optimizing the standard code to the point where it is basically equivalent to the dumb implementation. Still, if serialization is not used often enough for this optimization to kick in, then even the most basic of implementations will save some time.

More interesting are the optimized implementations. You have to be able to do it, though: if your bean only has primitive fields, you will not get far. However, in the case of more complicated objects such as Date, just sending the millisecond representation can save a lot of time. A plain java.util.Date only wraps that millisecond value, so nothing is lost there. For richer types, such as a Calendar with its time zone and locale, you do lose information, but if that does not matter to you because all times are in UTC anyway, then there is much to gain.

Using primitives gives an immediate advantage, but you do lose some information as well: you cannot tell that a field has not been set. In Java, a Boolean field has 3 possible states: true, false and null. I'm not saying it is good, but this actually maps quite well to what a database supports, so the wrapper is likely more useful than the primitive.

If you are sending the stream over a network, size can be as important as time. Externalizable has an obvious advantage if only one object is sent. This is because standard serialization sends the full class description, including the name and type of every field, with the first object. This is not the case when using Externalizable (although some information about the object is still sent automatically, such as its type).
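The first-object overhead is easy to observe. The following sketch writes two instances of a hypothetical PlainBean to the same stream and reports how many bytes each one added; the class and method names are made up for the illustration, and the exact numbers depend on the bean.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical bean relying on standard serialization.
class PlainBean implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Long id;
    private final String label;

    PlainBean(Long id, String label) {
        this.id = id;
        this.label = label;
    }
}

class StreamSizeDemo {
    // Returns {bytes added by the first object, bytes added by the second object}.
    static int[] sizesOnSameStream(Object first, Object second) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.flush(); // push the stream header out before measuring
            int header = bytes.size();
            out.writeObject(first);
            out.flush();
            int afterFirst = bytes.size();
            out.writeObject(second);
            out.flush();
            int afterSecond = bytes.size();
            return new int[] {afterFirst - header, afterSecond - afterFirst};
        }
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = sizesOnSameStream(new PlainBean(1L, "first"), new PlainBean(2L, "other"));
        System.out.println("first object:  " + sizes[0] + " bytes");
        System.out.println("second object: " + sizes[1] + " bytes");
    }
}
```

The second object only needs a back-reference to the class description already in the stream, which is why it comes out smaller.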

So all of this is great, so why aren't you already coding your beans with custom serialization? Well, as always, things are not free and there is a cost to this. The main cost is going to be maintenance, and by that I am not simply referring to the time spent keeping the serialization code up to date, but also the time spent chasing mysterious bugs because someone forgot to update it. I would say that unless a lot of serialization is going on in your application, it is probably not worth it. One way to alleviate the maintenance issue is to generate the serialization code, which is worth considering.

Okay, so you are going with it and you are going to use custom serialization. Which method should you use? The impact on time spent is not much different between Externalizable and writeObject/readObject. There is a size advantage for Externalizable, but only for the first object. There is, however, one very significant difference between writeObject/readObject and Externalizable. Externalizable promises total control over the serialization of the object, and that is actually true. This becomes apparent if you are extending another class. Consider the following base class and a class extending it.

Now we will implement the serialization for the SerializableExtendingBean, first the writeObject/readObject version:
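A sketch of how the writeObject/readObject version could look, including a base class so that the example stands on its own. The base class name PersonBean and the deepClone helper are my inventions; the field names are taken from the sample output in this post. The subclass fields are marked transient and written by hand after defaultWriteObject(), which is one common idiom rather than necessarily what the original code did.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

// Hypothetical base class relying on default serialization.
class PersonBean implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;
    private String surname;
    private Date birthDate;

    PersonBean(String name, String surname, Date birthDate) {
        this.name = name;
        this.surname = surname;
        this.birthDate = birthDate;
    }

    String getName() { return name; }
    String getSurname() { return surname; }
    Date getBirthDate() { return birthDate; }
}

class SerializableExtendingBean extends PersonBean {
    private static final long serialVersionUID = 1L;
    // transient: these two fields are written by hand below.
    private transient String company;
    private transient String position;

    SerializableExtendingBean(String name, String surname, Date birthDate,
                              String company, String position) {
        super(name, surname, birthDate);
        this.company = company;
        this.position = position;
    }

    String getCompany() { return company; }
    String getPosition() { return position; }

    // Only concerned with this class's own fields; the base class is handled separately.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeUTF(company);  // assumed non-null in this sketch
        out.writeUTF(position);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        company = in.readUTF();
        position = in.readUTF();
    }
}

class WriteObjectCloneDemo {
    // Clones an object by serializing it to memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerializableExtendingBean clone = deepClone(
                new SerializableExtendingBean("Doe", "John", new Date(0L), "Doe Inc.", "CEO"));
        System.out.println(clone.getName() + " / " + clone.getCompany());
    }
}
```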

Then the Externalizable version:
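A comparable sketch for the Externalizable version. Again the base class PersonBean, the subclass name ExternalizableExtendingBean and the deepClone helper are assumptions for illustration. Note that the method bodies look almost the same as in a writeObject/readObject implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

// Hypothetical base class relying on default serialization.
class PersonBean implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;
    private String surname;
    private Date birthDate;

    PersonBean() {
    }

    PersonBean(String name, String surname, Date birthDate) {
        this.name = name;
        this.surname = surname;
        this.birthDate = birthDate;
    }

    String getName() { return name; }
    String getSurname() { return surname; }
    Date getBirthDate() { return birthDate; }
}

class ExternalizableExtendingBean extends PersonBean implements Externalizable {
    private String company;
    private String position;

    // Required by Externalizable; runs PersonBean() first, leaving the base fields null.
    public ExternalizableExtendingBean() {
    }

    ExternalizableExtendingBean(String name, String surname, Date birthDate,
                                String company, String position) {
        super(name, surname, birthDate);
        this.company = company;
        this.position = position;
    }

    String getCompany() { return company; }
    String getPosition() { return position; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Only this class's fields are written; nothing handles the base class.
        out.writeUTF(company);
        out.writeUTF(position);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        company = in.readUTF();
        position = in.readUTF();
    }
}

class ExternalizableCloneDemo {
    // Clones an object by serializing it to memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ExternalizableExtendingBean clone = deepClone(
                new ExternalizableExtendingBean("Doe", "John", new Date(0L), "Doe Inc.", "CEO"));
        System.out.println(clone.getName() + " / " + clone.getCompany());
    }
}
```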

They look very similar, except that one of them works and the other does not. The writeObject/readObject version works, but the Externalizable version produces the following output if used to clone an object:

Unsuccessfully cloned. Fields are missing, original: {company='Doe Inc.', position='CEO', name='Doe', surname='John', birthDate=Thu Jan 01 01:00:00 CET 1970}

Clone: {company='Doe Inc.', position='CEO', name='null', surname='null', birthDate=null}

All the fields in the base class have been reset to their default values. But why were they not reset in the writeObject/readObject version? Well, that is linked to the fact that the writeObject/readObject methods are private. The methods are not only called on the SerializableExtendingBean but also on the base class (in essence at least), meaning that we still benefit from the default serialization for the base class. To actually make the Externalizable version work we would have to do something like this:
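One possible shape of that fix, as a sketch: the Externalizable subclass writes and restores the inherited state itself. The base class PersonBean, its setters and the deepClone helper are hypothetical, and all fields are assumed to be non-null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Date;

// Hypothetical base class; the setters exist so a subclass can restore its state.
class PersonBean implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;
    private String surname;
    private Date birthDate;

    String getName() { return name; }
    String getSurname() { return surname; }
    Date getBirthDate() { return birthDate; }
    void setName(String name) { this.name = name; }
    void setSurname(String surname) { this.surname = surname; }
    void setBirthDate(Date birthDate) { this.birthDate = birthDate; }
}

class ExternalizableExtendingBean extends PersonBean implements Externalizable {
    private String company;
    private String position;

    public ExternalizableExtendingBean() {
    }

    ExternalizableExtendingBean(String name, String surname, Date birthDate,
                                String company, String position) {
        setName(name);
        setSurname(surname);
        setBirthDate(birthDate);
        this.company = company;
        this.position = position;
    }

    String getCompany() { return company; }
    String getPosition() { return position; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Inherited state first: every base field has to be listed here by hand.
        out.writeUTF(getName());
        out.writeUTF(getSurname());
        out.writeLong(getBirthDate().getTime());
        // Then this class's own state.
        out.writeUTF(company);
        out.writeUTF(position);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        setName(in.readUTF());
        setSurname(in.readUTF());
        setBirthDate(new Date(in.readLong()));
        company = in.readUTF();
        position = in.readUTF();
    }
}

class FixedExternalizableCloneDemo {
    // Clones an object by serializing it to memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }
}
```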

Now you really have yourself a maintenance nightmare with the Externalizable interface: if you add a field to the base class, you have to update every Externalizable class that extends it. Of course you can make this a lot easier by having the base class implement Externalizable as well and having the extending classes call super. This implies that you have control over the base class, which is not always the case.

For this reason if you have to deal with a hierarchy of objects, I would recommend using writeObject/readObject just because the odds of making a mistake are minimized as you do not have to worry about the parent classes. Externalizable is more flexible but if you are only dealing with simple beans you are unlikely to really need it.

Earlier I mentioned there was a third way to do serialization. It is more an expansion of the two other methods. It is the usage of the readResolve/writeReplace methods (private again). The basic idea is that you will delegate serialization/de-serialization to another object. An example below:
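A sketch of the idea follows. The class names SecurityClearanceCustomSerializationBean, SerializedObject and the TRAINEE constant come from this post; the fields (title, level), the marker bytes and the deepClone helper are assumptions made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SecurityClearanceCustomSerializationBean implements Serializable {
    private static final long serialVersionUID = 1L;

    static final SecurityClearanceCustomSerializationBean TRAINEE =
            new SecurityClearanceCustomSerializationBean("Trainee", 0);

    // Final fields and no no-argument constructor (field names are hypothetical).
    private final String title;
    private final int level;

    SecurityClearanceCustomSerializationBean(String title, int level) {
        this.title = title;
        this.level = level;
    }

    String getTitle() { return title; }
    int getLevel() { return level; }

    // Called by the runtime: the returned object is written instead of this one.
    private Object writeReplace() {
        return new SerializedObject(this);
    }

    static class SerializedObject implements Externalizable {
        private static final byte SHARED_TRAINEE = 0; // assumed marker values
        private static final byte OTHER = 1;

        private SecurityClearanceCustomSerializationBean bean;

        public SerializedObject() {
        }

        SerializedObject(SecurityClearanceCustomSerializationBean bean) {
            this.bean = bean;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            if (bean == TRAINEE) {
                out.writeByte(SHARED_TRAINEE); // a single byte for the well-known instance
            } else {
                out.writeByte(OTHER);
                out.writeUTF(bean.title);
                out.writeInt(bean.level);
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            if (in.readByte() == SHARED_TRAINEE) {
                bean = TRAINEE;
            } else {
                String title = in.readUTF();
                int level = in.readInt();
                bean = new SecurityClearanceCustomSerializationBean(title, level);
            }
        }

        // Called by the runtime: the returned object replaces this proxy after reading.
        private Object readResolve() {
            return bean;
        }
    }
}

class ClearanceCloneDemo {
    // Clones an object by serializing it to memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }
}
```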

So the object actually sent into the serialization stream when a SecurityClearanceCustomSerializationBean is encountered is an object of the internal class SerializedObject. What is the advantage of doing that? Well, it is the only way I know of to customize serialization for an object with final fields and no no-argument constructor. Also, for the statically defined default instances of the class, such as TRAINEE, we can actually reduce the part of the serialized stream that we control to a single byte. The whole object will not fit in a single byte, of course: there is always some definition overhead for an object, even an Externalizable one. Finally, if you clone a TRAINEE object, the clone will still have reference equality with the static field.

To finish, an interesting detail, that is not, as far as I can see, documented anywhere. What happens if you have a circular reference between two objects? Do you need to do anything? Let's look at the following example:
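A minimal sketch of such a pair of beans. The class name CircularBean and the deepClone helper are hypothetical; the bean field matches the description in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical bean that points to another bean of the same type.
class CircularBean implements Externalizable {
    private String name;
    private CircularBean bean;

    public CircularBean() {
    }

    CircularBean(String name) {
        this.name = name;
    }

    String getName() { return name; }
    CircularBean getBean() { return bean; }
    void setBean(CircularBean bean) { this.bean = bean; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(name);
        out.writeObject(bean); // may lead back to an object that is already being written
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        name = in.readUTF();
        bean = (CircularBean) in.readObject();
    }
}

class CircularCloneDemo {
    // Clones an object by serializing it to memory and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepClone(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        CircularBean first = new CircularBean("first");
        CircularBean second = new CircularBean("second");
        first.setBean(second);
        second.setBean(first);

        CircularBean clone = deepClone(first);
        System.out.println("cycle preserved: " + (clone.getBean().getBean() == clone));
    }
}
```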

Imagine two beans which point to each other via the bean field. When you serialize one of them, writeExternal is called; it serializes the other bean via writeObject, which triggers writeExternal on the other bean, which in turn tries to serialize the first object again. This does not go into an infinite loop, however. The reason is that writeObject does not re-serialize an object that has already been written to the same stream; it writes a back-reference to the earlier copy instead. This also means that if the object's internal values have changed in between, then too bad: the stream is not updated unless you call reset() on it first. It seems reasonable though, because having to resolve cyclic references yourself each time you want to do custom serialization would be quite painful.

That was a little bit more than I had planned for, but I hope you will find this useful if you ever want to do some custom serialization.


Sunday, August 1, 2010

Java Serialization: Using Serializable and Externalizable and Performance Considerations