Monday, January 23, 2012

Objects Revisited

Alan Kay is the inventor of Smalltalk, the first fully object-oriented language. I learned Smalltalk in the early eighties, and almost every day that I use Java I grind my teeth that James Gosling did not steal more ideas from Smalltalk. About 20 years ago, during an OOPSLA, Alan Kay presented the idea that data should always carry its own methods to access that data. His example was a tape (!) that would contain the data as well as the code to interpret that data.

I think this idea was very much at the core of the Java Management standard first proposed around 1997. Each device would have a Java VM on board and the management system could send little management programs that would be executed on the device. However good it sounded at the time (and I tried to push this idea in Ericsson), the idea never became successful. It was just too complex to make it work reliably on a larger scale, where machines have different versions and are implemented in more languages than I could ever learn in this life. It was too complicated, error prone, and risky. Exchanging, or relying on, arbitrary code between loosely coupled machines turned out to be a surprisingly bad idea. Objects, however useful they are in many places, seem to be getting more and more in the way when you build larger distributed systems.

The reason that objects are so ill suited to go outside their process is that leaving the process forces them to expose their innards, the very thing objects try so hard to hide. Even if we could encapsulate the data during the transition, as Alan Kay suggested, we would create a huge burden on the receiver to understand (and trust) the code that encapsulates the data. We would also create a huge dependency problem: the code provided with the data must actually run correctly on the receiver.

There has always been an impedance mismatch between persistence and object orientation. JPA does a decent job but there is something fishy when you need such huge, complicated, and performance intensive middleware only to simplify the life of the developer. Recently I've been doing some more thinking about this subject and I think that though objects work beautifully in a single process they are ill suited for anything that involves crossing the process boundary, which obviously includes persistence.

Last week during an OSGi EG conference call the problem came up again during the discussion of a specification: do we support serialization for some of the domain objects or not? What is often not realized is that a serialization format is a public interface, since it is shared with the world; it is not an internal implementation detail. This is the essence of modularity: there is an inside and there is an outside. What stays on the inside can be changed freely; what has escaped from the inside must be evolved carefully (and thus more expensively) since its dependencies are unknown.

The problem is acute with interface based programming. Two systems running a service defined in interface S (maybe separated in time) that need to communicate their domain objects can only do so if the specification for S defines a serialization format. Putting a serialVersionUID in an interface is a total waste of bits (although it does occur!). The only solution that I see is to make the marshalling a first class citizen in the contract, since the data representation is part of the public API.
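To illustrate why such a constant in an interface is wasted: Java serialization only honors a serialVersionUID field declared directly in the class being serialized, so a constant inherited from an interface is ignored. The following is a minimal sketch; the names Customer, CustomerImpl, and effectiveSuid are hypothetical and only serve the illustration.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class InterfaceSuidSketch {

    // Hypothetical domain interface. A field declared in an interface
    // is implicitly public static final.
    public interface Customer extends Serializable {
        long serialVersionUID = 42L;
        String name();
    }

    // Hypothetical implementation that declares no serialVersionUID itself.
    public static class CustomerImpl implements Customer {
        private final String name;
        public CustomerImpl(String name) { this.name = name; }
        @Override public String name() { return name; }
    }

    // The serial version UID that Java serialization actually uses for a class.
    public static long effectiveSuid(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("declared in interface: " + Customer.serialVersionUID);
        // The value below is computed from the shape of CustomerImpl,
        // not taken from the interface constant.
        System.out.println("used for CustomerImpl: " + effectiveSuid(CustomerImpl.class));
    }
}
```

The implementation class gets a computed default value, so the two systems still have no agreed format unless the contract for S spells one out.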

However, what format should be used? The standard Java serialization format is quite awkward to parse except for implementation classes. There is good old XML, but JSON is increasing in popularity and there are enough other serialization standards out there to fill books. SQL is also a kind of serialization format. Picking one without making others unhappy will be hard.

I've come to the conclusion that the best format is actually ... Java. I started to use what I call data classes. These are classes with only public fields of primitives (or their wrappers), strings, data classes, and collections or arrays of data classes. This subset is very easy to (un)marshal with almost any available marshalling technique using simple rules and reflection. These data classes can act as a very convenient schema for my public interface to other processes, including me in the future (a.k.a. persistence). Since they are part of the Java type system they are easy to use and the compiler can do a lot of sanity type checking. And they can easily be versioned in OSGi.
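As a rough sketch of what such a data class and the "simple rules and reflection" could look like: the classes Person and Address and the marshal method below are hypothetical, invented for illustration, and not taken from any OSGi specification. The marshaller only turns a data class into nested maps and lists, a neutral shape that a JSON, XML, or database layer could then write out.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DataClassSketch {

    // Hypothetical data classes: public fields only, no behavior.
    public static class Address {
        public String city;
        public int zip;
    }

    public static class Person {
        public String name;
        public int age;
        public Address address;
        public List<Address> previous = new ArrayList<>();
    }

    // Converts a data class into nested maps and lists.
    public static Object marshal(Object value) {
        // Rule 1: strings and primitive wrappers are passed through as-is.
        if (value == null || value instanceof String || value instanceof Number
                || value instanceof Boolean || value instanceof Character) {
            return value;
        }
        // Rule 2: collections and arrays become lists of marshalled elements.
        if (value instanceof Collection) {
            List<Object> out = new ArrayList<>();
            for (Object element : (Collection<?>) value) {
                out.add(marshal(element));
            }
            return out;
        }
        if (value.getClass().isArray()) {
            List<Object> out = new ArrayList<>();
            for (int i = 0; i < Array.getLength(value); i++) {
                out.add(marshal(Array.get(value, i)));
            }
            return out;
        }
        // Rule 3: anything else is assumed to be a data class; each public
        // instance field becomes a map entry keyed by the field name.
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field field : value.getClass().getFields()) {
            if (Modifier.isStatic(field.getModifiers())) {
                continue;
            }
            try {
                out.put(field.getName(), marshal(field.get(value)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Not a data class: " + value.getClass(), e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Person p = new Person();
        p.name = "Ann";
        p.age = 30;
        p.address = new Address();
        p.address.city = "Paris";
        p.address.zip = 75001;
        // Field order in the printed map is not guaranteed by reflection.
        System.out.println(marshal(p));
    }
}
```

Because the rules depend only on public fields, the same few lines work for any class that sticks to the data class subset; unmarshalling would apply the same rules in reverse.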

The data classes are a solution to a problem I see becoming prevalent. They go against pure object orientation, but I honestly do not see another solution; the shared code model just does not work very well. Sad, but I think it is time to declare defeat. Maybe Java 8 should not steal from Smalltalk but steal the struct from C?

Peter Kriens

3 comments:

  1. I definitely find this post interesting. Storing data in separate classes, as discussed in the last few paragraphs, is the recommended approach with my component oriented framework TyphonRT. My efforts extend the "entity system" concept architecture-wide, also leaning on data oriented design concepts, which are a rising trend in game development. The two main component types are data and systems. Data components store just the data, often with public scope, that systems (the logic) act upon.

    I do plan to still address module versioning with OSGi, but have developed a lightweight / fast interface based programming API. I have yet to resolve it against the OSGi service model, but plan to do so soon as it would be nice to support module lifecycle issues and dynamic loading.

    Nonetheless it has been great to get away from idiomatic OOP, i.e. a focus on the implicit has-a and subsequently is-a relationships over the explicit hard coded has-a and inheritance as the core architecture organizational pattern. In brief, the code to access a Position data component from an entity looks like this: entity.getAs(DATA, Position.class).x; DATA being a static import for IData.class to reduce the verbosity.

    The last comment on introducing some form of struct support is something we from the Java game development community have wanted for over 10 years, with several posts back on JGO (www.javagaming.org) years ago. I.e. mapping an NIO byte buffer to data objects fast / efficiently is the dream.

    In my framework efforts serialization is handled by a separate system component that may implement serialization via any mechanism or combination thereof depending on the class / type of the data component being serialized. Reflection may be used for the general case, but for high throughput data something custom coded to the data component version can be used instead.

    Anyway I like this post as it leans towards the path I've been traveling on for the past 3+ years. I'll hopefully have a public release of my efforts later this year.

    We did have an opportunity to grab lunch at J1 back in '04, might have been '06, if you recall a young developer discussing said game engine tech w/ an interest in OSGi. It has been a long, but rewarding journey to say the least.

  2. Using data objects is a best practice in service oriented architectures. It is also often used together with a service facade that makes the domain API more coarse grained. Regarding OSGi, I think service facades and DTOs should also work very well with distributed OSGi.

    See also:

    http://en.wikipedia.org/wiki/Data_transfer_object

  3. I'm no expert, but the author of "Kilim" (which erjang is based on) made eloquent arguments for the same in his Google Tech talk.
    http://www.malhar.net/sriram/kilim/
