Sunday, March 30, 2014

The Atomic Optimization Problem

A decade and more ago I worked on XML software instead of Healthcare standards.  One of the projects I watched fail (but didn't work on) suffered from a major optimization failure.  Overall, the system design seemed OK, but it had one big problem: the fundamental mechanism by which information was stored was based on entities that were too small to optimize in the storage system.  Now, if you know XML, you know that there are a lot of different ways to identify each node in the XML tree, including storing the ancestor/child relationships, using XPath or XPointer to index nodes, et cetera. In this particular case, the XML tree was indexed by one of these mechanisms, and so was the schema for the XML. The beauty of the solution was that it worked for every XML tree and for arbitrarily complex schemas. The failure of the system was that all nodes in the XML tree were treated as equal, and there was NO way to optimize behaviors for different kinds of nodes or for the different schema structures that applied to each node.  So there were huge performance problems, the database needed a major redesign, and that led to a huge amount of rework and, eventually, project cancellation.
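
To make the failure mode concrete, here is a minimal sketch (in Python, purely illustrative and not that product's actual design) of an "every node is a uniform row" store.  Each XML node becomes one generic record linked to its parent, so there is nowhere to attach a per-element index or a specialized layout:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class NodeRow:
        node_id: int               # surrogate key for this node
        parent_id: Optional[int]   # ancestor/child link, one row per node
        kind: str                  # "element", "attribute", or "text"
        name: Optional[str]        # tag or attribute name, None for text nodes
        value: Optional[str]       # text or attribute value

    # <patient id="42"><name>Smith</name></patient> becomes four generic rows:
    rows = [
        NodeRow(1, None, "element", "patient", None),
        NodeRow(2, 1, "attribute", "id", "42"),
        NodeRow(3, 1, "element", "name", None),
        NodeRow(4, 3, "text", None, "Smith"),
    ]
    # Every query, whatever the schema, walks the same parent_id links; a
    # frequently queried element is stored no differently from an attribute
    # that will never be searched.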

Why this results in implementation failure is something that not everyone gets.  You see, optimization relies on being able to discern differences between different kinds of entities; that discernment is what lets you make implementation choices among representations so that some operations become more efficient than they would be under other choices. For example, denormalization is something that database architects sometimes do to trade storage space for speed.  The whole notion of a star schema is a specialized denormalization of data that supports hyper-fast queries at the cost of storage.  However, when everything looks the same, and the system has no way of identifying things that could be dealt with more efficiently, it becomes very difficult to optimize.
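
As a toy contrast between the two layouts (all names and fields here are made up, not any particular schema): answering "total sales by region" against a normalized layout requires a join through the customer table, while a denormalized fact table copies the region onto every row and skips the join entirely.

    # Normalized: the region lives only on the customer record, so the
    # aggregation has to look it up (the "join") for every order.
    customers = {101: {"name": "Acme", "region": "East"}}
    orders = [{"customer_id": 101, "amount": 250.0}]

    totals = {}
    for o in orders:
        region = customers[o["customer_id"]]["region"]
        totals[region] = totals.get(region, 0.0) + o["amount"]

    # Denormalized fact rows: the region is copied onto every order, trading
    # storage (and update cost) for a join-free aggregation.
    fact_sales = [{"customer_id": 101, "region": "East", "amount": 250.0}]

    totals_fast = {}
    for row in fact_sales:
        totals_fast[row["region"]] = totals_fast.get(row["region"], 0.0) + row["amount"]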

Systems based on simple atoms (doubles, triples, or quadruples) can be very efficient, because each atom is the same and you can always build bigger concepts out of smaller ones. This makes for very powerful systems, but not always ones which handle the bigger concepts as efficiently as they could be.  Consider LISP (the CAR/CDR pair), RDF (the subject-predicate-object triple), or even the HL7 RIM (act, participation, role, entity).  These models make for very powerful systems that are built on small (usually very small), quite efficiently handled atoms.  However, each large concept is built up from smaller concepts, which are built up from smaller ones, and so on.  So even though your "atom" is very efficient, the larger concept you have to deal with can rely on hundreds, or even in some cases thousands, of atoms.  And that is where the performance problem becomes nasty.
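
Here's a sketch of that fan-out (the identifiers and predicates below are invented for illustration, not real RDF or RIM vocabulary): a single blood-pressure reading that fits comfortably in one record becomes a pile of uniform triples, and a realistic clinical object fans out into far more.

    # One reading as a single record:
    reading = {
        "patient": "patient/42",
        "code": "blood-pressure",
        "systolic": 120,
        "diastolic": 80,
        "taken": "2014-03-30",
    }

    # The same reading flattened into subject-predicate-object atoms:
    triples = [
        ("obs/1", "type", "Observation"),
        ("obs/1", "subject", "patient/42"),
        ("obs/1", "code", "blood-pressure"),
        ("obs/1", "component", "obs/1/sys"),
        ("obs/1/sys", "value", 120),
        ("obs/1", "component", "obs/1/dia"),
        ("obs/1/dia", "value", 80),
        ("obs/1", "taken", "2014-03-30"),
    ]
    # Reassembling the record means many small lookups instead of one read,
    # which is where the per-atom efficiency stops helping.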

When the "schema" for the object is also represented in doubles, triples, or quads, and the storage system always handles each one of those units the same way, you've lost all chance at optimization.  And while I know it can be done, I've seen few products that even try, because this sort of optimization is truly a black art.

Even moving higher up the tree, as in the RIM, to act, participation, role, and entity doesn't always cut it.  There's just too much variation in the details of an act or an entity (or any other class), and those details matter to different degrees depending on what you are trying to do.

One of the things I like about FHIR is that the atomic level is the Resource, and each resource can be separately optimized to meet specific implementation needs.  I think this is the right granularity for efficient implementations.  Granted, I know the existing open source FHIR servers primarily use machine-generated code built from the resource definition tables in FHIR, but that doesn't stop someone from doing some decent optimizations on specific resources to meet an implementation requirement.  I would also expect a number of uses of FHIR to start from existing database stores which are already (hopefully) optimized to meet specific implementation requirements.
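
As a sketch of what resource-level granularity buys you (hypothetical code, not the design of any existing FHIR server): the store below keeps a generic map of every resource, but only Patient pays for, and benefits from, a family-name index, while another resource type could be tuned for its own query patterns.

    from collections import defaultdict

    class ResourceStore:
        def __init__(self):
            self.by_id = defaultdict(dict)              # resource type -> id -> resource
            self.patient_by_family = defaultdict(list)  # a Patient-specific index

        def put(self, resource_type, resource_id, resource):
            self.by_id[resource_type][resource_id] = resource
            # Optimization hook: only Patient maintains this index; Observation
            # could instead get, say, a (patient, code, date) index.
            if resource_type == "Patient":
                for name in resource.get("name", []):
                    self.patient_by_family[name.get("family")].append(resource_id)

        def find_patients_by_family(self, family):
            return [self.by_id["Patient"][rid] for rid in self.patient_by_family[family]]

    store = ResourceStore()
    store.put("Patient", "42", {"name": [{"family": "Smith", "given": ["Jo"]}]})
    print(store.find_patients_by_family("Smith"))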
