Tuesday, April 12, 2011

Effects of Efficient XML on HL7 Version 3

Diego Kaminker (co-chair of the HL7 Education workgroup) reminded me this morning that I'm overdue on a post about Efficient XML Interchange (EXI), a new standard recently recognized by the W3C.  The creation of this standard is pretty significant for several reasons.  XML and its predecessor SGML were designed as text markup languages that expert human users could use to annotate electronic text.  Because of this design, XML has become very easy for software engineers to use for a variety of different tasks.  XML users (and SGML users) have become quite familiar with editing raw markup directly in text files.  But the most common use for XML and its predecessor has been software processing of the text and its associated markup.

Uses for this markup abound and include display and formatting of text, book production (where I first encountered it), communication of software commands and responses to them (e.g., Web Services), structuring of tabular and hierarchical data, electronic commerce, et cetera.  One of the major complaints from the EDI world was that XML was notoriously costly for messaging in several ways:

  1. Data Size
    Converting data elements from a binary format to a text-based format increases the size of the data, often by an order of magnitude.  Using XML tags to delimit data elements, instead of position in a data field or simpler delimiters, adds quite a few more bytes to transmit.  End tags in XML are quite redundant: useful for humans, but not all that useful for computers at a certain point in the production cycle.  These additions can add yet another order of magnitude to storage requirements.
    Impacts on Data Size affect:
    1. Storage Capacity
    2. Transmission Bandwidth
    3. Memory Utilization
  2. Processing/Marshalling
    Converting from binary data types for numbers, dates, times and similar data types to text requires computing time.  Dealing with all those start and end tags, and parsing decisions on the text also requires computing time.  The compute time spent on these tasks could be better spent on OTHER things, especially given that parsed XML has similar representations on many different platforms.
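To make both costs concrete, here is a minimal sketch in Python.  The record layout and the element names are made up for illustration (they are not an HL7 structure); the point is only to compare a fixed-position binary encoding with the same values wrapped in start and end tags, and to show that the text form needs a parse plus string-to-number conversion on the way back in.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical record: patient id, systolic, diastolic, heart rate.
record = (123456, 120, 80, 72)

# Fixed-position binary layout (an assumption for this sketch):
# one unsigned 32-bit integer followed by three unsigned 16-bit integers.
LAYOUT = "<IHHH"
binary = struct.pack(LAYOUT, *record)

# The same four values as tagged text.
xml_bytes = (
    "<observation>"
    "<patientId>123456</patientId>"
    "<systolic>120</systolic>"
    "<diastolic>80</diastolic>"
    "<heartRate>72</heartRate>"
    "</observation>"
).encode("utf-8")


def parse_binary(data):
    """Values are recovered by position alone; no text conversion is needed."""
    return struct.unpack(LAYOUT, data)


def parse_xml(data):
    """Tags must be tokenized, then each text node converted back to a number."""
    root = ET.fromstring(data)
    return tuple(int(child.text) for child in root)


print("binary bytes:", len(binary))
print("XML bytes:   ", len(xml_bytes))
print("same values: ", parse_binary(binary) == parse_xml(xml_bytes))
```

Both forms carry the same information; the XML form is many times larger, and almost all of the difference is tag names.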
These problems make it difficult for devices with limited resources to efficiently use XML for computation or communication, even though the format has numerous other advantages for software development.  What are some of these benefits?
  1. Ready access to tools which make content visible and editable.  Because XML is fundamentally text, any text editor can be used to open it up and edit it.  You may not recall a time when this was a problem for other data, but I do.  
  2. Standards AND tools for describing the content allowed to be in an XML document.
  3. Standards AND tools to translate the content from one format to another.
  4. Engineer (which is not necessarily the same as human) readability.
  5. Implementability ... one of the guiding principles of the XML work was that it had a particular complexity goal in mind.  An XML parser should be implementable as a semester-long college senior project.
These benefits, together with Moore's law trends in storage, network speed, memory size and processor speed, have meant that the processing and data size issues have not significantly interfered with XML's dominance as a data syntax.  But small devices and high-bandwidth applications have still had some problems.  In the UK, one of the reported problems in adopting HL7 Version 3 was the verbosity of the V3 XML syntax.  In this particular case, the volume is on the order of hundreds of millions of messages per day.  The computation resources needed to parse the XML were significant, as was the bandwidth.

The HL7 Implementation Technology Specification workgroup began development in 2008 of an ITS that would simplify (flatten) the XML.  Those efforts began before I had even started this blog, so I don't have a post on how I felt about that particular effort.  I can tell you I was quite negative on that ballot, and it didn't go forward.  I ran a little test and found that several existing models were only marginally improved (<10%) by the new algorithm.  I believe I argued successfully at that time that the right way to approach the problem was lower in the stack, rather than at the XML ITS layer.

This argument applies not just to HL7 XML messaging, but to any form of XML processing.  What the application deals with is the XML Infoset, typically stored using the XML Document Object Model.  What is communicated to the application is an XML document.  Between communication and processing is a layer which translates the XML document from the XML syntax into the XML Infoset.  That's where EXI has a huge impact.  By changing the format from text-based content to a binary one designed for compactness and fast processing, an EXI implementation is better able to perform the translation back and forth between these layers.  It does so with a much reduced footprint from both a storage and a processing perspective.  Diego reports to me that in his brief experiment, a 60KB CDA document is compressed at a ratio of 20:1, which nearly makes up for the 1-2 orders of magnitude size increase.  Diego plans on performing other tests to evaluate performance.
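The layering can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the `DECODERS` table, the `load_document` and `patient_names` functions, and the sample document are all invented for the example, and only the ordinary text-XML decoder (the standard library's `xml.dom.minidom`) is actually wired in.  An EXI decoder would be one more entry in the table, and the application code, which only ever sees the DOM, would not change.

```python
from xml.dom import minidom


def decode_text_xml(data):
    """Translate XML syntax (bytes on the wire) into a DOM tree."""
    return minidom.parseString(data)


# Hypothetical registry of syntax decoders.  An EXI decoder that returns
# the same kind of DOM tree could be registered here under another key.
DECODERS = {"xml": decode_text_xml}


def load_document(data, fmt="xml"):
    """The translation layer: pick a decoder by wire format."""
    return DECODERS[fmt](data)


def patient_names(doc):
    """Application logic: it works on the tree, not on the wire syntax."""
    return [el.firstChild.data for el in doc.getElementsByTagName("name")]


sample = (
    b"<patients>"
    b"<patient><name>Ann</name></patient>"
    b"<patient><name>Bob</name></patient>"
    b"</patients>"
)

names = patient_names(load_document(sample))
print(names)
```

Because the change sits below the application, the models and the XML ITS stay exactly as they are; only the decoder at the bottom of the stack is swapped.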

What I have been able to determine, from the W3C bake-off comparing the various technologies they considered and from the vendor's website for the technology that "won", is that you can also expect somewhere between 1-2 orders of magnitude improvement in processing (parsing) speed.

What happens next?  Now that EXI has become a standard, people are going to want to start implementing it in their products.  There are already two open source implementations and one commercial implementation of the standard available.  I think you can count on EXI being incorporated into your favorite XML parser pretty quickly.  Java implementations will probably be available sooner than C++/.NET.  Once web servers and browsers start supporting this technology, it will be interesting to see what it does to the browsing experience on sites that support it and which exchange XML or XHTML.

One of the nice features of the EXI standard is that you can enable others who communicate with you to take advantage of it quite readily.  It plugs into the communications stack at the content encoder/decoder.  At least one of the commercial products out there will EXI-enable your protocol stack.  You won't get all of the benefits of using EXI yourself, but at least your communication partners could.  For XML-based Web Services, this is a no-brainer feature to support.  Just make sure your client and server technologies support EXI on the stack and support the use of exi (the content-coding token the EXI specification registers) as an encoding in the Content-Encoding and Accept-Encoding headers.
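Here is a rough sketch of what that negotiation looks like on the server side, in Python.  The function names are my own invention for illustration, and the EXI coding token is passed in as a parameter rather than hard-coded (I believe `exi` is the value registered for HTTP content codings, but check what your stack actually uses).  The sketch only picks a coding; producing the EXI bytes is left to whatever EXI library sits in the stack.

```python
def parse_accept_encoding(header):
    """Return {coding: qvalue} parsed from an Accept-Encoding header value."""
    codings = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        params = params.strip()
        q = 1.0  # A coding listed without a q-value is fully acceptable.
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # Treat a malformed q-value as "not acceptable".
        codings[name.strip().lower()] = q
    return codings


def choose_coding(accept_encoding, exi_token, fallbacks=("gzip",)):
    """Prefer EXI when the client accepts it, else a fallback, else identity."""
    accepted = parse_accept_encoding(accept_encoding)
    if accepted.get(exi_token, 0.0) > 0.0:
        return exi_token
    for coding in fallbacks:
        if accepted.get(coding, 0.0) > 0.0:
            return coding
    return "identity"


# The chosen value would go in the response's Content-Encoding header.
print(choose_coding("gzip, exi", "exi"))
print(choose_coding("gzip;q=0.8", "exi"))
```

A client that never sends the EXI token keeps getting what it gets today, which is why turning this on for partners costs you very little.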

