Thursday, December 29, 2011

Is XML Schema Worth it?

In a post yesterday, Wes Rishel asks whether XML Schema has been worth it.  He goes on to complain about its failings as a validation technology, and then segues to JSON.

There really are three different problems that Wes is discussing.  The first is whether data can be described in a way that can be easily parsed.  The second is whether the data that has just been parsed is properly structured.  The last is whether the structured communication makes sense:

  1. How easy is this stuff to parse?
  2. Is it structured right?
  3. Does it mean something for my business?

The JSON/XML debate addresses the first question.  JSON is certainly easier to parse than XML.  But right now there are a lot more tools for dealing with parsed XML afterwards than there are for JSON (that will change in the future).

Structure brings us back to Wes's original question: has XML Schema been worth it?  I'd have to say that it has been.  From an industry perspective, there really is no denying that XML Schema provides a great deal of value across the IT domain, and without it, there's a lot of stuff we wouldn't have been able to do.  It hasn't been the simplest technology to work with.  But from Schema came web services (of the sort used within enterprises), information models, mappings from databases to XML and vice versa, and a whole host of other cool stuff.  Schema hasn't solved every problem that exists; as a specification language, it has its limits.  XML Schema is limited to creating and parsing structures that can be handled without look-ahead.  It makes it pretty easy to create context-free languages for communicating between systems.

On the JSON front, there was an attempt to create a JSON Schema language, but it seems to have died on the vine.  Unfortunately, the "JSON" crowd sees little need for schema, because JSON, RESTful and all the other Web 2.0 parts are intuitively easy to use and therefore validation isn't needed (that really is an oversimplification of the case).  I could spend a whole post on that topic (and probably will in the near future).

Business Meaning
Unfortunately, many problems require context-sensitive validation, which you cannot get from a simple context-free grammar production.  Co-occurrence constraints (this code here implies that kind of code in that structure over there) are examples of rules that introduce context-sensitivity into validation problems.  What XML Schema cannot do by itself can be handled with the help of languages like Schematron, which deals with things like co-occurrence very easily.  Schematron is a very easy-to-use, XPath-based validation language, but it is miserable at other tasks, like creating easily understood structures.  Context-sensitive validation usually addresses things like business rules (this code isn't allowed to be used with that one), otherwise known as "edits" to coders.
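To make the idea of a co-occurrence constraint concrete, here is a minimal sketch in Python using only the standard library.  It is not Schematron, just the same kind of check written by hand.  The element names, the code "P-MATERNITY" and the rule itself are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical claim document; element and code names are made up.
CLAIM = """
<claim>
  <procedure code="P-MATERNITY"/>
  <patient sex="M"/>
</claim>
"""

def check_cooccurrence(root):
    """Return a list of rule violations (empty if the document passes)."""
    errors = []
    procedure_codes = {p.get("code") for p in root.findall("procedure")}
    patient = root.find("patient")
    sex = patient.get("sex") if patient is not None else None
    # Co-occurrence rule: this procedure code here implies that
    # patient sex over there. A grammar alone cannot express this.
    if "P-MATERNITY" in procedure_codes and sex != "F":
        errors.append("P-MATERNITY requires patient sex F")
    return errors

errors = check_cooccurrence(ET.fromstring(CLAIM))
print(errors)
```

The document above is structurally fine (every element is where a schema would expect it), yet it fails the rule, which is exactly the gap between structure and business meaning.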

In summary
In the end, to deal with these validation problems, it doesn't matter whether your parentheses are shaped like this <> (XML) or like this { } (JSON) or even like this () (LISP).  Eventually, you WILL want to validate it (even if it is in JSON).
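A rough sketch of that point, assuming a made-up "order" message and a made-up rule: once the data is parsed, the same check applies regardless of which syntax it arrived in.  All names here are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

def from_xml(text):
    """Flatten a one-level XML document into a dict of tag -> text."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

def from_json(text):
    """Parse a JSON object into a dict."""
    return json.loads(text)

def validate(order):
    """Hypothetical business rule: shipped orders need a tracking number."""
    errors = []
    if order.get("status") == "shipped" and not order.get("tracking"):
        errors.append("shipped orders need a tracking number")
    return errors

xml_doc = "<order><status>shipped</status></order>"
json_doc = '{"status": "shipped"}'

xml_errors = validate(from_xml(xml_doc))
json_errors = validate(from_json(json_doc))
print(xml_errors == json_errors)
```

Both messages parse cleanly and both fail the same rule; the shape of the brackets made no difference to the validation problem.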

Tools like XML Schema can aid with that validation, but they don't and never really will solve the entire problem. It doesn't really matter what you do, or what tool you work with, because that last problem, addressing validation at the business level, is not a technology problem.  Business rules are created by people, and they defy logic and software algorithms.

-- Keith


  1. I'd like to say more on this topic later, but for now I'll mention that this discussion reminds me strongly of the shift in outlook on database design forced by clinical trial databases and EHRs, and by an increasing number of complex database problems outside biomedicine and health IT. I am referring specifically to what Prakash Nadkarni of Yale refers to as the EAV/CR Model of Data Representation.

  2. What about XML-Schema 1.1?
    It combines classic XML-Schema with Schematron.
    In our standards development (CDISC ODM) we can enforce about 60% of the rules with XML-Schema, and 30% with Schematron. So if we had XML-Schema 1.1, I guess we could enforce about 90% of the rules.
    The problem, however, seems to be that there are still very few validators that support XML-Schema 1.1.

    The problem with business logic is that it is not always clear: I have seen so many standards specifications that can be interpreted in one way or another, depending on the reader's background, culture, ...

  3. It would have been preferable if XML-Schema 1.1 or OASIS CAM had come along before XML-Schema 1.0... Curious to see if CAM begins to bleed over into Health IT at points where Health and Justice meet.