Tuesday, December 27, 2011

Element Order Schema Important Not IS XML

A discussion on the IHE XDS Implementors Google Group spawned this question:

Why is order important in XML Schemas in cases where the order really doesn't matter with respect to data structuring requirements?  After all, the real issue is just that you have some number of child elements of a particular type.  Why should order matter?

There are a couple of answers to this question.

First is simply that the order is important when the schema says it is.  There are cases where the order of a collection of items has meaning.  This usually occurs in narrative.  Language is quite sensitive to order -- at least in "proper" construction, but as my wife often notes in our communication, "Order word important not is."

It does make sense to put the table header element (<thead>) before the body of the table (<tbody>), and it eases processing (it also simplifies table formatting to put the <tfoot> before the <tbody>).  There are also cases where order really doesn't matter.  Compare, for example, dates in the European locale to those in the US locale.  Today can be encoded as either <month>12</month> <day>27</day> <year>2011</year>, or <day>27</day> <month>12</month> <year>2011</year> or even <year>2011</year> <month>12</month> <day>27</day> without any loss of meaning.
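To make the date example concrete, here is a minimal Python sketch. The wrapper element `date` and the helper `read_date` are assumptions made up for illustration; they do not come from any schema discussed here. Because the helper looks the children up by name rather than by position, all three orderings produce the same value.

```python
import xml.etree.ElementTree as ET

def read_date(xml_text):
    """Return (year, month, day), finding each child by name, not position."""
    root = ET.fromstring(xml_text)
    return tuple(int(root.findtext(name)) for name in ("year", "month", "day"))

# The three orderings from the paragraph above, wrapped in a hypothetical <date>.
us = "<date><month>12</month><day>27</day><year>2011</year></date>"
eu = "<date><day>27</day><month>12</month><year>2011</year></date>"
iso = "<date><year>2011</year><month>12</month><day>27</day></date>"

# Collecting the results in a set shows they all collapse to one value.
results = {read_date(doc) for doc in (us, eu, iso)}
print(results)  # {(2011, 12, 27)}
```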

In SGML DTDs there are three different operators for content models:

  • The comma (,) created lists (xsd:sequence in XML Schema).
  • The vertical bar (|) created choices (xsd:choice in XML Schema).
  • And the ampersand (&) created conjunctions where all elements needed to be present in any order (xsd:all in XML Schema).

The ampersand operator was not actually supported by the XML DTD content specification.

Why would you make order important when there is no other requirement for it to be so?

Another reason why order is important is that it makes parsing XML easier.  Most XML Schema constructs can be parsed quite simply, without any look-ahead, using finite automata.  While the "xsd:all" construct can be readily converted to a data structure that supports parsing, you cannot use a finite automaton indiscriminately.  If every permitted ordering is expanded into its own path, the number of states needed to support the "xsd:all" construct is on the order of N!, where N is the number of particles in the list of elements allowed.  For example, in the date example given above, the first element could be year, month or day.  After that, there are two ways left to choose the next element, and then only one to choose the last.  See the list below.

  1. <Year>
    1. Y M D
    2. Y D M
  2. <Month>
    1. M Y D
    2. M D Y
  3. <Day>
    1. D Y M
    2. D M Y
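The enumeration above can be reproduced with a few lines of Python. This is only a sketch to show the growth rate; the single-letter particle names are shorthand for the year, month and day elements.

```python
from itertools import permutations
from math import factorial

particles = ["Y", "M", "D"]  # shorthand for year, month, day

# Every ordering that an "all" group over these particles has to accept.
orderings = [" ".join(p) for p in permutations(particles)]
for ordering in orderings:
    print(ordering)

# The number of orderings grows as N!, so expanding each one quickly gets expensive.
for n in (3, 5, 8):
    print(n, factorial(n))
```

With three particles there are only six orderings, but eight particles already give 40320.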

XML (and XML Schema) is designed to be parsed and validated without using look-ahead because SGML (its predecessor) had the same constraint.  So parsers that deal with "xsd:all" typically keep a list of the particles, and check afterwards that each of them was used no more than once.
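A rough sketch of that bookkeeping in Python follows. The function name `validate_all` is hypothetical, and the sketch assumes every particle is required exactly once; a real validator also has to handle optional particles. It makes one pass over the child names with no look-ahead, and tracks what it has seen in a set instead of building a state per ordering.

```python
def validate_all(children, required):
    """Check child names against an 'all' group in a single pass.

    `children` is the list of child element names in document order.
    `required` is the set of particle names (each assumed required once).
    Returns a list of error strings; an empty list means valid.
    """
    seen = set()
    errors = []
    for name in children:
        if name not in required:
            errors.append(f"unexpected element: {name}")
        elif name in seen:
            errors.append(f"repeated element: {name}")
        else:
            seen.add(name)
    # After the pass, report any particle that never showed up.
    for name in sorted(required - seen):
        errors.append(f"missing element: {name}")
    return errors

date_parts = {"year", "month", "day"}
print(validate_all(["day", "month", "year"], date_parts))  # []
print(validate_all(["day", "day", "year"], date_parts))
# ['repeated element: day', 'missing element: month']
```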

Even so, it's much simpler to create a parser that doesn't need to worry about this sort of stuff.  This is why the & content model construct does not appear in the XML 1.0 DTD content model, and was only reintroduced with XML Schema.

Another reason why order is important has to do with how elements are extended in XML Schema.  A complex type can be defined that extends another complex type by appending elements to the end.  This makes it easy for the parser to figure out what goes where.  Essentially what it does is create an xsd:sequence containing the content model of the base type followed by the content model of the new type.  This means that sequences extend naturally (because a sequence of two sequences is the same as one sequence with all the particles of the two sequences put together in order), but xsd:all groups do not (because a sequence of two xsd:all groups is not the same as one xsd:all group containing the particles of the two).
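The difference can be checked by brute force. The Python sketch below is only an illustration: the particle names a, b, c, d and the matcher functions are made up, and every particle is assumed to be required exactly once. It tries all orderings of the four elements against a base group followed by an extension group, and against a single merged group.

```python
from itertools import permutations

def match_sequence(children, names):
    """A sequence accepts the names in exactly the declared order."""
    return list(children) == list(names)

def match_all(children, names):
    """An 'all' group accepts the names in any order, each exactly once."""
    return sorted(children) == sorted(names)

def match_pair(matcher, children, first, second):
    """Base content model followed by the extension's content model."""
    n = len(first)
    return matcher(children[:n], first) and matcher(children[n:], second)

base, ext = ["a", "b"], ["c", "d"]  # hypothetical base and extension particles
candidates = [list(p) for p in permutations(base + ext)]

seq_pair = [c for c in candidates if match_pair(match_sequence, c, base, ext)]
seq_merged = [c for c in candidates if match_sequence(c, base + ext)]
all_pair = [c for c in candidates if match_pair(match_all, c, base, ext)]
all_merged = [c for c in candidates if match_all(c, base + ext)]

print(seq_pair == seq_merged)          # True: the two sequences behave as one
print(len(all_pair), len(all_merged))  # 4 24: the two all groups do not
```

For example, the ordering a c b d is accepted by the merged all group but rejected by the pair, because c arrives before the base group is complete.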

Now, a brief note on how to create extensible Schemas.  The trick is to use wildcards.  You will typically have a complex type definition for the content of an item, and that will contain some sort of group (usually a sequence).

<xsd:complexType name="extendableElement">
  <xsd:sequence>
    <xsd:element name="foo" type="fooType"/>
    <xsd:element name="bar" type="barType"/>
    <!-- Element wildcard: any further elements may follow foo and bar -->
    <xsd:any minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
  <!-- Attribute wildcard: any additional attributes are allowed -->
  <xsd:anyAttribute/>
</xsd:complexType>

What the wildcard at the end does is allow any element to be included at the end of your sequence, or any attribute to be added to the extendable element.  You could include namespace='##other' to say that the element (or attribute) has to be from a namespace other than your schema's target namespace.  This is in fact what I proposed as being the best way to extend HL7 V3 XML these days.
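As a rough illustration of what that wildcard permits, here is a hand-rolled check in Python using only the standard library. It is not a schema validator. The namespace URIs, the `item` element and the function names are hypothetical, and the check mimics namespace='##other' under the assumption that it excludes both the target namespace and unqualified elements.

```python
import xml.etree.ElementTree as ET

TARGET_NS = "urn:example:mine"  # hypothetical target namespace

def namespace_of(tag):
    """ElementTree spells qualified names as '{namespace}local'."""
    return tag[1:].split("}")[0] if tag.startswith("{") else None

def validate_extendable(elem):
    """Expect foo, then bar, then any number of elements from other namespaces."""
    children = list(elem)
    expected = [f"{{{TARGET_NS}}}foo", f"{{{TARGET_NS}}}bar"]
    if [c.tag for c in children[:2]] != expected:
        return False
    # Wildcard tail: reject the target namespace and unqualified elements.
    return all(namespace_of(c.tag) not in (None, TARGET_NS) for c in children[2:])

good = ET.fromstring(
    '<m:item xmlns:m="urn:example:mine" xmlns:x="urn:example:ext">'
    "<m:foo/><m:bar/><x:extra/><x:more/></m:item>"
)
bad = ET.fromstring(
    '<m:item xmlns:m="urn:example:mine">'
    "<m:foo/><m:bar/><m:foo/></m:item>"
)
print(validate_extendable(good))  # True
print(validate_extendable(bad))   # False
```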

And so, now you know why order is important in most XML Schemas, even when it is not.


3 comments:

  1. Great post, Keith. This seems like an approach to deal with the extensibility we lost going from V2 to XML, as long as you couple the convention with operating rules for dealing with the Any content (ignore it, don't create an application error). Obviously one would never do extension in attributes rather than elements.

    Here is a question back to you ... do we really get enough bang for the buck to justify using XML schema at all? It certainly has to be supplemented by xpath-based rules to begin to validate the content. Relaxing to simply well-formed XML with xpath-based validation would (a) make extensibility easier and (b) open up the possibility of using JSON.

  2. "Unknown" above is Wes Rishel. I don't know why Blogger asked for my name and credentials and then called me "unknown."

  3. Unfortunately that approach to extensibility only works if the extensions will be in a foreign namespace (which makes it useless as a forward-compatibility mechanism) or if the last element before the extension has both minOccurs and maxOccurs=1. In any other circumstance, the parser can't decide whether elements should be bound to the declared schema elements or to the anyType and will declare the schema as unreconcilable. That's why we didn't use this approach with 2.xml (as much as we wanted to.)

    It would have been *so* nice if the w3c had clearly declared that when encountering an xsd:anyType, the parser should treat all preceding element declarations as "hungry", meaning they will automatically bind to any instance elements they legally can and that the xsd:anyType only binds to the first element that can't bind to the declared elements (as well as all elements following that first element).
