Thursday, July 12, 2012

An XSLT Breadth-First Search for Processing XML Schema (TL;DR)

Last night after spending several hours working with the HL7 HQMF ballot content, I started getting frustrated with the content structure.  The HL7 Publication DTD (yes, I know) is rather roughly documented.  It's based on the W3C Publication DTD, and as in many things in IT, "nobody knows how that works anymore".  Actually, I know just the person who does, but he's retired into his second career, taking pictures of birds.

One of the limitations (and a not unreasonable one at that) is the level of nesting allowed.  They only number sections to the fourth level.  After that, it starts getting funky (and it looks bad too).  So, I want to navigate through the R-MIM Walkthrough, and I just can't organize things the way I want.  One of the problems is that the document has been edited by several people, and the outline wasn't consistent.  Since this section was basically a walk of the DAG (directed acyclic graph) that is the RMIM, I figured I could probably automate generating the Table of Contents.

So I got someone to give me the latest schema generated by the HL7 tools, and started writing a stylesheet to traverse the schema and generate the outline.  Well, that was daring.  After all, here we are in the 11th hour (or perhaps at this point the 35th hour), and I want to completely reorganize a major chunk of the documentation.  Well, needless to say, I didn't get any sleep last night, but that's not because I reorganized things, but rather because of what I discovered when I did finally succeed.  I'll talk about that later.  What I really want to discuss today is this cool stylesheet that I wound up building.

You see, it walks the Schema, starting with the complexType that is used for the root of the HQMF document schema.  It then identifies each Participation, Role, Entity and ActRelationship in the tree, and generates my outline.

Here's the general structure I worked out.  It isn't ideal, but it works and people can navigate it.

1.1 Act
1.1.1 Act Attributes
1.1.1.1 Act.attribute1
1.1.1.1 Act.attribute2
1.1.2 Act Participations
1.1.2.1 Act.participation1
1.1.2.2 participation1.attribute1
1.1.2.3 participation1.attribute2
1.1.2.4 participation1.role1
1.1.2.5 role1.attribute1
1.1.2.6 role1.attribute2
1.1.2.7 role1.player1
1.1.2.8 player1.attribute1
1.1.2.9 role1.scoper1
1.1.2.10 scoper1.attribute1
1.1.3 Act Relationships
1.1.3.1 Act.relationship1
1.1.3.2 relationship1.attribute1
1.1.3.3 relationship1.attribute2
1.1.3.4 relationship1.act2
1.2 Act2
 ... and so on

The focus in the HL7 RIM is on the acts being documented, and this documentation structure keeps the attention on the acts, and doesn't exceed a nesting depth of 4, so I was pretty happy.

So I began writing the stylesheet.  After about 45 lines, I had everything by the recursion back to Act all working.  There were 6 templates in the whole thing.  One called Act, another Participation, another Role, another Entity and a final one called ActRelationship.   I knew I was going to have to break cycles in my traversal, because sections can have sections and so on.  So I wrote a little bit of code to detect the cycles, and the first thing that happened was a stack crash.  After a few moments adding some <xsl:message> elements, I was able to see that my cycle detection only worked for cycles of length two (after all, I knew there weren't any longer cycles in the R-MIM).

As it turns out, I was wrong.  The R-MIM designed allows for something called a Choice box in which you can include multiple model elements.  Then you can attach relationships from the choice box back to itself.  This is such a common thing in HL7 V3 that there's even a special shape for it in the tools.  The problem there is that if you have such a choice box that can go back to itself, the cycle length increases.  Say you had two items, A and B, and a link between them through C.  Now instead of having to detect this loop:  ACA, you also have to detect ACBCA and BCACB.  The more items you put in the choice box, the longer your loop detection has to detect loops for.

OK, so I tried the next best thing which was, as I iterated over things and went down the stack, to pass a list of where I'd been.  Well, the problem with that (besides being clunky), was that it just didn't work.  There were too many pathways through the cycles, and you could get into a real mess.  Now I was really stuck and it was around 10pm.  I recalled a post by Michael Kay (author of the XSLT Programmers Reference) somewhere about loop detection in XSLT, so I dug out my second edition.  No good.  It's described in the third edition, and he finally got the code right in the fourth edition (neither of which I have or needed until last night).  And he wrote it in XSLT 2.0, which is no good for me, because I'm still working in XSLT 1.0 (I know, I'm a glutton for punishment).

So after an hour of digging around trying to find a DFS or BFS (depth- or breadth-first search) implementation in XSLT I gave up, made some coffee and took a walk.  Then I went back to my desk, wrote the recursive BFS code down a couple of different ways, and then made it tail recursive.  Tail recursion is a good thing to do when you are writing in XSLT, because you can overcome a lot of limitations that way.  XSLT doesn't let you change the values of variables once you set them, but a tail recursive call lets you change the input parameters, and it can be optimized by a good XSLT parser into loop.

Here's the basic algorithm:

process(node)
{ ... whatever you want to do ... }


BFS(list todo, list done)
{
    if (!empty(todo)) {
      head = car(todo);
      tail = cdr(todo);
      newNodes = nodesReachableFrom(head);
      needsDoing = (newNodes - todo) - done;
      process(head);
      BFS(todo + tail, done + head);
   }
}


BFS(root, null);

You will note that I never modify a variable after setting it.

This is how it looks translated into XSLT.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
>
  <xsl:key name="node" use="." match="xs:complexType/@name"/>
  <xsl:key name="neighbors" 
    use="//xs:complexType[.//xs:element/@type = current()]/@name"
    match="xs:complexType/@name"/>

  <xsl:template match="/">
    <xsl:call-template name="BFS">
      <xsl:with-param name="todo"
         select="key('node','POQM_MT000001UV.QualityMeasureDocument')"/>
    </xsl:call-template>
  </xsl:template>
  
  <xsl:template name="process">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
  
  <!-- BFS generates an XML fragment containing the keys
    of elements which need to be processed in the order they
    should be handled based on a breadth-first search of the 
    tree represented 
  -->
  <xsl:template name="BFS">
    <!-- todo is a node-set() containing a list of all nodes that have 
      yet to be processed -->
    <xsl:param name="todo" select="/.."/>
    <!-- done is a node-set() containing a list of all nodes that have
      already been processed -->
    <xsl:param name="done" select="/.."/>

    <!-- If todo is empty, we don't do anything -->
    <xsl:if test='count($todo)!=0'>
      <!-- head is the first node in todo in document order -->
      <xsl:variable name="head" select="$todo[1]"/>
      <!-- tail is the rest of todo in document order -->
      <xsl:variable name="tail" select="$todo[position() != 1]"/>
      
      <xsl:variable name="reachable" select="key('neighbors',$head)"/>
      <xsl:variable name="needsDoing" 
        select="$reachable[not(. = ($todo|$done))]"/>

      <xsl:for-each select="$head">
        <xsl:call-template name="process"/>
      </xsl:for-each>

      <xsl:call-template name="BFS">
        <xsl:with-param name="todo" select="$tail|$needsDoing"/>
        <xsl:with-param name="done" select="$done|$head"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

This is pretty generalizable to any problem where you need to traverse a DAG without crossing links.  A couple of notes are in order:
<xsl:key> and key(N, V) are XSLT elements and functions that are designed to do element lookup by key values.  These are worthy tools to have in your kit if you do a lot of XSLT programming.  Larn em!
I've set up two keys, one called "node" which returns the @name attribute of the xs:complexType element that has the name you specify.  That's a very straight-forward use of key in XSLT.  However, the second one called "neighbors" is much trickier.  It uses the key mechanism to return the set of xs:complexType/@name values that are needed by the elements in the named complexType.  For what I'm doing, I'm only interested in elements, so this isn't as hard as it could have been.

The reason that these return the @name attributes instead of the xs:complexType element is because the node-set returned can then be used as an argument to the key() function.  I won't go into all the details. You can use other problem-specific logic to find the "neighbors" of your node.

The next bit is also tricky.  The todo variable is a node-set, that I'm treating as a list.  $todo[1] is the head of the list.  $todo[position() != 1] is everything but the head.  So I have a built in CAR/CDR functions (remember LISP?).

Finally, given that you have two lists of items, how do you select items in the first list that aren't also in the second list.  This is how you do that: $first[not(. = $second)].
Where most people go wrong is in writing: $first[. != $second].  Since . and $second are node-sets, they use node-set comparision logic.  X = Y is true for node-sets X and Y if the string values of any to nodes in X and Y are the same.  X != Y is true if there is a node in X whose string value is not = to the string value of a node in Y.  If you don't believe me, read the spec.  Anyway, I showed you this little trick a few weeks ago.

This works, but the output wasn't in the order I expected.  My list is really a priority queue.  The nodes in the node-set are processed in document order.  So, when I "remove" the first node from the list, I'm actually getting the first @name attribute that appears in the document.  That is NOT the order I'm adding them in however.  They get added in whatever order the Schema follows.

It's important to me that I process this stuff in BFS order, because it makes it easier to follow the documentation that way.  To fix that, I had to use another XSLT trick, and that was to keep the lists in document fragments, which I then converted to node-sets using the EXSLT node-set() extension.

BTW: It isn't clear to me yet whether BFS or DFS produces a better order for documentation, but either one can be made to work.

The final XSLT for creating my table of contents can be found here.

This took me about four hours to figure out.  It created more work for me because I was able to see what content was missing.  It also vastly improved the HQMF result because the content is generated from the HQMF artifacts, so I cleaned up a lot of naming errors introduced by a new version of the HL7 RIM and Datatypes (we were using an much older RIM and Datatypes R1.1 for HQMF Release 1.0).  And because I was able to fix those errors, I'm a lot happier about the ballot quality (although still not satisfied).  I'll probably feel better after I get some sleep.




0 comments:

Post a Comment