Monday, September 21, 2015

A long long time ago ... (XML from Word part 2)

... continuing from XML from Word.

To begin your unflattening, you will have to prepare a piece of data to explain what the structure of the final output needs to look like.  If you are simply unflattening HTML or Word using heading numbers, this is fairly straightforward.  If your document has a good bit more style and structure, you may need to do a bit more work.  Assuming you have a good XML Editor (just about any decent one can do this next step), you should be able to produce an XML Schema from a sample XML document.  The schema will suck, looking something like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="body">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element ref="p"/>
        <xs:element ref="h1"/>
        <xs:element ref="h2"/>
        <xs:element ref="h3"/>
        <xs:element ref="h4"/>
        <xs:element ref="h5"/>
        <xs:element ref="h6"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
   ,,,
</xs:schema>

Take the table of <xs:element> names and put them into another file somewhere, and add attributes that indicate the nesting level for each element, like this:
<table>
  <element ref="body" level="0"/>
  <element ref="h1" level="1"/>
  <element ref="h2" level="2"/>    
  <element ref="h3" level="3"/>    
  <element ref="h4" level="4"/>    
  <element ref="h6" level="5"/>    
  <element ref="h6" level="6"/>    
  <element ref="p" level="7"/>    
</table>
This table basically assigns a nesting level (or precedence) to each element name, so that you (or software) can figure out the nesting level.

Where the magic comes in is how I use it next to apply the structure.  You can do this sort of processing of a list of elements really easily in Java or JavaScript or C++ if you understand how to write parser for a language whose parse tree can be described with operator precedence.  But if you want to do this using XSLT, you'll need a lot of research, or a really twisted brain to figure this out. Fortunately for you, I just spent the last week in Portland, so my brain is already twisted after three days of Evidence Based Medicine at OHSU ;-).

To make this stylesheet work, you are going to need to run two passes over your XML (or have two separate stylesheets.  The first pass simply adds an attribute to each element that assigns it the precedence level from the previous document, and then turns this result tree into a node-set (via the EXSLT node-set extension function) and sends it to the next phase.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exslt="http://exslt.org/common" 
  extension-element-prefixes="exslt" version="1.0">

  <xsl:output indent="yes" method="xml"/>
  <xsl:variable name="prec" select="document('precTable.xml')"/>

  <xsl:template match="/">
    <xsl:variable name="pass1">
      <content>
        <xsl:for-each select="content/*">
          <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:attribute name="text">
              <xsl:value-of select="."/>
            </xsl:attribute>
            <xsl:attribute name="_level">
              <xsl:value-of select="$prec/table/element[@ref=local-name(current())]/@level"  />
            </xsl:attribute>
          </xsl:copy>
        </xsl:for-each>
      </content>
    </xsl:variable>
    <xsl:apply-templates select="exslt:node-set($pass1)/content/Tabular" mode="process"/>
  </xsl:template>

What this does is basically run through each element child of <content> and add an _level attribute to that element.  It gets the element by finding it in the /table/element list, looking for one whose @ref attribute matches the name of the element.  Why do I do this step?  Locality of reference for the next ugly bit.  Basically, this is an optimization that makes the next optimization really shine.  My file has 35000 lines.  The algorithm that you might figure out for yourself in XSLT (if you can twist your brain around it) runs on the order of O(n3).  On my first attempt at an algorithm, I was looking for the children of each parent.  That lookup that I preprocess would be needed 42 trillion times if not preprocessed, and it doesn't run quickly since it is essentially a linear search.  Even with the optimized version below, this lookup is best not repeated if you can precompute it, so I do.

The algorithm I finally figured out after failing several times runs a lot faster. I estimate it is around O(n log n). I owe Jeni Tennison a beer if I ever see her again (and Steve Meunch), because I wouldn't have figured it out were it not for her post on his algorithm.

What I realized was that each element in the file has can have unique key computed which identifies its parent, and that key can be expressed in XSLT as the unique identifier of the first preceding sibling of that element whose level in the hierarchy is lower that then of the element.  You declare this in XSLT using the following line:

  <xsl:key name="parent" match="*"
    use="generate-id(preceding-sibling::*[@_level &lt; current()/@_level][1])"/>

Then, these next two templates do the magic restructuring:

  <xsl:template match="/" mode="process">
    <content>
      <xsl:apply-templates select="/content/*[1]"/>
    </content>
  </xsl:template>

  <xsl:template match="*" mode="process">
    <xsl:copy>
      <xsl:copy-of select="@*[local-name()!='_level']"/>
      <xsl:apply-templates mode="process" select="key('parent',generate-id())"/>
    </xsl:copy>
  </xsl:template>

That's a remarkably short bit of code for the magic it performs!  The first template simply kicks things off.  For each element the next template processes, it makes a copy of the XML (using xsl:copy and xsl:copy-of, and then inserts the content of all of the nodes which claim (through the parent) key to be its direct children.  If you instrument the output with <xsl:message> elements as I did when I first ran it, you'll see a BIG pause (at least if you run a 35000 line file through it), and then magically, everything will come out in a great big WHAM!

What is happening here is that first pause is the indexing stage, where the XSL process goes: "OK, he really does mean to use the parent key, I better go make an index." (Yes, I tend to anthropomorphize software).  Then it identifies every node (all 35000) of them, and executes the XPath expression in the use attribute.

generate-id(preceding-sibling::*[@_level &lt; current()/@_level][1])

That XPath expression says: for each preceding child whose level is less than mine, take the first one. Most XSLT processors are smart about any expression which ends in the pattern [number], especially when number is 1, or the expression last().  That usually means that the expression can be computed more efficiently and short circuited once the first item is found.  The indexing step likely has average case execution time of O(n log n).  Each element generates an index key.  The elements are found in O(n).  At the deepest layer, their are O(n) nodes, and it takes a constant time to find their parent. Their are O(log(n)) layers in the tree, and it takes approximately the same amount of time to compute their parent(less actually for balanced trees of breadth X [each node containing X children]).  The recursive processing step is O(n) once the index is precomputed. Putting all that together give O(n log n), which finally made this work without a week of processing time.

The real trick here was instead of trying to find the children of each node, turning the problem on its head and finding the parent of each child.  That is what makes the whole algorithm simple.

How does this apply to standards?  The file I was processing was a vocabulary table written in a giant Word document.



0 comments:

Post a Comment