Wednesday, April 6, 2011

Using XPointer in CDA URI references

I've written previously of use cases where the HL7 CDA specification supports Natural Language Processing capabilities.  One of the challenges in natural language processing is that you can have conflicting parses of the same sentence structure, resulting in overlapping spans of text which have alternative representations.  There are also cases where non-contiguous spans of text provide evidence for information (especially in sentences using lists and conjunctions).  I ran into this problem previously in the XML world, but fortunately for me, I worked in the office next door to one of the guys writing a W3C standard that would help.  It is a problem that is quite adequately solved these days using the XPointer standard.  XPointer is essentially an extension of the W3C XPath language.  It allows a URI fragment identifier (you know, that part of the URI after the hash sign, #) to be written as an XPath expression, and it adds some extensions of its own to XPath.

To link an observation element in the CDA document to the text which generated it using NLP (another use case for linking entries to content), you would create an XPointer expression which pointed to the text that provided the evidence for it.  An example appears below.

Patient denies alcohol or tobacco use.

In this example, only the first and last parts of the sentence provide evidence for the specific observation "Patient denies tobacco use".  The middle words are NOT part of the textual evidence supporting that observation.  A second observation on the denial of alcohol use would draw on different parts of the text:

Patient denies alcohol or tobacco use.

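There is no way to mark up the spans of text simultaneously to support both observations using spans enabled by the CDA content tag and simple URI fragment identifiers.  One possible markup that could work using XPointer is:

<content ID="id1">Patient denies</content> <content ID="id2">alcohol</content> or <content ID="id3">tobacco</content> <content ID="id4">use</content>.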

The XPointer expressions become #xpointer(#id1|#id3|#id4) and #xpointer(#id1|#id2|#id4).  A developer recently reported a problem with this example to me: when the XPointer expression was included in the XML, his XML parser complained that it violated the rules for the anyURI data type.  Specifically it reported:

error: cvc-datatype-valid.1.2.1: '#xpointer(XPointer expression)' is not a valid value for 'anyURI'.
error: cvc-attribute.3: The value '#xpointer(XPointer expression)' of attribute 'value' on element 'reference' is not valid with respect to its type, 'url'.

So, what is wrong here?

The problem is fairly straightforward and has to do with W3C Schema constraints on anyURI.  Most specifically, the production for anyURI requires certain special characters appearing in parts of a URI to be escaped according to RFC 2396.  The specific rules for escaping characters can be found in the W3C XLink specification in the section on link-locators.  Appropriately escaping the special characters in the XPointer expression will remove the error.
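
To see which characters trip the check, here is a rough Python sketch that compares the fragment against the characters RFC 2396 permits unescaped in a fragment (its "uric" production).  The function name illegal_fragment_chars is made up for this illustration, and this is only an approximation of what a schema validator does, not a reimplementation of one.

```python
import re
import string

# Characters RFC 2396 allows unescaped in a fragment ("uric"):
#   reserved   = ; / ? : @ & = + $ ,
#   unreserved = letters, digits, and the marks - _ . ! ~ * ' ( )
RESERVED = ";/?:@&=+$,"
MARKS = "-_.!~*'()"
ALLOWED = set(string.ascii_letters + string.digits + RESERVED + MARKS)


def illegal_fragment_chars(reference):
    """Hypothetical helper: list the characters in the fragment that need escaping."""
    # Everything after the first '#' is the fragment identifier.
    _, _, fragment = reference.partition("#")
    # Well-formed %XX escapes are legal, so drop them before checking.
    unescaped = re.sub(r"%[0-9A-Fa-f]{2}", "", fragment)
    bad = []
    for ch in unescaped:
        if ch not in ALLOWED and ch not in bad:
            bad.append(ch)
    return bad


print(illegal_fragment_chars("#xpointer(#id1|#id3|#id4)"))
print(illegal_fragment_chars("#xpointer(%23id1%7c%23id3%7c%23id4)"))
```

The first call reports the inner hash-marks and the vertical bars; the second, already escaped, reports nothing.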

In the examples above, the special characters are the hash-marks (#) inside the XPointer expression and the pipe-characters or vertical-bars (|).  The parentheses are actually OK.  So # must be replaced with %23 and | must be replaced with %7c in order to make the fragment identifier legal as far as the production rules for anyURI are concerned.  The new values (now escaped fragment identifiers rather than raw XPointer expressions) are: #xpointer(%23id1%7c%23id3%7c%23id4) and #xpointer(%23id1%7c%23id2%7c%23id4).  These may be ugly but should work.
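
The two substitutions are simple enough to do by hand, but here is a minimal Python sketch of them.  The helper names build_xpointer and escape_fragment are invented for this example, and the sketch assumes the reference is a bare fragment that starts with a single #.

```python
def build_xpointer(ids):
    """Hypothetical helper: join element IDs in the style used in this post."""
    return "#xpointer(" + "|".join("#" + i for i in ids) + ")"


def escape_fragment(reference):
    """Hypothetical helper: escape '#' and '|' everywhere except the leading '#'.

    Assumes the reference is a bare fragment beginning with '#'.
    """
    head, body = reference[0], reference[1:]
    return head + body.replace("#", "%23").replace("|", "%7c")


tobacco = build_xpointer(["id1", "id3", "id4"])
alcohol = build_xpointer(["id1", "id2", "id4"])

print(escape_fragment(tobacco))  # #xpointer(%23id1%7c%23id3%7c%23id4)
print(escape_fragment(alcohol))  # #xpointer(%23id1%7c%23id2%7c%23id4)
```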

Most programming languages provide a language-specific function to escape text appearing in a URL (e.g., URLEncoder.encode(String) in Java or HttpUtility.UrlEncode in .NET).  I recommend using those functions to escape anything after the #xpointer string in the URL fragment, as they should return a correctly escaped string.  If you let the function escape the initial # as %23, you will have a different problem, because that character is what identifies the following text as being part of the URL fragment identifier.
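
As an illustration in Python (not one of the languages mentioned above), the standard library's urllib.parse.quote plays the same role.  This is a sketch under a couple of assumptions: the helper name escape_reference is made up, and passing safe="()" is my choice to leave the parentheses alone so the output resembles the strings shown earlier.  Note that quote emits upper-case hex digits (%7C rather than %7c), which is equivalent.

```python
from urllib.parse import quote


def escape_reference(reference):
    """Hypothetical helper: escape only the text after the first '#'."""
    base, sep, fragment = reference.partition("#")
    if not sep:
        # No fragment identifier at all, so there is nothing to escape here.
        return reference
    # Keep the separator '#' as-is and percent-encode the rest,
    # leaving parentheses untouched (an assumption made for readability).
    return base + "#" + quote(fragment, safe="()")


raw = "#xpointer(#id1|#id3|#id4)"

# Escaping only the fragment body keeps the leading '#'.
print(escape_reference(raw))

# Escaping the whole string turns the leading '#' into %23,
# so the value is no longer recognized as a fragment identifier.
print(quote(raw, safe="()"))
```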
