Wednesday, January 25, 2012

A Hex upon all your charsets

Today's post results from a question someone posed to me about how to render XDS-SD content using an XSLT stylesheet.  If you've tried to do this for application/pdf content, you've probably already discovered that there is no way* to use an XSLT stylesheet to render your content on either the server or the client side.

The critical issue for this morning's querent was how to render text/plain content, and that can be done (see this post), provided that you have a way to access a procedural language from your XSLT stylesheet (which almost all production-level XSLT processors do).  The solution is platform specific, but the approach is generally applicable.  In fact, if you are using Xalan, the solution works across most Java implementations and Xalan versions.  A C#/.NET solution is certainly possible along the same lines as the one I suggest.
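
To make that concrete, here is a rough sketch (not the code from the post linked above) of the kind of static Java helper a Xalan stylesheet could call as an extension function.  The class name, method name, hard-coded payload character set, and use of Java 8's java.util.Base64 are all my own choices, not anything the profile specifies:

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    /**
     * Hypothetical helper for turning the base-64 payload of nonXMLBody/text
     * back into a string.  From a Xalan stylesheet it could be bound with
     * something like xmlns:b64="xalan://example.Base64Text" and invoked as
     * b64:decode(string(.)).
     */
    public class Base64Text {
        public static String decode(String base64Payload) {
            // The MIME decoder tolerates the line breaks and whitespace that
            // typically appear in base-64 content inside an XML document.
            byte[] raw = Base64.getMimeDecoder().decode(base64Payload);
            // ASSUMPTION: the payload was written as UTF-8.  As the rest of
            // this post explains, nothing in the document actually tells you.
            return new String(raw, StandardCharsets.UTF_8);
        }
    }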

But there is a challenge here, and that is determining the base character set of the text/plain content.  When that content is purely ASCII, it is likely to work on any system, because almost all character sets use the same first 128 characters as ASCII does (the main exceptions are EBCDIC, which you may never encounter, and UTF-16 or its older sibling UCS-2).  The problem only occurs when someone base-64 encodes text that contains extended characters (like any of these Å æ ç é í ñ ö ß û).  At that point, the character set becomes critical for correct rendering.

Let's take Å for example.  In ANSI/ISO-8859-1/Windows Code Page 1252, this character is Hex C5, and is encoded in a single byte.  In UTF-8, this character is encoded in two bytes, the first being Hex C3, and the second being Hex 85.  In UTF-16 or UCS-2, this character is encoded in two bytes, the first being Hex 00, and the second being Hex C5 (or Hex C5 followed by Hex 00, depending on byte order).

These bytes will render in interesting ways on systems that aren't handling them right.  For example, the UTF-8 sequence C3, 85 will show up as Ã… on a system expecting ANSI/Windows Code Page 1252 (in strict ISO-8859-1, the second byte is an unprintable control character).
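
If you want to see this for yourself, here is a small Java sketch (the class name is mine, and it assumes a JRE that includes the windows-1252 charset, which typical JREs do).  It prints the bytes of Å in each encoding and then mis-reads the UTF-8 bytes as "ANSI":

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CharsetDemo {
        public static void main(String[] args) {
            String s = "\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE (Å)

            System.out.println("ISO-8859-1: " + toHex(s.getBytes(StandardCharsets.ISO_8859_1))); // C5
            System.out.println("UTF-8:      " + toHex(s.getBytes(StandardCharsets.UTF_8)));      // C3 85
            System.out.println("UTF-16BE:   " + toHex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 C5

            // Mojibake: take the UTF-8 bytes and read them as Windows-1252 ("ANSI").
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("UTF-8 bytes read as Windows-1252: "
                    + new String(utf8, Charset.forName("windows-1252")));  // prints A-tilde plus ellipsis (Ã…)
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            return sb.toString().trim();
        }
    }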

So this matters for display purposes, and you need to know what character set the text/plain data was represented in.  How do you figure this out?  The first thing to do is look at the IHE XDS-SD profile, but it doesn't say (because that's a property of the HL7 CDA standard that IHE didn't profile in XDS-SD).  nonXMLBody/text is of the ED data type in HL7.  When we look there, we see that the data type has a charset property, so all we need to do is get that, and we're all set, right?  Wrong.

Unfortunately, while the abstract data type covers it, the XML ITS (the XML implementation of the ED data type) does not.  It says:
charset is not explicitly represented in the XML ITS. Rather, the value of charset is to be inferred from the encoding used in the XML entity in which the ED value is contained.
The problem with this statement is that it is entirely true when the ED data type is not base-64 encoded, but false when it is.  At that point, the encoding of the text/plain content is entirely independent of the character encoding of the XML document.  In fact, the character encoding of the XML document can change independently and without any disruption of the XML content (because XML is Unicode, no matter how the document is transmitted).  But the base-64 encoded text/plain content will not change with it, because the payload has already been reduced to base-64 characters.  The encoding of those base-64 characters can change along with the rest of the document, but that has no effect on the bytes they decode to, or on the character set those bytes were originally written in.  Confused?  Yeah, me too.
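
Here is a small sketch of what that means in practice, reusing the Å example from above (the choice of candidate charsets is mine): the same base-64 string decodes to the same bytes no matter how the surrounding XML was serialized, and those bytes only turn back into the right text if you already know what character set they were written in.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class PayloadIndependence {
        public static void main(String[] args) {
            // "Å" written as UTF-8 and then base-64 encoded, as it might appear
            // in nonXMLBody/text whether the XML itself is UTF-8 or UTF-16.
            String payload = Base64.getEncoder()
                    .encodeToString("\u00C5".getBytes(StandardCharsets.UTF_8));
            System.out.println("base-64 payload: " + payload);   // w4U=

            byte[] raw = Base64.getDecoder().decode(payload);    // always C3 85

            // Only one of these guesses is right, and nothing in the XML says which.
            System.out.println("as UTF-8:        " + new String(raw, StandardCharsets.UTF_8));
            System.out.println("as ISO-8859-1:   " + new String(raw, StandardCharsets.ISO_8859_1));
            System.out.println("as Windows-1252: " + new String(raw, Charset.forName("windows-1252")));
        }
    }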

My hope was that this would be corrected in the Data Types R2 ITS and/or Data Types R1.1, but apparently it's been missed entirely.

My advice for now:  If you are a content producer of XDS-SD documents using text/plain content, use only ASCII characters in your content if you can.  I'll bet more than 95% of the documents using that format do that today in any case.  The long-term solution will be for IHE to apply a CP (change proposal) to XDS-SD that says that for text/plain, the content must be stored in the _____ encoding, where the blank will be filled in after a good bit of discussion.  I would expect UTF-8 to be a strong contender.  And in HL7 we need to fix the ITS, because charset on ED needs to be present when the content is base-64 encoded.

If you are a content consumer, when you decode the text, look for any bytes outside the ASCII range (anything above Hex 7F).  If there are some present, you are going to have to guess at the encoding.  If not, you can just treat it as ASCII.
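
A minimal sketch of that consumer-side check, assuming you already have the decoded bytes in hand (the fallback to UTF-8 is my own guess, not anything the profile tells you to do):

    import java.nio.charset.StandardCharsets;

    public class AsciiCheck {
        /** Returns true if every byte is in the 7-bit ASCII range. */
        public static boolean isPureAscii(byte[] decoded) {
            for (byte b : decoded) {
                if ((b & 0xFF) > 0x7F) {
                    return false;
                }
            }
            return true;
        }

        public static String toText(byte[] decoded) {
            if (isPureAscii(decoded)) {
                return new String(decoded, StandardCharsets.US_ASCII);
            }
            // Non-ASCII bytes are present, so the character set is a guess.
            // UTF-8 is a reasonable first try, but it is still only a guess.
            return new String(decoded, StandardCharsets.UTF_8);
        }
    }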

In the meantime, all I have to say is ?@*#$%~!

  -- Keith

* Actually, there is a way described here, and it uses the data: URL scheme described in RFC 2397, but this is not supported across all platforms and media types (IE, for example, does not support data: URLs for all objects, only images).
