Monday, August 1, 2011

Rethinking HL7 CDA Release 3.0

This blog post is for discussion at a future Structured Documents call on CDA Release 3.  It represents a departure from current thinking about CDA Release 3.0.  It's a continuation of my previous buzzword compliant post on the topic.

The point of departure is the idea of using a profile of XHTML as the content for the CDA Narrative block.  This proposal instead uses HTML5 as the CDA Narrative representation, with the microdata draft standard (already supported to some degree in several current web browsers) representing the machine readable data.

I will make an assertion, the proof of which appears below, which is that RIM models can be fully expressed in HTML5 Microdata.  HTML5 Microdata is just an implementation of property bags in HTML5, and property bags can be mapped onto any data model.  Trust me on this, at least for the rest of this article.  The proof, as a former colleague used to tell me, is just a matter of writing the code.

I would like to explore the benefits, and the restructuring of the CDA framework that I’m proposing in this post.

A Framework for Structured Healthcare Documentation
CDA is about Clinical Document Architecture.  This proposal uses HTML5 as a foundation for a Structured Document Architecture, and develops from that a foundation for a Clinical Document Architecture using an information model.  This foundation is based on a mapping of the CDA narrative content model to the HTML5 content model.
  1. The basic Level 0 structured document is simply a profile of HTML5.  It specifies a minimum set of tags that must be supported by an implementation, and the behavior of that implementation in the presence of tags that are not understood (just as HTML5 does).  Level 0 is a new level in the set of levels understood for CDA users.  It means that you just have the narrative content, and do NOT have any metadata.
  2. Level 1 structured documents introduce a set of metadata established first in the HL7 Structured Document Architecture (SDA) specification, with some appropriate modifications.  Level 1 structured documents describe the functional requirements needed to support document metadata.  A conforming “level 1” instance of a structured document functionally supports the required metadata of the SDA, but does NOT include any specific representation of that metadata.
    1. All documents have an overall subject, identity, and publication time, and may have one or more authors or authoring organizations.  The subject of the document may be a person, place, thing, time, location, theme or author.  The subject is what the document is about.  The typical clinical document is about a patient.  The typical package insert is about a drug.  The typical guideline is about how to treat a particular disease.   From the HL7 Structured Documents workgroup perspective, this shouldn’t matter.  It is a document; domain committees can address functionally what the requirements of the subject are.
    2. Documents themselves are either singular or composite.  A singular document has no articles, uses zero or more HTML5 section structures, and may contain header, footer or navigation components (e.g., a Table of Contents).  A composite document contains one or more articles.  Each article follows the rules for singular documents, including its functional requirements (e.g., with respect to subject, identity, etcetera).  This allows for collections of documents (e.g. a package of medical records) to be organized into one HTML5 document, and to be extracted and repackaged into another for a different purpose.  An HTML5 transform from article to document and from document to article is straightforward, since they would be defined as having identical requirements.
  3. An SDA Level 2 document would require metadata on not just documents, but also on sections.  That metadata would include for example, a code describing the content of the section.  
  4. An SDA Level 3 document would require additional metadata that provided clinical statements associated with the narrative in the document.  It would require conformance to SDA Level 2.  
Now for the traditional CDA Levels:
  1. A CDA level 1 document would require the presence of certain metadata (patient, author, document topic, et cetera) in addition to the metadata required of all documents.  But that presence would be defined functionally.
  2. A CDA Level 2 document would apply the requirements of SDA Level 2 to CDA Level 1.  It would add requirements for coding sections and providing metadata related to sections.  Again, that would be defined functionally.
  3. A CDA Level 3 document would apply the requirements of SDA Level 3 to CDA Level 2.  It would again add functional requirements.  This time, they would be that the document provide machine readable clinical statements from the appropriate domain model to represent the content of the narrative.
Information Models
Now to integrate information models.  Thus far, requirements on the presence of author metadata, patient metadata, et cetera, have all been defined functionally.  I can imagine four different information models that could support these functional requirements.  They are:

  • HL7 RIM based models; 
  • Green models (ala Green CDA), which are essentially domain specific models derived from a template metamodel, 
  • detailed clinical models, which are simply another domain specific model, and 
  • OpenEHR based models.  
Each of these models has different structures and rules.  Each of those different models can be used to meet the functional requirements for CDA Level 1-3 (or SDA 1-3).  What the heck, I could even use the CCR data model to meet those requirements.

As yet, we have not discussed wire format representations, just requirements and information models.  To get to representations, we need a microdata framework.  Most of that is already specified in the W3C microdata draft specification, but some profiling may be needed here.  There would need to be a mapping from RIM, Green, DCM, and OpenEHR models to an appropriate microdata representation.  I know how to handle RIM, and have some thoughts on how to handle data types R2, which I view as a two part specification:  Functional requirements on data types, and wire format representations in XML.  For microdata, I want to consider how to meet the functional requirements of the data types, and not necessarily to incorporate the XML representation.

Rules for defining microdata from XML Schemas
Some HTML5 elements are mapped directly to components of the information model.  In HL7, documents, sections, and text (and the subcomponents of text) would map quite well into HTML5.  Elements that don't map to HTML5 elements would be mapped to microdata.  Here are the basic rules for mapping from schemas (a model) into microdata.
  1. An element in the schema becomes a microdata item.  In XML Schema, element names and type names associated with them are described using the XML QNAME.  In microdata, a name or a type can be represented by a URL.  
  2. To transform a QNAME into a URL, one simply concatenates the namespace URI, a #, and the element or type name to create a URL that can be used in the microdata itemprop or itemtype attributes.
  3. An attribute becomes an item property.  The QNAME associated with the attribute is used as the property name and is transformed to a URL according to the rule above.
  4. Elements which don't have associated text in the narrative are represented using span tags containing no text, with appropriate microdata attributes, at the end of the appropriate section.
Thus, a  <cda:substanceAdministration> tag would appear in an HTML5 document as an item where itemprop='urn:hl7-org:v3#substanceAdministration' and itemtype='urn:hl7-org:v3#SBADM'.  Each attribute of the element becomes a new property using the attribute QNAME as the property name, and the value being the attribute value.  If the CDA element is an entry pointing to narrative text, then the text it points to in the document becomes the element where the metadata appears.  CDA entries that don’t have an association with narrative text are transformed to empty <span> or <meta> elements at the end of the section, in the order in which they appear.
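
As a rough illustration of the mapping rules above, here is a minimal Python sketch that turns a QNAME into a URL and renders one schema element as a microdata item.  It uses the itemscope, itemtype and itemprop attributes from the W3C microdata draft.  The function names, the choice of <meta> elements to carry attribute values, and the decision to put unqualified CDA attributes into the HL7 namespace are all my assumptions for illustration, not part of any HL7 specification.

```python
from html import escape

HL7_NS = "urn:hl7-org:v3"  # CDA namespace URI


def qname_to_url(namespace_uri, local_name):
    """Rule 2: namespace URI + '#' + local name."""
    return f"{namespace_uri}#{local_name}"


def element_to_microdata(namespace_uri, element_name, type_name, attributes, text=""):
    """Render one schema element as a microdata item on a <span>.

    `attributes` maps attribute local names to values. Each one becomes an
    item property (rule 3), carried here on a <meta> element (an assumption).
    `text` is the narrative text the entry points to; leave it empty for
    entries with no narrative (rule 4).
    """
    prop_name = qname_to_url(namespace_uri, element_name)
    item_type = qname_to_url(namespace_uri, type_name)
    properties = "".join(
        '<meta itemprop="{}" content="{}">'.format(
            escape(qname_to_url(namespace_uri, name)), escape(value)
        )
        for name, value in attributes.items()
    )
    return '<span itemprop="{}" itemscope itemtype="{}">{}{}</span>'.format(
        escape(prop_name), escape(item_type), properties, escape(text)
    )


# Hypothetical example entry; the narrative text is made up.
print(
    element_to_microdata(
        HL7_NS,
        "substanceAdministration",
        "SBADM",
        {"classCode": "SBADM", "moodCode": "EVN"},
        "Aspirin 81 mg daily",
    )
)
```

Keeping the QNAME to URL step in its own function means the same rule can be reused for element names, type names and attribute names alike.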

The Power of the Framework
Now, here is where I think the beauty of this lies:  A CDA document is defined based on certain functionality, and a singular standard (HTML5).  When we get into representation of content in the information model, every information model is connected to the clinical content in the same way.  The receiving system does not need to even understand the contained information model in order to compute with it, so long as the pieces doing the computing are aware.

There are still too many options for information models for the comfort of many people.  A number of folks are still very leery about the existence of both GreenCDA and CDA wire formats.  I’ve just added three other information models to those two (DCM, OpenEHR and CCR) and it could be worse.  It could support the CIMs created by the transfers of care work via the S and I Framework. How about Google Microdata definitions for People, Organizations, Products and Events?

We could even use what I call the blue model.  In the blue model, you have levels as well.  The very basic level is CDA level 0.  It contains just text in a <pre> tag.  This is basically worthless for anything other than displaying, or perhaps using a customized parse into a spreadsheet.  The next level puts some metadata on it using spans to wrap the different parts.  We keep adding levels until the necessary structure appears for full computability.  At some point, the pre tag can disappear and you have a CDA Level 3 with machine readable metadata and models.  Dump your blue-button text into the first, and what have you?  Something that everyone can read, but no computer can do anything with.  Keep moving the bar, and it becomes fully machine readable with well-defined semantics.

The power of this framework  is that it could support ANY model.  The problem with it is that it can support ANY model.  Why is this a good thing?  If you take it far enough, models can easily coexist.  I could have CDA and GreenCDA at the same time, or CCR and CDA.  I could even use the Google Microdata right alongside HL7 patient, organization, product and encounter models.  People are going to complain that this is ugly and not at all what standards are about.  We want a single international standard which makes everyone happy.  This suffers from so much optionality that nobody would be able to interoperate.  

Building the Standard from the Framework
But we haven’t built the standard yet, only a framework upon which we could build it.  My preferred approach would be to use RIM or a Green model for CDA.  But if the framework is right, we can share this infrastructure with others who want to use a different information model, and we would still be able to support some level of interoperability across all clinical documents.  At the very least, all these clinical documents would be human readable across the different models.  You might not understand the metadata present in the document, but you could at least read it.  For most browsers, this is exactly the behavior that we get today when we display an HTML5 document.

We still have too many options, so it is time to eliminate some of them.  As far as HL7 is concerned, for an HL7 CDA instance, the microdata must be based either on the RIM, or on a Green representation of a RIM model.  I can build a RIM microdata ITS in my sleep; the rules are that simple.  I cannot do this for Green models today, although I’m told it will get easier soon with tools from both HL7 and MDHT.  So that’s another research project.  I'd stick with RIM for now pending the outcome of the other project.  And I can use them both!

Data Types
I haven't yet addressed data types.  These would be specified as a separate microdata ITS.  The reason for doing that is that I want to experiment with two different data type representations.

The first representation uses the data types and their components as full-blown microdata objects.  The second uses simple strings as much as possible.  Many (not all) HL7 Data types have a literal form.  A PQ is represented as a real number followed by optional whitespace, followed by an optional UCUM unit expression, e.g., 350mg.  A time interval can be represented as [198705122000;198705122130]. 

  • Literal forms are human readable and can be readily parsed and understood by humans and computers.  
  • They are shorter than forms which require each component to be separately identified.
  • They may actually be easier for implementers to use and parse.  
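
To give a feel for how little code the literal forms need, here is a hedged Python sketch that parses the two examples above.  It is deliberately simplified: the PQ pattern accepts only plain decimal numbers, and the interval parser assumes a closed interval whose bounds are timestamps with minute precision (YYYYMMDDHHMM).  The function names are mine, and a real parser would have to cover the full literal grammar of the data types specification.

```python
import re
from datetime import datetime
from decimal import Decimal

# A number, optional whitespace, then an optional unit with no embedded spaces.
PQ_PATTERN = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*(\S+)?\s*$")

# Assumption: both interval bounds carry minute precision.
TS_FORMAT = "%Y%m%d%H%M"


def parse_pq(literal):
    """Split a physical quantity literal into (value, unit or None)."""
    match = PQ_PATTERN.match(literal)
    if match is None:
        raise ValueError(f"not a PQ literal: {literal!r}")
    value, unit = match.groups()
    return Decimal(value), unit


def parse_ivl_ts(literal):
    """Parse a closed time interval literal of the form [low;high]."""
    text = literal.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a closed interval literal: {literal!r}")
    low, separator, high = text[1:-1].partition(";")
    if not separator:
        raise ValueError(f"missing ';' in interval literal: {literal!r}")
    return datetime.strptime(low, TS_FORMAT), datetime.strptime(high, TS_FORMAT)


print(parse_pq("350mg"))
print(parse_ivl_ts("[198705122000;198705122130]"))
```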

Unfortunately, code values (concept descriptors in HL7 parlance) do not have literal forms, but I have a cheat for that based on how I teach the data types (and would welcome other approaches).  Most uses of CD are actually CE.  The basic representation of a code requires a codeSystem and code.  This is really just an identifier for a concept in a particular namespace (a vocabulary).  The HL7 II data type represents identifiers using the literal form root:extension.  I’d just adopt that for CD, and maybe add some syntactic sugar to support codeSystemName and displayName.  Qualifiers are microdata properties where the name is the qualifier name, and the value is the qualifier value.
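
A minimal sketch of that cheat in Python, assuming the literal is simply codeSystem:code and splitting at the first colon (OIDs never contain one).  The Code type and both function names are hypothetical, and the syntactic sugar for codeSystemName and displayName is left out because its shape has not been decided.

```python
from typing import NamedTuple


class Code(NamedTuple):
    """A coded value reduced to its two required parts."""
    code_system: str  # e.g. an OID identifying the vocabulary
    code: str         # the concept identifier within that vocabulary


def parse_cd_literal(literal):
    """Parse 'codeSystem:code', splitting at the first colon only."""
    code_system, separator, code = literal.strip().partition(":")
    if not separator or not code_system or not code:
        raise ValueError(f"not a codeSystem:code literal: {literal!r}")
    return Code(code_system, code)


def format_cd_literal(value):
    """Inverse of parse_cd_literal."""
    return f"{value.code_system}:{value.code}"


# LOINC (2.16.840.1.113883.6.1) code 8867-4, heart rate.
print(parse_cd_literal("2.16.840.1.113883.6.1:8867-4"))
```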

Having a data types R2 component Data Type ITS, and a literal Data Type ITS allows some experimentation to show which is easier to use, code for, et cetera.

Relationship to SAIF
This is either the sickest implementation of an HL7 SAIF architecture that I’ve ever seen, or the most beautiful.  I’m not sure yet.  We’ve defined SDA at four levels of conformance.  CDA adds functionally defined clinical document requirements to SDA at each of three levels (CDA Level 0 makes no sense).  The simple set of model mapping rules to microdata describes how to represent models in microdata in a general way.  Mapping from CDA Narrative content can be based on existing mappings with some additions to support new HTML5 features like section and hgroup elements.  The models could be just about anything, but we’ve noted two different possibilities as far as HL7 goes:  RIM and Green.  The data types could either be structured R2 data types, or literal form data types (I’m leaning this way).  The clinical content could come from any HL7 domain model, mapped to RIM and using the Microdata RIM ITS with either data type mapping.

This represents clean separations for each part of the specification, governed along appropriate lines: SDWG for document structure, domain workgroups for clinical content, MnM and ITS for modeling and data type representation.  The work is readily separated into pieces of the larger CDA Release 3.0 project along lines of responsibility.  The general framework for creating CDA also provides a supporting pattern and infrastructure for subsequent modeling exercises.  We could easily develop SPL, Order Sets, and computable clinical guidelines with appropriate constraints on A) metadata associated with the document, and B) models to represent the machine readable content.  All of these could reuse the Microdata RIM ITS and data types ITS.

Platform Support
The cool thing about microdata and HTML5 is that it is scriptable in a browser on multiple platforms.  A good set of scripts will allow someone using a Java or .Net based browser that supports it to use the Microdata DOM API to access the microdata content.  That content can be readily manipulated in a conforming browser to create a rich CDA (or SDA) editing platform, using soon-to-be off-the-shelf technologies.  Forget VMR.  If I can give you a CDA in HTML5 with microdata, you will already have a programmable platform from which you can obtain clinical data, and upon which you can create and implement clinical decision support rules.

Several other projects come to mind, many pretty easy to implement: 

  • A Microdata to JSON/JavaScript transformation tool is like falling off a log.  A competent engineer with the right tools could build this in a day.  
  • A microdata to XML representation tool is also pretty easy, but might take two or three days.  If we define Microdata property models using QNAME to URL transformations, then the only real challenge is figuring out which properties are represented as attributes rather than elements.  
  • Microdata Type definitions could include metadata (as microdata even) that assisted the transformation in determining whether to represent microdata properties as attributes or elements.  It would certainly be easier to convert component based data types to XML, rather than the literals, but either still wouldn't be that hard.  We'd probably have to define that in some way.
  • You could also add a transformation from HTML5 to the appropriate RIM XML representations for CDA, or any other SDA-based document.
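
To back up the claim that the first of these tools is a small job, here is a rough Python sketch of a microdata to JSON transformation built on the standard library's html.parser.  It is a simplification under stated assumptions: the markup is well-formed, each itemprop carries a single property name, property values come either from a content attribute or from the element's text, and itemref is ignored.  The output shape (type plus a properties map of value lists) and the sample markup, including the Section type name, are mine for illustration.

```python
import json
from html.parser import HTMLParser

# Elements that never have an end tag (only the ones likely to matter here).
VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}


class MicrodataExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.top_level_items = []
        self._open = []  # one frame per currently open element

    def _current_item(self):
        """Return the nearest enclosing item, or None outside any item."""
        for frame in reversed(self._open):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        item = None
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
        frame = {"tag": tag, "item": item, "prop": attrs.get("itemprop"),
                 "parent": self._current_item(),
                 "content": attrs.get("content"), "text": []}
        if tag in VOID_TAGS:
            self._close(frame)  # no end tag will follow
        else:
            self._open.append(frame)

    def handle_startendtag(self, tag, attrs):
        # Covers the XML-style <meta ... /> spelling.
        self.handle_starttag(tag, attrs)
        if tag not in VOID_TAGS:
            self.handle_endtag(tag)

    def handle_data(self, data):
        for frame in self._open:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._close(self._open.pop())

    def _close(self, frame):
        if frame["item"] is not None:
            value = frame["item"]
        elif frame["content"] is not None:
            value = frame["content"]
        else:
            value = "".join(frame["text"]).strip()
        if frame["prop"] is not None and frame["parent"] is not None:
            frame["parent"]["properties"].setdefault(frame["prop"], []).append(value)
        elif frame["item"] is not None:
            self.top_level_items.append(value)


def microdata_to_json(html_text):
    parser = MicrodataExtractor()
    parser.feed(html_text)
    parser.close()
    return json.dumps({"items": parser.top_level_items}, indent=2)


# Hypothetical sample markup following the QNAME to URL rules.
SAMPLE = """
<section itemscope itemtype="urn:hl7-org:v3#Section">
  <span itemprop="urn:hl7-org:v3#substanceAdministration" itemscope
        itemtype="urn:hl7-org:v3#SBADM">
    <meta itemprop="urn:hl7-org:v3#moodCode" content="EVN">
    Aspirin <span itemprop="urn:hl7-org:v3#doseQuantity">81 mg</span> daily
  </span>
</section>
"""
print(microdata_to_json(SAMPLE))
```

Because property names are already URLs built from QNAMEs, the JSON produced this way keeps enough information to drive the XML transformation described in the second bullet.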

OK.  Now I’ve explored benefits.  What can shoot this down? 

Backwards compatibility
This representation is a dramatic departure from the current CDA representation.  Moving from CDA R2 to a microdata based R3 could be a challenge.  But just as there are many common XSLT transformations from CDA R2 to HTML, there are equally feasible transformations from CDA R2 to HTML5.  CDA R2 metadata is readily transformable to its microdata representation following rules similar to what I already described above.

A course alteration will delay CDA R3
While this is almost certainly true, I believe the separation of labor possible with this architecture could actually make it easier to manage, and could even be done sooner.

A movement into untested waters 
Yes, HTML5 is not fully baked.  But many of its features are already supported in today’s browsers.  We have an opportunity now, while HTML5 is still being baked, to experiment with it and provide feedback to the W3C to make sure that it does get baked with any features we discover we need.

Dependency on someone else’s standards 
HTML5 and Microdata are being developed by the W3C and the WHATWG.  I don’t see this as an issue; rather, it is a huge benefit.  The principles of HTML5 will be understood by the next generation of IT professionals by the time it is ready.  You can buy books from O’Reilly today on HTML5.  When you use general IT standards to solve healthcare IT problems, you can take advantage of the much broader market for those standards to get implementations.  XDS has 10 open source implementations because the underlying infrastructure is already widely available in open source.  As far as I can tell, it is the single most widely adopted HIE technology, and I expect it is because of that link to general IT.  Building CDA Release 3.0 on top of HTML5 could have an even greater benefit.


  1. I've only been working with CDA (CCD specifically) for a month now, but what I'd really like to see is a reference implementation to get data into / out of these documents. Parsing xml is tedious and painful. I think dealing with JSON would be easier, but HTML5 with a javascript api provided as a reference implementation would be pretty sweet too. Do you have any thoughts on reference implementations of CDA Implementations?

    1. Scott, not sure how far you've progressed on your effort. For this kind of work, clearly a slog, we've created a library here at Mirth we call CDAPI. It builds on the fantastic work out of the MDHT project and creates a DocumentFactory of sorts that can consume most CDA based documents into a JavaModel (our Mirth ClinicalDataModel or CDM) and can likewise take our CDM, however that was populated, and generate different valid CDA based documents (C32, C37, C62, etc).

      Drop a note if you're interested...

  2. Not sure if you have the time Keith (I don't or I'd volunteer). However, it'd be really useful to see the existing CDA R2 example (or even just some portions of it) expressed in this new approach. That would give people something more concrete to evaluate.

  3. I have an old CRS/XDS-MS example that I can do this with pretty well. Should be pretty easy to automate.

  4. Cecil said...


    And there are some people out there working on writing the microdata to OWL transforms, so this would be another option for machine processing that goes a step further in making the model and data appropriate for decision support.

    The snippet below shows the owl:sameAs object property in OWL 2.0 expressed in microdata.

    <div class="related" item="item">
    <link itemprop="about" href=""></link>
    <div class="arcrole">sameAs</div>
    <a itemprop="" title="Virtual International Authority File" href=""></a>

  5. Cecil:

    Yep. Interestingly enough, Google's microdata definitions are also described in RDFa format.


  6. Hi Keith - I cross posted this link to Arien Malec's google+ thread... you might want to look into Jeni Tennison's writeup on ways to harmonize Microdata and RDFa, at least as far as capturing the intent behind RDFa. It's worth digesting; as is Hickson's associated commentary.