I got my start with HL7 CDA while working for a company that did extensive natural language processing. Before that, I spent nine years at a company that developed linguistic software (spelling, grammar, search enhancement) and licensed SGML- and XML-based content and publishing tools. One of the discussion topics at AMIA this year was how Natural Language Processing could be supported in CDA. I know that many have expressed interest in this area over the years, including some folks at the VA in Utah, others at the University of Utah, researchers at IBM who have written on it, and several vendors doing NLP-based processing for healthcare and working with CDA (almost all of whom are involved in the Health Story implementation guides these days).
I can tell you that CDA Release 2 is a good but not ideal fit for natural language processing tools, and I hope that CDA Release 3 will be somewhat better. It was pretty easy for me to take the NLP output of a system I had worked with and generate a CDA document, but CDA was not the solution of choice for direct NLP. It could be in R3, but not without some changes and extensions to the data types and the RIM.
Most NLP systems that work with electronic text do so by pipelining a number of annotation processes over it. Lower-level processes feed into higher-level ones, producing sequences of annotations as they go. Some of the first processes might segment text at natural boundaries, like paragraphs and sentences, based on punctuation and other cues. The next step is identifying word boundaries (and depending upon how you do it, the two can be done in either order, or simultaneously). Just identifying sentence boundaries would seem to be simple; after all, you can just check for periods, question marks, and exclamation points. Right?
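It isn't that simple, of course. As a small sketch (the clinical note below is made up), here is what the naive punctuation-only approach does to text that is full of abbreviations:

```python
import re

def naive_sentences(text):
    """The 'obvious' splitter: break after a period, question mark,
    or exclamation point that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

note = "Pt. seen by Dr. Smith at 10 a.m. for f/u. BP stable."
print(naive_sentences(note))
# A human reads two sentences here; the splitter finds five,
# breaking after "Pt.", "Dr.", "a.m.", and "f/u.".
```

Real segmenters use abbreviation lists, capitalization cues, and statistical models to avoid exactly these false breaks.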
This annotation can be kept in numerous ways, as can the text associated with it. Some systems keep the text whole and annotate it separately, using pointers into it. Others insert markup into the text (maybe even in XML format) carrying the additional data. There are many levels of analysis, and subsequent levels build on prior ones (and may even correct errors in them). So it isn't uncommon for the different layers of analysis to be stored as separate streams of content. Sometimes a mixed model is used, where some structural markup identifies word, sentence, and paragraph boundaries, but deeper analysis is stored separately.
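A minimal sketch of the standoff (pointer-based) approach, with structures invented purely for illustration: the source text is never touched, and each layer of analysis is just a list of offset-addressed annotations over it.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int    # character offset into the untouched source text
    end: int
    layer: str    # "token", "sentence", "concept", ...
    label: str

text = "No chest pain."

# Lower layers (tokens, sentences) and a higher layer (a negated concept)
# annotate the same text side by side; a later layer could correct an
# earlier one without rewriting anything.
annotations = [
    Annotation(0, 2, "token", "word"),
    Annotation(3, 8, "token", "word"),
    Annotation(9, 13, "token", "word"),
    Annotation(0, 14, "sentence", "declarative"),
    Annotation(3, 13, "concept", "chest pain, negated"),
]

concept = [a for a in annotations if a.layer == "concept"][0]
print(repr(text[concept.start:concept.end]))  # the span the concept covers
```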
So, NLP representations can annotate inline or separately. The CDA R2 model is a little bit of both: structural markup (sections, titles, paragraphs, content, tables) is inline, but clinical statements are stored in a separate part of the CDA document, and can (and often, but not always, do) point back to the original text to which they apply.
I highlight the term original text above advisedly, because certain HL7 data types support a reference back to the original text to which a concept has been applied (it is the originalText property of these data types). Two of these are the Concept Descriptor (CD, or coded concept) and Physical Quantity (PQ). That makes it very easy to tie some higher-level concepts or measurements back to the text in the document.
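A sketch of how that linkage works in CDA R2 XML (the fragment below is trimmed and stripped of namespaces for readability, and the SNOMED CT code is illustrative): the coded entry's originalText carries a reference whose value resolves to an ID in the narrative.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<section>
  <text>The patient reports <content ID="sym-1">shortness of breath</content>.</text>
  <observation>
    <code code="267036007" codeSystem="2.16.840.1.113883.6.96">
      <originalText><reference value="#sym-1"/></originalText>
    </code>
  </observation>
</section>""")

# Resolve the relative reference (#fragment) to the element whose ID matches.
value = doc.find(".//originalText/reference").get("value")
span = doc.find(".//*[@ID='%s']" % value.lstrip("#"))
print(span.text)  # the narrative text the code was derived from
```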
But other data types don't support that. Many numeric spans in text can be represented using PQ, but not all numbers in text are measurements of something (they may be ordinals), so PQ is not always appropriate. There are other important data types for which you want to link back to the original text, including dates (TS), URLs, numbers (INT, REAL), and names of people (PN), organizations (ON), places (ADDR), and things (EN). Unfortunately, neither Data Types R1 nor R2 supports identifying the original text associated with these parts. What is interesting is that some of the more esoteric HL7 data types like PN, TELECOM, and ADDR are more important in NLP than they are elsewhere in computing. Some slightly more complex data types like IVL_* also need to support text; I can think of at least 3-4 cases where you'd see those used in narrative text.
While I mentioned dates above, I didn't mention times. That's because I'm cheating. A time without a date is a measurement in some time unit (hours, minutes, seconds) from midnight. HL7 doesn't really have a "time" data type separate from that use of PQ, as far as I can tell. So a time can point back to original text (using PQ), but a date cannot.
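Under that reading, a time of day is just a quantity of time units measured from midnight. A small sketch (the dict stands in for a PQ's value and unit fields):

```python
from datetime import time

def time_of_day_as_quantity(t):
    """Represent a time of day as a physical quantity: seconds since
    midnight, i.e. a value plus a time unit rather than a time type."""
    return {"value": t.hour * 3600 + t.minute * 60 + t.second, "unit": "s"}

print(time_of_day_as_quantity(time(10, 30)))  # {'value': 37800, 'unit': 's'}
```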
So, to fully support NLP in CDA Release 3, we need to fix the data types in the next release so that just about all of them can point back to original text. For NLP use today, we can extend the XML representation of those types in either R2 or R3 with an extension element that supports pointing back to the original text.
Those pointers back to original text are relative URLs local to the current document, so they usually look something like #fragment_id, where "fragment_id" is the value of an ID attribute on an element surrounding the text being referenced.
But then you run into ambiguity problems in NLP. The classic sentence "Eats shoots and leaves" has two interpretations as written (and three if punctuation is known to be missing). Is "leaves" a verb or a noun? It could be either. So there are two alternate parses of the sentence: one treats [shoots and leaves] as a compound object, and the other treats the whole thing as a compound sentence with two components, [eats shoots] and [leaves]. So, how do you mark that up to support both parses? There's no way to interleave these two parses of the text with XML markup that signals both sets of boundaries at the same time. To do that well you'd need something like LMNL (pronounced "luminal"), which is probably better for NLP anyway, but let's stick with XML.
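Standoff offsets sidestep that limitation, because overlapping spans don't have to nest the way XML elements do. A sketch of both parses over the same string (offsets computed by hand):

```python
text = "Eats shoots and leaves."

# Parse A: eats [shoots and leaves]  -- "leaves" is a noun
parse_a = [(0, 4, "verb"), (5, 22, "compound object")]
# Parse B: [eats shoots] [leaves]    -- "leaves" is a verb
parse_b = [(0, 11, "clause"), (16, 22, "clause")]

# Both layers coexist over one untouched string, even though the spans
# (0,11) and (5,22) overlap without nesting -- impossible as inline XML.
for start, end, role in parse_a + parse_b:
    print(role, "->", repr(text[start:end]))
```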
There are ways to annotate the phrase so that you can point to non-sequential spans using a URL fragment identifier. That requires using a pointer scheme in the fragment identifier, such as the one defined by XPointer, or, separately, a simpler restriction described in a not-quite-IETF RFC for the xpath1() pointer scheme. Both schemes solve the multi-span problem quite nicely, and the xpointer() scheme is supported in a number of open source processors (even for .NET). So, as a way to point back into text, it solves the overlapping span problem pretty well, so long as you don't have ambiguity at lower levels than word boundaries.
So, we can identify the small stuff, but how can we then make clinical statements out of it? That's mostly easy. For Act, which is really the workhorse of the RIM, you have Act.text. Act.text can point to the narrative components that make up the clinical statement, and the clinical statement becomes an annotation on the narrative. But what about the other RIM classes? The two key ones that would be needed are Entity and ActRelationship, but Role is important too. If I write:
Patient Name: Keith W. Boone
I'd want to be able to point back to the text for the role and the entity. There's no RIM solution for that, so again, an extension would be needed to encompass Entity.text, Role.text, and ActRelationship.text, so that these things could be tied back to narrative.
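A sketch of what such an extension might look like on the wire (the namespace URI and element names are entirely hypothetical, not a published HL7 extension), mirroring what originalText/reference already does for CD and PQ:

```python
import xml.etree.ElementTree as ET

NS = "urn:example:cda-nlp-extension"  # hypothetical extension namespace

# An Entity-like element tied back to narrative via an extension
# element standing in for the missing Entity.text property.
entity = ET.fromstring("""
<playingEntity xmlns:ext="urn:example:cda-nlp-extension">
  <name>Keith W. Boone</name>
  <ext:text><ext:reference value="#name-1"/></ext:text>
</playingEntity>""")

ref = entity.find("{%s}text/{%s}reference" % (NS, NS))
print(ref.get("value"))  # the narrative span this entity annotates
```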
So, what will be better in CDA R3, if we are missing text in several RIM classes and originalText in data types? Well, if we go with XHTML to represent the narrative, and allow the XHTML div, table, ul, and ol tags to be treated as organizers (a section is an organizer), then we can divvy up the CDA document into two parts. The narrative is an ED data type, with an explicit mapping from certain XHTML element types used to structure the text into organizers. The other part holds the annotations on the text: the clinical statements, which can follow committee models without being broken up by the CDA document's content organizers.
There are two ways we could go here: inline or separate. Do we allow a clinical statement inside an XHTML text organizer to be a representative of what is said in that text? That hearkens back to the CDA R1 structure, and it would eliminate some of the need for Entity.text et al., but it may not be expressive enough for NLP requirements, and it would have some other issues with representations of committee models. So, I think there should be a component of the CDA document that contains the clinical statements, and the clinical statements point to the text.
That leaves another important problem: how to deal with the context conduction that was traditionally handled through the section structure of the document. That requires another post, and isn't really related to NLP processing.