
Friday, November 19, 2010

Natural Language Processing and CDA

I got my start with HL7 CDA while working with a company that did extensive natural language processing.  Before that, I worked for 9 years with a company that developed linguistic software (spelling, grammar, search enhancement) and licensed SGML- and XML-based content and publishing tools.  One of the discussion topics at AMIA this year was Natural Language Processing and CDA, and how NLP could be supported in CDA.  I know that many have expressed interest in this area over the years, including some folks at the VA in Utah, others at the University of Utah, others at IBM who have written on it, and several vendors doing NLP-based processing for healthcare and working with CDA (almost all of whom are involved in the Health Story implementation guides these days).

I can tell you that CDA Release 2 is a good but not ideal fit for Natural Language Processing tools, and I hope that CDA Release 3 will be somewhat better.  It was pretty easy for me to take the NLP output of a system I had worked with and generate a CDA document, but CDA was not the solution of choice for direct NLP.  It could be in R3, but not without some changes and extensions to the data types and the RIM.

Most NLP systems working with electronic text operate by pipelining a number of annotation processes over the text.  Lower-level processes feed into higher-level ones, producing sequences of annotations as they go.  Some of the first processes segment text at natural boundaries, like paragraphs and sentences, based on punctuation and other cues.  The next step identifies word boundaries (and actually, depending upon how you do it, the two can be done in either order or simultaneously).  Just identifying sentence boundaries would seem to be simple; after all, you can just check for periods, question marks, and exclamation points. Right?
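To see why "just check for periods" fails, here is a minimal sketch of the naive rule; the sample sentence is my own illustration, not from any real system:

```python
import re

def naive_sentences(text):
    # The naive rule: split wherever sentence-final punctuation
    # is followed by whitespace.
    return [s for s in re.split(r'(?<=[.?!])\s+', text) if s]

# Abbreviations defeat the rule: "Dr." ends with a period too,
# so the splitter produces a spurious sentence boundary.
print(naive_sentences("Dr. Smith saw the patient. BP was 120/80."))
# -> ['Dr.', 'Smith saw the patient.', 'BP was 120/80.']
```

Real segmenters need abbreviation lists, numeric-context rules, and often statistical models to get this right.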

These annotations can be kept in numerous ways, as can the text associated with them.  Some systems keep the text whole, and annotate separately by using pointers into it.  Others insert markup into the text (maybe even in XML format) with additional data.  There are many levels of analysis, and subsequent levels build on prior ones (and may even correct errors in prior ones).  So it isn't uncommon for the different layers of analysis to be stored as separate streams of content.  Sometimes a mixed model is used, where some structural markup identifies word, sentence, and paragraph boundaries, but deeper analysis is stored separately.
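The "keep the text whole, annotate with pointers" approach is usually called standoff annotation. A minimal sketch (the layer names and labels are illustrative, not any particular system's):

```python
# Standoff annotation: the text stays untouched; each analysis layer
# is a separate list of (start, end, label) spans pointing into it.
text = "Pupil was dilated."

layers = {
    "tokens":   [(0, 5, "WORD"), (6, 9, "WORD"), (10, 17, "WORD")],
    "concepts": [(0, 5, "body-part"), (10, 17, "finding")],
}

def span_text(start, end):
    # Resolve a span back to the characters it annotates.
    return text[start:end]

for start, end, label in layers["concepts"]:
    print(label, "->", span_text(start, end))
```

Because layers only point into the text, a later pass can add or correct annotations without disturbing earlier layers or the text itself.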

So, NLP representations can annotate inline or separately.  The CDA R2 model is a little bit of both: structural markup (sections, titles, paragraphs, content, tables) is inline, but clinical statements are stored in a separate part of the CDA document, and can (and often, but not always, do) point back to the original text to which they apply.

I highlight the term original text above advisedly, because certain HL7 data types support a reference back to the original text from which a concept was derived (the originalText property of these data types).  Two of these are the Concept Descriptor (CD, or coded concept) and Physical Quantity (PQ).  That makes it very easy to tie some higher-level concepts or measurements back to the text in the document.

But other data types don't support that.  Many numeric spans in text can be represented using PQ, but not all numbers in text are measurements of something; they may be ordinals, so PQ is not always appropriate. There are other important data types for which you want to link back to the original text, including dates (TS), URLs, numbers (INT, REAL), and the names of people (PN), organizations (ON), places (AD), and things (EN).  Unfortunately, neither Data Types R1 nor R2 supports identifying the original text associated with these parts.  What is interesting is that some of the more esoteric HL7 data types like PN, TEL, and AD are more important in NLP than they are elsewhere in computing.  Some slightly more complex data types like IVL_* also need to support text; I can think of at least 3-4 cases where you'd see those used in narrative text.

While I mentioned dates above, I didn't mention times.  That's because I'm cheating.  A time without a date is a measurement in some time unit (hour, minute, second) from midnight.  HL7 doesn't really have a "time" data type separate from that use of PQ, as far as I can tell.  So a time can point back to original text (using PQ), but a date cannot.

So, to fully support NLP in CDA Release 3, we need to fix the data types in the next release so that just about all of them can point back to original text.  For NLP use today, we can extend the XML representation of those types in either R2 or R3 with an extension element that supports pointing back to the original text.

Those pointers back to original text are relative URLs local to the current document, so they usually look something like #fragment_id, where "fragment_id" is the value of an ID attribute on an element surrounding the text being referenced.
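Resolving such a pointer is just an ID lookup. A minimal sketch, using a stripped-down CDA-like narrative (the element names and IDs here are illustrative, not a conformant CDA fragment):

```python
import xml.etree.ElementTree as ET

# A tiny CDA-style narrative with an ID attribute on a content element.
narrative = ET.fromstring(
    '<text><content ID="a1">pupil dilated</content> on exam</text>'
)

def resolve_fragment(root, url):
    # "#a1" -> find the element whose ID attribute is "a1" and
    # return the text it wraps.
    frag = url.lstrip('#')
    for elem in root.iter():
        if elem.get('ID') == frag:
            return ''.join(elem.itertext())
    return None

print(resolve_fragment(narrative, '#a1'))
# -> pupil dilated
```

An originalText reference in an entry would carry the same "#a1" value, tying the coded concept back to exactly this span of narrative.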

But then you run into ambiguity problems in NLP.  The classic sentence "Eats shoots and leaves" has two interpretations as punctuated (and three if punctuation is known to be missing).  Is leaves a verb or a noun?  It could be either.  So there are two alternate parses of the sentence: one treats [shoots and leaves] as a compound object, and the other treats the whole thing as a compound sentence with two components, [eats shoots] and [leaves].  So, how do you mark that up to support both parses?  There's no way to interleave these two parses of the text with XML markup that signals both sets of boundaries at the same time.  To do that well you'd need something like LMNL (pronounced luminal), which is probably better for NLP anyway, but let's stick with XML.
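Standoff spans sidestep the problem, because overlapping bracketings that XML cannot nest are just two independent layers over the same text. A sketch (labels are illustrative):

```python
text = "Eats shoots and leaves"

# Two alternate parses stored as standoff layers over one text.
# Inline XML could not mark both at once: (0,11) and (5,22) overlap
# without nesting.
parse_a = [("V", 0, 4), ("NP", 5, 22)]     # Eats | shoots and leaves
parse_b = [("VP", 0, 11), ("VP", 16, 22)]  # Eats shoots | leaves

def spans(parse):
    # Resolve each (label, start, end) span to its surface text.
    return [(label, text[start:end]) for label, start, end in parse]

print(spans(parse_a))
print(spans(parse_b))
```

Each layer stays internally well-formed; the conflict only exists if you try to serialize both as nested markup in the text itself.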

There are ways to annotate the phrase so that you can point to non-sequential spans using a URL fragment identifier.  That requires using a pointer scheme in the fragment identifier, such as the one defined by XPointer or, separately, a simpler restriction described in a not-quite-finalized IETF RFC for the xpath1() pointer scheme.  Both schemes solve the multi-span problem quite nicely, and the xpointer() scheme is supported in a number of open source processors (even for .NET).  So, as a way to point back into text, this solves the overlapping span problem pretty well, so long as you don't have ambiguity at levels lower than word boundaries.

So, we can identify the small stuff, but how can we then make clinical statements out of it?  That's mostly easy.  For Act, which is really the workhorse of the RIM, you have Act.text.  Act.text can point to the narrative components that make up the clinical statement, and the clinical statement becomes an annotation on the narrative.  But what about the other RIM classes?  The two key ones that would be needed are Entity and ActRelationship, but Role is important too.  If I write:
     Patient Name: Keith W. Boone
I'd want to be able to point back to the text for both the role and the entity.  There's no RIM solution for that, so again, an extension would be needed to add Entity.text, Role.text, and ActRelationship.text, so that these things could be tied back to narrative.

So, what will be better in CDA R3, given that we are missing text in several RIM classes and originalText in several data types?  Well, if we go with XHTML to represent the narrative, and allow the XHTML div, table, ul, and ol tags to be treated as organizers (a section is an organizer), then we can divvy up the CDA document into two parts.  The narrative is an ED data type, with an explicit mapping from certain XHTML element types used to structure the text into organizers.  The other part holds annotations on the text: the clinical statements, which can follow committee models without being broken up by the CDA document content organizers.

There are two ways we could go here: inline or separate.  Do we allow a clinical statement inside an XHTML text organizer to be representative of what is said in that text?  That hearkens back to the CDA R1 structure, and it would eliminate some of the need for Entity.text et al., but it may not be expressive enough for NLP requirements, and would have some other issues with the representation of committee models.  So, I think there should be a component of the CDA document that contains the clinical statements, and the clinical statements point to the text.

That leaves another important problem, which is how to deal with context conduction, traditionally handled through the section structure of the document.  That requires another post, and isn't really related to NLP processing.


  1. Loved it!

    My guess is that external pointers have a strategic advantage in that they would make it easier to support the hypothetical possibility that two scans over the text would produce overlapping fragments. For example, an ambiguous phrase could have two different parses, such as (the) (fruit) (flies) and (the) (fruit flies).

    This outcome may be more likely to arise as some now parse voice to facts directly, rather than voice to text to facts.

  2. Interesting post.
    Another problem is how to refer to disparate content elements in the Narrative Block of the Section.
    An NLP engine could e.g. reference different tokens in the Narrative Block (I used rectangular brackets):

    [content id="a1"]Eye pupil[/content] of the patient was [content id="a2"]dilated[/content]

    [observation classCode="OBS" moodCode="EVN"]
    [code code="164022008" codeSystem="2.16.840.1.113883.6.96" codeSystemName="SNOMED CT" displayName="ON EXAMINATION - PUPIL DILATED"]
    [reference value="#a1;#a2"/]

    The originalText, being of data type ED, does not allow two reference elements. I'm pretty sure the above reference value is not a valid XML reference.

  3. See the 5th paragraph from the bottom.

    But for an example try:

    [reference value="#xpointer(#a1|#a2)"]

    That uses the xpointer() Scheme from the XPointer Framework.

  4. Thanks for the info on XPointer!
    If there is enough support for it, I'll use it.

    Back to the NLP CDA discussion.
    CDA was just plain not designed to be used for NLP. The problem is that by trying to make CDA everything for everyone it will be harder to use it as middleware.

    Cluttering the NarrativeBlock with annotation tags will render it totally unreadable to humans, and human readability is what I think it was designed for in the first place.

  5. That's why I think the annotations (Clinical Statements) need to be in a different component.


  6. With annotations I actually meant NLP information.
    You can't put them in the Clinical Statement, and they would clutter the text block.

    For example, when you talk about 'first pass' low-level NLP annotations, an NLP engine could tag some of the free text in my example ("eye pupil of the patient was dilated") using a SNOMED CT lexicon. Another 'pass' could then do a grammatical analysis (where do we keep that information in CDA?).
    Let us say the 'first pass' gets 'eye pupil' and 'dilated' out of it. Combining the grammatical parse and the post-coordination articulations of SNOMED CT, the NLP engine could come up with a post-coordinated 'dilatation of the eye pupil' SNOMED CT code.

    That's a lot of extra information to stick into the CDA document.

  7. If you want to put more detailed annotations for NLP in the CDA document that wouldn't be part of the final content, I'd put those in the value element of an Observation in the Clinical Statement, using the ED data type. That way you could use whatever XML markup you wanted for those details.

    What I'm trying to address as requirements are what you'd generate for the Final CDA rather than the interim processing steps.

  8. Thanks for the background. This is something which a lot of people look to as a panacea, but don't realize how hard it is.

    One of the biggest problems, and part of why I favor getting more and more structured data entry by physicians, is that we have a culture in medicine of saying things that have zero information content.

    Phrases like "cannot rule out the possibility that X is not present" don't convey information.

    Negation too can be critical (e.g. "the patient doesn't have any signs or symptoms suggestive of Ebola"), but you need to be able to pair up the negation phrase, many words away, with the actual meaning. Wendy Chapman has an (open source!) NegEx tool which helps.

    We also need to optimize the NLP tools based on the document type and document sections. Here, having well-defined document content helps (with more detail in the LOINC document ontology indicating which sections documents may/should/shall contain).

    I did some work with MMTX several years ago when I was still working on a PhD. I found that configuration options made huge differences in overall performance, time to complete, accuracy, etc.

    I also looked at separate processing rules for different sections (after normalizing dictations from a dozen different facilities into a common document structure). It was clear that some additional section-specific processing would be needed, particularly in the review of systems and physical exam sections. Some people dictated like: "pulmonary symptoms: negative" or "cardiovascular symptoms: negative" etc., others like: "negative for cardiovascular or pulmonary symptoms", and some went into specifics: "Patient denies chest pain, syncope, dyspnea, and cough".

    There is also a whole separate discussion of what "heart exam: normal" v. "heart exam: regular rate and rhythm, no gallops, rubs or murmurs" means.

    A big part of this is training physicians out of some of the bad habits. There was an editorial in the NEJM in the late 1960's on the "not uncertain language of medicine", and we still have it decades later.

    ROS and PE are actually a lot quicker to do using a good computer UI. I think we may want to consider having multiple options for clinicians: do structured data collection, but also allow narrative when needed.

    What do we do with the results? Do they go back to the clinician for review? Do they get packed into the SNOMED CT Finding (or Procedure) with Explicit Context, so that things like qualification (possibly present, probably absent) and context are preserved (please!)? Do they get added to the CDA as an annotation?

    Right now, NLP has shown a lot of promise, especially for very specific things (is the patient a smoker?). Overall, though, I haven't seen results that get a whole H&P correct. This is why I favor encouraging clinicians to state a differential diagnosis, and to use a small set of existential qualifiers (q.v. the SNOMED CT finding context values), which lets them hedge, but only within prescribed limits. Getting more structure even in the dictation (e.g. using dictation templates) will help a good deal.

    The reward is twofold. One: more structured and more expressive documentation means better reimbursement, as fewer E&M codes get downcoded due to lack of documentation. Two: we have to feed data back to physicians so they can look at their own results. The latter is probably as big a motivator as the first. Get physicians (and other clinicians) interested in, and using, the data they generate, and suddenly structured documentation goes from being a chore to being something they "get" and will support.

    There are some very smart people working on how to deal with this unstructured content, and I hope (and pray) that they can crack this problem open. I still think the hardest part is going to be how we teach physicians to speak, esp. when it comes to expressing their opinions in reports / consults. Plenty of work to be done.