Monday, August 15, 2011

Documents vs. Data

In "The XML Consensus is breaking down" Grahame Grieve distinguishes three camps: the heavy engineering crowd, the internet mob, and the data dictionary crowd.  He discusses how XML seems to be failing to bring these crowds together.

I've worked with structured documents and natural language processing for a long time, probably twice as long as I've been in healthcare.  What I find interesting in healthcare is the nature of the information being used.  Just for fun, I'm going to look at it from a document-oriented perspective, since that's what I've spent most of my life working with.

From a document perspective, a dictionary, phone book or thesaurus is almost ALL data.  Every last dot and twiddle can be expressed as either discrete data, or relationships to other data.  In the healthcare world, this is like a drug database, or a vocabulary like RxNorm, SNOMED® CT, LOINC®, ICD or CPT.  In XML (or SGML), there's a lot of markup expressing relationships between content, as compared to text to display to the end user.

Clinical notes, especially ED, inpatient or progress notes, involve quite a bit of storytelling.  Sections like "History of Present Illness" or "Hospital Course" tell a story about what led up to a visit, or what happened afterwards.  The relationships are fuzzy and ill-defined, often hard to identify.  There is data here, but it has a great deal of subtlety.  It's hard to pin down, or to compute with (you will note that I said hard, not impossible).  There are little pockets of low-hanging fruit.  You can readily identify diseases or medications in the text, treatments or diagnostics.  The medical profession uses language precisely (some would say "like a scalpel"), which makes it a lot easier to identify this data in the narrative.  But just because the rest cannot be easily identified doesn't mean it's not there.  It's just represented in a way that you cannot easily fit into third normal form.
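To make the "low-hanging fruit" concrete, here is a minimal sketch in Python of picking known terms out of narrative text.  The tiny lexicon and the find_terms function are hypothetical, invented for this illustration; a real system would draw its terms from a vocabulary like the ones mentioned above, and would need far more than string matching.

```python
import re

# Hypothetical mini-lexicon, invented for illustration only.
# A real system would load terms from a controlled vocabulary.
LEXICON = {
    "aspirin": "medication",
    "metformin": "medication",
    "pneumonia": "disease",
    "chest x-ray": "diagnostic",
}

def find_terms(text):
    """Return (offset, term, kind) for each lexicon term found in text."""
    hits = []
    lowered = text.lower()
    for term, kind in LEXICON.items():
        # Word boundaries keep "aspirin" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((match.start(), term, kind))
    return sorted(hits)

note = "Patient with pneumonia was started on aspirin after a chest X-ray."
for offset, term, kind in find_terms(note):
    print(offset, term, kind)
```

The sketch finds the terms, but not the story around them.  It would flag "pneumonia" just as happily in "no evidence of pneumonia", which is exactly the kind of subtlety that makes the narrative hard to compute with.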

Interspersed throughout the narrative are larger chunks of discrete data, results from examinations and assessments that can be finely structured, collections of measurements, like vital signs.  These notes are like entries in an almanac.  There are broad swaths of narrative, interspersed with finely structured text dense with data:  Flag, Population, GDP; Blood Pressure, Temp, O2 Saturation.  This stuff is designed to be put into tables, both on the screen and underneath it.
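As a rough sketch of that almanac-like mix, the Python below parses a small note in which one section is narrative and another is finely structured.  The element and attribute names (note, section, narrative, measurement) and the sample values are made up for this example; they are not taken from CDA or any other real standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and sample values, invented for illustration.
NOTE = """
<note>
  <section title="History of Present Illness">
    <narrative>Patient reports three days of cough and low-grade fever.</narrative>
  </section>
  <section title="Vital Signs">
    <measurement name="Blood Pressure" value="120/80" unit="mmHg"/>
    <measurement name="Temp" value="37.2" unit="Cel"/>
    <measurement name="O2 Saturation" value="98" unit="%"/>
  </section>
</note>
"""

def split_note(xml_text):
    """Separate the storytelling from the table-ready data."""
    root = ET.fromstring(xml_text.strip())
    # Narrative stays as text meant for a human reader.
    narrative = [n.text.strip() for n in root.iter("narrative") if n.text]
    # Measurements drop straight into rows of a table.
    table = [
        (m.get("name"), m.get("value"), m.get("unit"))
        for m in root.iter("measurement")
    ]
    return narrative, table

narrative, table = split_note(NOTE)
print(narrative)
for row in table:
    print(row)
```

The measurements fall into rows with almost no effort, while the narrative comes back as a string that still has to be read.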

The electronic health record is a collection of data.  Some of it is stored in a format suitable for humans to read, and other parts of it are stored in a way that makes it easier for software developers to compute with.  It's all data; the question really is what tools you use to access it.  The EHR is one such tool.  From the physician perspective, it ought to be a digital library of data, making it easy to access information about a given patient, disease, medication, treatment, diagnostic, clinical trial, outbreak, et cetera.  Having accessed it, now display it, analyze it, and otherwise compute with it.  That's a lot to ask for from one application.  Mapping, charting, searching, keeping up with current events.  Heck, that's a lot to ask for from one standard.

Not everything that is data is easily stored or computed.  Not every stream of data need be stored in human readable form.  Yep, XML might be the answer, but I think there are a heck of a lot more questions.

I used to work with almanacs, thesauruses, dictionaries and encyclopedias as structured text.  Each one had rich markup that worked very well inside the book.  In other words, the markup of a single publication worked well with data structures designed for that publication.  But getting them to work together across publications was a very different, and difficult, prospect.  We still struggle with this in the "semantic web" today.  It shouldn't be a big surprise that the XML representations across these different viewpoints of healthcare might each work well from one perspective, but not from the others.
