Wednesday, November 30, 2011

Loading I2B2 from CDA Documents

As part of my evaluation of models for I2B2, hQuery and CIM, I decided to map from CDA Release 2.0, using the business rules applied in C83 and the CDA Consolidation guide, to the I2B2 Star Schema.  The point of this exercise is to show how an I2B2 data repository would be populated from a collection of CDA documents, and as a result, build the mapping between the I2B2 model and C32 (which also leads to hQuery, since its model is based on the C32).  While I've based this work on C32 and the CDA Consolidation project, the rules are general enough that they can be applied to a variety of different CDA documents, and those documents need not conform to the templates for those guides.  The I2B2 Data Repository design documentation (pdf) was essential to this work, and I wish I'd had it when I started on my SQL proof of concept.  Oh well, I'll have to go back and rework that one later, and it's my fault for not catching up on the summer concert listening.

Here's a table showing my initial mappings.  The first column indicates the I2B2 fact or dimension table.  The second column indicates the field.  The third is an XPath expression giving either the context for the table (for table heading rows), or the data element (relative to the table context element) that appears within the table field.  XPath expressions using the cda: namespace identifier can be found in the CDA schema.  Those with the rim: namespace identifier represent extensions defined by HL7 SDWG on behalf of HITSP to represent the field.  The last column describes either the table or the field within the table, based on the I2B2 documentation.

To load a set of CDA documents, one would iterate over each document, stopping at the table context points, and create a row of data using the field specifications.  Then each fact or dimension table would be loaded from the unique rows produced.  This is an overly simplified description of the algorithm (table load order is important for referential integrity), and there are a lot of other details I'll get into later.  First, let's look at the (somewhat simplified) mapping:

Table   Field CDA I2B2 Definition
Observation cda:act|
cda:observation| cda:substanceAdministration|
In healthcare, a logical fact is an observation on a patient. It is important to note that an observation may not represent the onset or date of the condition or event being described, but instead is simply a recording or a notation of something. For example, the observation of ‘diabetes’ recorded in the database as a ‘fact’ at a particular time does not mean that the condition of diabetes began exactly at that time, only that a diagnosis was recorded at that time (there may be many diagnoses of diabetes for this patient over time)
Encounter ID ancestor-or-self::cda:*[@classCode='ENC'] patient visit number
Patient ID //cda:patientRole/cda:id patient number
Concept Code @classCode or cda:code Code for observation of interest (i.e. diagnoses, procedures, medications, lab test)
Provider ID ancestor-or-self::cda:*[@typeCode='AUT' or @typeCode='PRF'][1]/cda:*/cda:id Practitioner id or provider id
Start/End Date Range cda:effectiveTime Starting and ending date-time of observation
Modifier (computed) Code for modifier of interest (i.e. “ROUTE”, “DOSE”), note that value columns are often used to hold the amounts such as “100” (mg) or “PO”
Instance ID cda:id Encoded instance number that allows more than one modifier to be provided for each concept_cd. Each row will have a different modifier_cd but a similar instance_num.
Value Type cda:value/@xsi:type Format of the concept
N = Numeric
T = Text (enums/short messages)
B = Raw Text (notes/reports)
NLP = NLP result text
Value cda:value
Location Code ancestor-or-self::cda:*[@typeCode='LOC']/cda:*[@classCode='SDLOC']/cda:id A location code, such as for a clinic
Patient //cda:patientRole Each record in the patient_dimension table represents a patient in the database. The table includes demographics fields such as gender, age, race, etc. Most attributes of the patient dimension table are discrete (i.e. Male/Female, Zip code,
Patient ID cda:id
Vital Status (computed) Contains a code that represents the vital status (alive or dead) of the patient and the precision of the vital status data.
Birth Date cda:patient/cda:birthTime
Death Date cda:patient/rim:deceasedTime
Gender cda:patient/
Age (computed)
Language cda:patient/
Race cda:patient/cda:raceCode
Marital Status cda:patient/
Religion cda:patient/
Zip Code cda:addr/cda:postalCode
StateCityZipCode cda:addr/(cda:state|cda:city|cda:postalCode)
Provider //(cda:author|cda:performer) Each record in the provider_dimension table represents a physician or provider at an institution. The provider_path is the path that describes how the provider fits into the institutional hierarchy. Institution, department, provider name and a code may be included in the path
Provider ID cda:id
Provider Name cda:name
Encounter //cda:*[@classCode='ENC'] The visit_dimension table represents sessions where observations were made. Each row represents one session (also called a visit, event or encounter.) This session can involve a patient directly, such as a visit to a doctor’s office, or it can involve the patient indirectly, as in when several tests are run on a tube of the patient’s blood. More than one observation can be made during a visit. All visits must have a start date/time associated with them, but they may or may not have an end date. The visit record also contains specifics about the location of the session, such as the hospital or clinic where the session occurred, and whether the patient was an inpatient or outpatient at the time of the visit.
Encounter ID cda:id
Patient ID ancestor-or-self::cda:*[@typeCode='SBJ' or @typeCode='RCT']/cda:*/(cda:id|rim:id)[1]
Active Status cda:statusCode
Start/End Date cda:effectiveTime
Encounter Type Code cda:code
Location Code ancestor-or-self::cda:*[@typeCode='LOC']/cda:*[@classCode='SDLOC']/cda:id

Now for some comments on it...

Concept Codes
You'll need to look at both the "act" classCode attribute, and maybe the code element within the act, and map that to the I2B2 ontology to figure out how to populate the concept code.
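
As a rough illustration of that lookup, here is a minimal Python sketch.  The "PREFIX:code" form of the concept code and the OID_TO_PREFIX table are my assumptions for illustration only; a real I2B2 ontology defines its own concept codes, and the function name concept_code is hypothetical.

```python
import xml.etree.ElementTree as ET

CDA_NS = {"cda": "urn:hl7-org:v3"}

# Assumed prefixes for illustration; the target I2B2 ontology decides the real ones.
OID_TO_PREFIX = {
    "2.16.840.1.113883.6.1": "LOINC",
    "2.16.840.1.113883.6.96": "SNOMED",
    "2.16.840.1.113883.6.88": "RXNORM",
}


def concept_code(act):
    """Return a hypothetical concept_cd for a CDA act element."""
    code = act.find("cda:code", CDA_NS)
    if code is not None and code.get("code"):
        system = code.get("codeSystem", "UNKNOWN")
        # Unknown code systems keep their OID as the prefix.
        prefix = OID_TO_PREFIX.get(system, system)
        return f"{prefix}:{code.get('code')}"
    # No usable code element: fall back to the act's classCode.
    return f"CLASS:{act.get('classCode')}"


SAMPLE = """<observation xmlns="urn:hl7-org:v3" classCode="OBS" moodCode="EVN">
  <code code="2345-7" codeSystem="2.16.840.1.113883.6.1"/>
</observation>"""

print(concept_code(ET.fromstring(SAMPLE)))
```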

Modifier Codes
In I2B2, a single fact can have multiple parts.  Each part of the fact is identified by the Instance identifier, and the part being represented (e.g., the medication, dose, route or frequency for a medication) can be separately represented.  In CDA, the "fact" is represented by one of the basic "act" classes, and the properties of that class represent each of the fields.  So, some acts will need to be represented as several facts (e.g., medications), while others (e.g., a lab result), will just be represented as a single fact.  This shouldn't be too hard to understand.
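
To make the one-act-to-several-facts idea concrete, here is a small Python sketch that splits a substanceAdministration into rows sharing one instance number.  The row layout, the "@" modifier for the base fact, the placeholder drug code and the function name medication_facts are assumptions for illustration, not taken from a real loader.

```python
import xml.etree.ElementTree as ET

NS = {"cda": "urn:hl7-org:v3"}

# Synthetic fragment; "12345" is a placeholder drug code, not a real one.
SAMPLE = """<substanceAdministration xmlns="urn:hl7-org:v3" classCode="SBADM" moodCode="EVN">
  <id root="1.2.3" extension="med-1"/>
  <routeCode code="PO"/>
  <doseQuantity value="100" unit="mg"/>
  <consumable><manufacturedProduct><manufacturedMaterial>
    <code code="12345" codeSystem="2.16.840.1.113883.6.88"/>
  </manufacturedMaterial></manufacturedProduct></consumable>
</substanceAdministration>"""

DRUG_CODE = "cda:consumable/cda:manufacturedProduct/cda:manufacturedMaterial/cda:code"


def medication_facts(sbadm, instance_num):
    """Split one medication act into several fact rows with a shared instance_num."""
    concept = sbadm.find(DRUG_CODE, NS).get("code")

    def row(modifier, value=None, units=None):
        return {"concept_cd": concept, "instance_num": instance_num,
                "modifier_cd": modifier, "value": value, "units": units}

    facts = [row("@")]  # base fact; "@" as the "no modifier" marker is assumed here
    route = sbadm.find("cda:routeCode", NS)
    if route is not None:
        facts.append(row("ROUTE", route.get("code")))
    dose = sbadm.find("cda:doseQuantity", NS)
    if dose is not None:
        facts.append(row("DOSE", dose.get("value"), dose.get("unit")))
    return facts


for fact in medication_facts(ET.fromstring(SAMPLE), instance_num=1):
    print(fact)
```

A lab result would go through the same kind of function but yield only the base row.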

Value and Value Type
I2B2 has four different basic value types.  CDA has a few more that need to be mapped into the SQL tables.  Also, I2B2 has different columns in which each value type is placed.
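
A sketch of one possible mapping follows, in Python.  Which CDA data types land in which I2B2 value type is my guess for illustration; the column names (valtype_cd, tval_char, nval_num, units_cd, observation_blob) follow my reading of the I2B2 observation_fact documentation, and the function name value_columns is hypothetical.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def value_columns(value):
    """Map a cda:value element onto assumed I2B2 value columns."""
    # Drop any namespace prefix on the type name (e.g. "cda:PQ" becomes "PQ").
    xsi_type = value.get(XSI_TYPE, "").split(":")[-1]
    if xsi_type in ("PQ", "INT", "REAL"):
        # Numeric: the operator goes in tval_char ("E" for equals), the number in nval_num.
        return {"valtype_cd": "N", "tval_char": "E",
                "nval_num": float(value.get("value")),
                "units_cd": value.get("unit")}
    if xsi_type in ("CD", "CE", "CV"):
        # Coded values are stored as short text.
        return {"valtype_cd": "T", "tval_char": value.get("code")}
    if xsi_type == "ST":
        return {"valtype_cd": "T", "tval_char": (value.text or "").strip()}
    if xsi_type == "ED":
        # Longer narrative goes to the blob column.
        return {"valtype_cd": "B",
                "observation_blob": "".join(value.itertext()).strip()}
    raise ValueError(f"no mapping assumed for xsi:type {xsi_type!r}")


SAMPLE = ('<value xmlns="urn:hl7-org:v3" '
          'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
          'xsi:type="PQ" value="100" unit="mg/dL"/>')

print(value_columns(ET.fromstring(SAMPLE)))
```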

Location Codes [sic]
In the I2B2 schema, location codes really identify specific locations, and so are identifiers, not codes.  Thus my mapping to cda:id for a specific location.  Locations are set in the document context for each observation, and apply unless overridden later in the document (a rare occurrence).

A CDA document is "documentation of" an "encompassing encounter".  Usually, what is recorded in the document with respect to the encounter and its location applies to everything in the document (it's part of the context of the document).  That could be overridden subsequently in the document, indicating that the fact was a component of a different encounter that had a different location participant, but again, that is usually not the case.


Usually, the "author" of the document is also the performing provider, but again, that can be overridden with a performer participant in the encounter (there are several types of performers as well).
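
The "applies unless overridden" rule can be sketched as a walk from an entry up through its ancestors, taking the first participant found.  This Python sketch is one reading of the rule, applied to a synthetic fragment; the helper name nearest_in_context is hypothetical, and because the standard library's ElementTree has no ancestor axis, it builds a parent map instead of evaluating the XPath expressions from the table directly.

```python
import xml.etree.ElementTree as ET

NS = {"cda": "urn:hl7-org:v3"}
LOCATION_ID = ("cda:participant[@typeCode='LOC']"
               "/cda:participantRole[@classCode='SDLOC']/cda:id")

SAMPLE = """<encounter xmlns="urn:hl7-org:v3" classCode="ENC" moodCode="EVN">
  <participant typeCode="LOC">
    <participantRole classCode="SDLOC"><id extension="clinic-A"/></participantRole>
  </participant>
  <entryRelationship typeCode="COMP">
    <observation classCode="OBS" moodCode="EVN"><id extension="obs-1"/></observation>
  </entryRelationship>
  <entryRelationship typeCode="COMP">
    <observation classCode="OBS" moodCode="EVN"><id extension="obs-2"/>
      <participant typeCode="LOC">
        <participantRole classCode="SDLOC"><id extension="clinic-B"/></participantRole>
      </participant>
    </observation>
  </entryRelationship>
</encounter>"""


def nearest_in_context(elem, parents, path):
    """Return the first match for path on elem or its closest ancestor."""
    node = elem
    while node is not None:
        found = node.find(path, NS)
        if found is not None:
            return found
        node = parents.get(node)  # None once we step past the root
    return None


root = ET.fromstring(SAMPLE)
# ElementTree elements do not know their parent, so index child -> parent once.
parents = {child: parent for parent in root.iter() for child in parent}

resolved = {}
for obs in root.iter("{urn:hl7-org:v3}observation"):
    location = nearest_in_context(obs, parents, LOCATION_ID)
    resolved[obs.find("cda:id", NS).get("extension")] = location.get("extension")

print(resolved)
```

The same walk would serve for encounters and for author or performer participants, with a different path.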

So, if you wanted to load a CCD document into an I2B2 data repository, this is enough to get you started.
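
The loading procedure described before the table (walk each document, stop at the table context points, build a row from the field specifications, keep the unique rows) might be sketched in Python like this.  Only two tables and a handful of fields are shown; the MAPPINGS structure, the field_value helper and the sample document are assumptions for illustration, and the paths are simplified to what the standard library's ElementTree can evaluate.

```python
import xml.etree.ElementTree as ET

NS = {"cda": "urn:hl7-org:v3"}

# table -> (context path, {field: path relative to the context element})
MAPPINGS = {
    "patient_dimension": (".//cda:patientRole", {
        "patient_id": "cda:id",
        "birth_date": "cda:patient/cda:birthTime",
    }),
    "visit_dimension": (".//*[@classCode='ENC']", {
        "encounter_id": "cda:id",
        "start_date": "cda:effectiveTime",
    }),
}


def field_value(elem):
    """Reduce an element to one string: extension, value, code, or text."""
    if elem is None:
        return None
    return (elem.get("extension") or elem.get("value") or elem.get("code")
            or (elem.text or "").strip() or None)


def rows_from_document(xml_text):
    """Yield (table, row) for every table context point found in one document."""
    root = ET.fromstring(xml_text)
    for table, (context, fields) in MAPPINGS.items():
        for ctx in root.findall(context, NS):
            yield table, tuple(field_value(ctx.find(p, NS)) for p in fields.values())


def collect(documents):
    """Gather the unique rows per table across all documents."""
    tables = {table: set() for table in MAPPINGS}
    for doc in documents:
        for table, row in rows_from_document(doc):
            tables[table].add(row)
    return tables


SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget><patientRole>
    <id root="1.2.3" extension="pat-1"/>
    <patient><birthTime value="19700101"/></patient>
  </patientRole></recordTarget>
  <componentOf><encompassingEncounter classCode="ENC">
    <id extension="visit-9"/>
    <effectiveTime value="20111130"/>
  </encompassingEncounter></componentOf>
</ClinicalDocument>"""

# The same document twice still produces one row per table.
print(collect([SAMPLE, SAMPLE]))
```

A real loader would then write the dimension tables before the fact table, since load order matters for referential integrity.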

My next task is to look at the NQF HQMF documents created by the Measure Authoring Tool, see what I interpreted incorrectly and how well my transforms work against them, and comment on their structure.  While HQMF may be the right standard to represent queries, we will need implementation guidance on how to represent queries in the Query Health environment.  The IHE Quality Measure Definition (ftp to Word document) profile might be one source for that guidance, and I've been drafted to help on that profile.  I'll certainly be taking what I learn from this project into that one.


  1. Keith,
    This post was very timely for me. We are working on a pilot to capture CDA documents registered in an IHE compliant HIE, extract the discrete data in these documents and populate an i2b2 instance to support research queries. We have been playing around with the ONC Meaningful Use, HITSP C32/C83 Example CDA documents (can't seem to figure out where I got these). My question is in the case of a drug allergy to penicillin that contains an entryRelationship that documents the "assertion" of hives … how would one store this in i2b2? Would the hives be some type of modifier on the primary entry of the penicillin allergy stored in the fact table? Any thoughts are greatly appreciated.

    Mike Blechner
    University of Connecticut

  2. Minor nit - a CDA document is "documentation of" a "service event"; it is a "component of" the "encompassing encounter". The "performer" is a participant in the service event; there are other clinical participants in the encompassing encounter, who may be i2b2 Providers, but who may be external to the institution (e.g., referring physician).