Wednesday, January 13, 2021

Supplemental Data Part 2

One of the challenges with Supplemental Data is deciding how to communicate it.  One can send it

  1. in references only, expecting receivers to then retrieve those elements needed,
  2. in references only, having previously communicated the referenced data to the receiver individually,
  3. as contained resources within the MeasureReport,
  4. in a Bundle with the MeasureReport.
There are some challenges with each of these (the last two forms are sketched below).
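Before getting into those challenges, here's a rough sketch of what forms 3 and 4 look like on the wire.  The resources are shown as Python dicts standing in for FHIR JSON; the ids, values, and UUIDs are made up, and most required elements (measure, period, subject, etc.) are elided for brevity.

```python
# Hypothetical supplemental data: one Observation, shown as FHIR JSON via a plain dict.
observation = {
    "resourceType": "Observation",
    "id": "obs-1",   # as a contained resource, this id only has meaning inside the report
    "status": "final",
    "code": {"text": "example lab result"},
}

# Form 3: contained resources within the MeasureReport.
# evaluatedResource points at the contained copy with a local ("#") reference.
report_with_contained = {
    "resourceType": "MeasureReport",
    "status": "complete",
    "type": "summary",
    "contained": [observation],
    "evaluatedResource": [{"reference": "#obs-1"}],
}

# Form 4: a Bundle carrying the MeasureReport and the supplemental data side by side.
# Each entry keeps its own fullUrl, so resource identity now has to be managed.
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"fullUrl": "urn:uuid:00000000-0000-0000-0000-000000000001",
         "resource": {
             "resourceType": "MeasureReport",
             "status": "complete",
             "type": "summary",
             "evaluatedResource": [
                 {"reference": "urn:uuid:00000000-0000-0000-0000-000000000002"}],
         }},
        {"fullUrl": "urn:uuid:00000000-0000-0000-0000-000000000002",
         "resource": observation},
    ],
}
```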
The reference-only model requires the sender to provide a back-channel to respond to queries for specific resources, with all of the concomitant security issues that raises.  Sending an HTTP request out from inside your IT infrastructure doesn't require cutting holes through firewalls; receiving requests does.

The references-with-previously-sent-data model runs into problems with resource identity: eventually two different facilities will produce the same resource identifier.  And if the resources are just forwarded as-is and assigned a new identifier by the receiver, then the measure source has to reconcile those identities.
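To illustrate that reconciliation burden (this is just a sketch of the bookkeeping, not anyone's actual implementation), the receiver essentially ends up keying everything by submitter as well as by id:

```python
import uuid

# Hypothetical reconciliation table: the receiver cannot trust "Patient/123" to be
# unique across facilities, so it keys on (submitting facility, resource type, source id)
# and assigns its own identifier for local storage.
local_ids = {}

def reconcile(facility, resource_type, source_id):
    """Return the receiver's identifier for a resource previously sent by a facility."""
    key = (facility, resource_type, source_id)
    if key not in local_ids:
        local_ids[key] = str(uuid.uuid4())
    return local_ids[key]

# Two facilities can both send "Patient/123" without colliding on the receiver's side,
# but now the receiver owns a mapping that has to be maintained and reconciled.
print(reconcile("hospital-a", "Patient", "123"))
print(reconcile("hospital-b", "Patient", "123"))
```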

The collection of supplemental data can lead to large files, which, for some, violates "RESTful" principles.  The idea in FHIR is to use fine-grained resources, but containment of supplemental data could make the MeasureReport rather large.  Let's do some estimating on size here:

One of the largest hospitals in the US is in New York City, and has something like 2000 beds.  Reviewing hospital inpatient capacity metrics (via HHS Protect) shows me that utilization of inpatient beds ranges from 45% to 85% across the country.  So, if we are building a report on 85% of 2000 possible inpatients, that would be 1700 patients being reported on.  Using a reasonable definition for supplemental data (e.g., what we are using in this Connectathon), let's count the number of contained resources: For each patient, we'd have something like a dozen conditions on average*, one or two encounter resources, the current patient location, maybe a half dozen medications*, two or three lab results, an immunization resource, a DiagnosticReport, and a ServiceRequest.  Call it 30 resources on average for each patient.  Looking at the test resources created for this connectathon, I can see that they average about 3K in size; maybe that would be 4K in real life.  Compressing this data, I can see about a 14:1 compression ratio.

So, 1700 patients * 30 resources * 4KB / resource = 199 MB
And compressed, this yields about 14 MB.

Over typical (residential) network connections, this might be 6 to 30 seconds of data transfer, plus processing at the other end (which can happen while the data is being transferred).  At commercial speeds (1Gbps or higher), this goes to sub-second.  Parse time for 200 MB of data (in XML) is about 7 seconds on my laptop; JSON parses about 5% faster and is about 70% of the size.
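If you want to redo the arithmetic, it's a few lines.  The per-resource size and 14:1 compression ratio are the estimates above; the uplink speeds are assumptions I've chosen to land in the 6-to-30-second range mentioned here, not measured figures.

```python
# Reproducing the sizing estimate for the largest-hospital case.
patients = 1700               # 85% occupancy of ~2000 beds
resources_per_patient = 30
resource_size_kb = 4

total_mb = patients * resources_per_patient * resource_size_kb / 1024   # ~199 MB
compressed_mb = total_mb / 14                                           # ~14 MB at 14:1

print(f"uncompressed: {total_mb:.0f} MB, compressed: {compressed_mb:.0f} MB")

# Transfer times at assumed link speeds (megabits per second).
for label, mbps in [("residential, 4 Mbps", 4),
                    ("residential, 20 Mbps", 20),
                    ("commercial, 1 Gbps", 1000)]:
    print(f"{label}: {compressed_mb * 8 / mbps:.1f} s")
```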

We've got about 6000 hospitals in the US, with an average of 150 beds per hospital, so the average size is going to be around 13 times smaller (and faster), so 15 MB files on average.  I'm more interested in sizes for the top decile (90th percentile).  I can estimate those by looking at a state like New York (which happens to have hospitals at both extremes), and see that the 90% level is reached at hospitals with ~750 beds.  So 90% of hospitals would be reporting in around 1/3 of the time (or less) faced by the largest facility, and sending files of about 4.5 MB or less compressed, or 67 MB uncompressed.  I am unworried by these numbers.  Yes, it's a little big for the typical RESTful exchange, but not at all awful for other kinds of stuff being reported on a daily or every 4-, 8- or 12-hour basis.
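The same sketch extends to the average and 90th-percentile cases.  Bed counts and the roughly 1/3 scaling are the ones used above; the outputs land near the 15 MB, 67 MB, and 4.5 MB figures, give or take rounding.

```python
largest_mb, largest_compressed_mb = 199, 14     # from the estimate above

# Average hospital: ~150 beds vs ~2000 at the largest, so roughly 13x smaller.
avg_scale = 150 / 2000
print(f"average: ~{largest_mb * avg_scale:.0f} MB uncompressed")        # ~15 MB

# 90th percentile: ~750 beds, roughly a third of the largest facility's volume.
p90_scale = 1 / 3
print(f"90th percentile: ~{largest_mb * p90_scale:.0f} MB uncompressed, "
      f"~{largest_compressed_mb * p90_scale:.1f} MB compressed")        # ~66 MB / ~4.7 MB
```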

All of the data needs to be parsed and stored, and that could be a lot of transactional overhead, leading to long delays between request and response.  But my 200 MB file parsing test shows that to be < 7s on a development laptop running WinDoze.  My own experience is that these speeds can be much better on a Linux based system with less UX and operating system overhead.
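That parse-time claim is easy enough to reproduce on your own hardware with something like the following (the file names are placeholders; your numbers will obviously vary with machine and parser):

```python
import json
import time
import xml.etree.ElementTree as ET

def time_parse(path, parser):
    """Time a single parse of the file at `path` using the given parser function."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        parser(f)
    return time.perf_counter() - start

# Placeholder file names: substitute a large Bundle serialized both ways.
print("XML :", time_parse("measure-report-bundle.xml", ET.parse), "s")
print("JSON:", time_parse("measure-report-bundle.json", json.load), "s")
```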

Posting these in a transaction bundle leads to some serious challenges.  You really don't want to have a several-second-long transaction that has that much influence on system indices.  This is where ACID vs. BASE is important.  We really don't need these to be handled in a transaction.
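One way to act on that (a sketch, not a settled design for the Connectathon) is to submit the entries as a FHIR batch rather than a transaction, so each entry is processed on its own rather than as one long atomic commit.  Again, required MeasureReport elements are elided here.

```python
import json

# A batch Bundle lets the server process each entry independently (BASE-ish), where
# type "transaction" would make the whole submission one atomic unit (ACID) and
# hold indices/locks for the duration.
batch = {
    "resourceType": "Bundle",
    "type": "batch",        # vs. "transaction"
    "entry": [
        {
            "request": {"method": "POST", "url": "MeasureReport"},
            "resource": {"resourceType": "MeasureReport",
                         "status": "complete", "type": "summary"},
        },
        # ... one entry per supplemental resource, each succeeding or failing on its own
    ],
}

print(json.dumps(batch, indent=1))
```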

Honestly, I like keeping the supplemental data in the MeasureReport as contained resources.  The value in this is that a) these resources are snapshots in time of something being evaluated and b) the identity of those resources has little to no meaning outside of the scope of the MeasureReport.  In fact, their identity is virtually meaningless outside of that context.  BUT, if we were to instead put them in a bundle with the MeasureReport (a composition-like approach), it could make sense.  Except that then, one would have to make sense of the resource identities of these across multiple submitters, and that's just a big mess that you really need not invite into your storage infrastructure.

In any case, we haven't made a decision yet about how to deal with this, although for this Connectathon, we will be taking the bundle approach.

* These numbers are based on some research I contributed to about 16 years ago based on NLP analysis of a large corpus of data collected over the course of a mid-sized hospital's year treating patients.
 

