Friday, February 19, 2021

Load Testing is the Most Expensive Effort You Will Regret Not Doing

Photos of the I-5 Skagit River Bridge

Time and time again, I've seen major information systems fail under real load.  It happened again today in my home state, when the state's vaccine scheduling system failed shortly after being opened up for scheduling.  I tried using it at least five different times today to schedule myself and my wife.  I finally gave up, realizing that the system was simply under way too much strain.  Thirty-second page loads, timeouts, cryptic error messages never meant to be exposed to an end user: these are all classic symptoms of a system under far more load than it can handle.

The solution is very simple: you have to test systems under the load they are designed to handle.

The problem is, that's one of the most expensive tests software developers perform.  It can take a month or more just to prepare.  A big part of that is simply getting enough synthetic data to test with.  The data has to be good enough to pass your validation checks.  The system has to be designed so that you can run such a test at scale without committing other systems to act on the test data.  It's hard, it's expensive, and often by the time you are ready to load test, the project is already late and the product needs to ship, or at least that's what management says.

And so, against better judgement, the product ships without being tested.  And it fails.  Badly.

Somewhere, a resource gets locked for longer than it should, and that causes contention, and the system slows to an unusable crawl.  It's an unnecessary table lock in a database.  A mutex on a singleton in the runtime library.  A semaphore wrapped inappropriately around a long-running cleanup operation.  Diagnosis takes days, and then weeks.  Sometimes a critical component needs to be rearchitected, or worse, replaced.  At other times, the fix is simple, once you finally find the error.  And sometimes there's a host of optimizations that are needed.  What would have delayed the delivery of the product by weeks now delays it by months... or even years.  Some never recover.
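The arithmetic behind that kind of collapse is worth spelling out.  Here's a hedged back-of-the-envelope sketch (my numbers, purely illustrative): if every request must pass through one serialized critical section, that section alone puts a hard ceiling on throughput, no matter how much hardware you add.

```python
# Hypothetical numbers: a shared lock held for 50 ms per request.
critical_section_ms = 50

# Serialization bound: only one request holds the lock at a time, so
# throughput can never exceed 1000 ms / 50 ms = 20 requests/second.
max_throughput = 1000 / critical_section_ms

# If go-live traffic is 500 requests/second, the system is oversubscribed
# 25x; queues grow without bound and page loads stretch into the
# thirty-second range described above.
expected_load = 500
oversubscription = expected_load / max_throughput
print(max_throughput, oversubscription)  # 20.0 25.0
```

No load test is needed to do this arithmetic, but a load test is usually what reveals that the critical section exists at all.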

Often, the skills necessary to engineer the system for a sustained load are simply absent, to the point that the expected loads and response times were never actually provided as design inputs. Nobody computed them, because the system is new, and nobody knew how to without any real-world experience.

Load testing is the most engineering-intensive effort of “software engineering”.  It’s the kind of effort that distinguishes a true software engineer or system architect from a “computer programmer”.

Discovering that your system won’t operate under the expected load is not the worst thing that can happen to you.  Doing so after you’ve “gone live” is.  Now you have three expensive efforts to address simultaneously:

  1. The political fallout of the disaster to manage.
  2. The recovery effort on the broken data streams created by the failed system.
  3. The load testing you should have done in the first place, including the accompanying remediation of issues found from that effort.
Don’t ever skip load testing, at least if you want to continue to call yourself a software engineer.

Friday, February 12, 2021

Enhancing Search capabilities in HAPI using SUSHI and SearchParameter

I've been using HAPI on FHIR for several years across multiple projects, sometimes with my own back end for storage, and at other times using the vanilla HAPI JPA Server with a database back end.  One of the features of the JPA Server is that you can enhance search capabilities by telling the server how to implement an enhanced search capability by creating a SearchParameter resource on the server.

The FHIR SANER Implementation Guide defines several SearchParameter resources.  The simplest of these enables searching by Encounter.hospitalization.dispositionCode and is a good starter example for those who just need to search on a single field value for a resource where the core FHIR specification doesn't define a search capability.  A more complex example can be found in Search by Code, which enables search on several resource types for a single field.  While technically correct, this example is more complex than the HAPI JPA Server can handle, and I'll talk later in this post about how it might be simplified to enable the JPA Server to handle it.

If you are defining a SearchParameter resource for an HL7 or IHE Implementation Guide, the first thing you need to do is specify a bunch of metadata associated with the resource.  If, like me, you have to do this for a number of different resources, SUSHI has a syntax that enables you to effectively create a macro, a set of instructions that can be included in any resource definition.

The set of instructions I use is:

RuleSet: SanerDefinitionContent
 * status = #draft      // draft until final published
 * experimental = true  // true until ready for pilot, then false
 * version = "0.1.0"    // Follow IG Versioning rules
 * publisher = "HL7 International"
 * contact[0].name = "HL7 Public Health Workgroup"
 * contact[0].telecom.system = #url
 * contact[0].telecom.value = ""
 * contact[1].name = "Keith W. Boone"
 * contact[1].telecom.system = #email
 * contact[1].telecom.value = "mailto:my-e-mail-address"
 * jurisdiction.coding =

We'll change status to #active, experimental will be set to false, and the version will be updated when we publish as a DSTU.  We follow the HL7 conventions for the first contact (the web page for the workgroup responsible for publishing the IG), and I add my contact information as the editor.  The jurisdiction.coding value is set to the value commonly used for Universal guides.

Other metadata you have to create describes the specific search parameter.  I put that directly in the SearchParameter instance:

Instance: SearchParameter-disposition
InstanceOf: SearchParameter
Title: "Search by hospitalization.dispositionCode in Encounters"
 * insert SanerDefinitionContent
 * url = ""
 * description = "This SearchParameter enables query of encounters by disposition to support automation of measure computation."
 * name = "disposition"
 * code = #disposition

The instance name should start with "SearchParameter-", and should be followed by the value of the name you are going to use for search.  This is what you would expect to appear in the query parameter.  For this example, the search would look like "GET [base]/Encounter?disposition=...", so we use disposition as both the name and the code for this parameter.  I'd recommend keeping the name and code the same.

Finally, you need to specify some of the technical details about the Search parameter.  For this example, it applies only to the Encounter resource, and operates against a code.  Here are the additional settings you would need to specify:

 * base[0] = #Encounter
 * type = #token
 * expression = "hospitalization.dispositionCode"
 * xpath = "f:hospitalization/f:dispositionCode"
 * xpathUsage = #normal
 * multipleOr = true
 * multipleAnd = false

  1. The base parameter allows you to identify the resources to which the search parameter applies.
  2. The type parameter indicates which kind of search parameter type to support.
  3. The expression parameter describes how to find the data used for the search in the FHIR Resource using FHIRPath.
  4. The xpath parameter describes how to find the data used for the search using an XPath expression over the FHIR XML.  For the most part, this is simply the same as expression, using different syntax.
  5. Generally, you want to leave xpathUsage set to #normal as above.
  6. If you want the parameter to be repeatable using a logical or syntax, set multipleOr to true.  For this case:
    GET [base]/Encounter?disposition=01,02 would mean: Get all dispositions where the code value is 01 or 02.
  7. If you want the parameter to be repeatable using And, set multipleAnd to true, otherwise set it to false.  If the field in the resource is not repeatable, you can very likely leave this set to false.
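To make the multipleOr semantics concrete, here's a minimal sketch (in Python, with a hypothetical helper of my own; real token matching also considers the system part of `system|code`, which I've left out) of how a server evaluates `disposition=01,02` against a resource's disposition code:

```python
def matches_token_or(param_value: str, resource_code: str) -> bool:
    """A comma-separated token parameter matches when the resource's
    code equals any one of the listed values (logical or)."""
    return resource_code in param_value.split(",")

# GET [base]/Encounter?disposition=01,02
print(matches_token_or("01,02", "02"))  # True: 02 is in the list
print(matches_token_or("01,02", "03"))  # False: 03 is not
```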
For more complex search parameters, you can add multiple resource types for the base parameter.  We did that in the SANER IG to support search by codes in Measure and MeasureReport.

Here's an example of how we did that, but there are some caveats which I'll get into below:

 * code = #code
 * base[0] = #Measure
 * base[1] = #MeasureReport
 * type = #token
 * expression = """
   descendants().valueCodeableConcept | descendants().valueCoding | descendants().valueCode | code | descendants().ofType(Coding).not().code
   """
 * xpath = """
   descendant::f:valueCodeableConcept | descendant::f:valueCoding | descendant::f:valueCode | f:code | f:descendant::f:code[ends-with(local-name(..),'oding')]
   """
 * xpathUsage = #normal
 * multipleOr = true
 * multipleAnd = true

Technically, the above expression and xpath are accurate.  However, this search parameter appears to be too complex for HAPI JPA Server to handle.  I expect that is because it operates over a simplified syntax of either the XPath or expression content, so I will be digging into that.  I think the ofType() or [ends-with...] expressions might be causing problems.  There are other ways that I can make these more explicit which would likely work better.

Tuesday, January 26, 2021

Rethinking Vaccination Scheduling

Thinking about getting shots into arms, scheduling and planning and logistics for this.  There are a lot of resources that you need to keep track of.  Any one of these could be a rate limiting factor. 

  1. Vaccination supplies (Not just doses, but also needles, cleaning and preparation materials, band-aids and alcohol wipes)
    I'm simply not going to address this issue here.  This is logistics and situational awareness, and this post is thinking a lot more about managing and scheduling vaccination appointments.
  2. Space to handle people waiting to check-in
    If you use cars, parking lots, and telephone / text messages for check-in, you can probably handle most of the space needs for check-in, but will not be able to support populations that do not have access to text messaging.  You might be able to address that issue with a backup solution for that population, depending on its size.
  3. People to handle check-in
    The tricky bit here is that the people who handle check-in need to be able to see or at least communicate with the people administering vaccinations so that the queues can continue to move ahead, and see or communicate with people who have checked in and are ready to get their vaccination.  There are a lot of ways this can work, and much of it depends on the facility's layout.  A football stadium provides much more space and opportunity for handling this problem than the typical ambulatory physician office.  With a small enough volume, check-in can be handled by the administering provider; with larger volumes but the right technology, it could still be pretty easy.
  4. Handling Insurance Paperwork
    The biggest challenge with check-in will be the whole insurance paperwork bundle.  Ideally, this could all be addressed before making the appointment.  Because while patients will pay nothing for COVID-19 vaccination, Medicaid, Medicare, and private insurers may still need to pay the providers who administer it.  A smart intermediary could address this by supporting some sort of mass vaccination protocol for linking patients to payers, and vaccinations back to "appointments" (see more on scheduling below).
  5. People to administer vaccinations
    Right now, the two vaccinations use intra-muscular injection and multi-dose vials for administration.  Anyone who's had a family member taking insulin, or been through fertility drug treatment, understands that for the most part the injection part isn't brain surgery, and also doesn't take that long (I've dealt with both in my family).  This is probably the least rate-limiting factor after vaccination supplies.
  6. Space to handle people who've had a vaccine administered, but still need to be monitored for an adverse reaction.  The two to three minutes it takes to administer a shot requires 10+ minutes thereafter and a minimum of 20 square feet of space for each person to whom it is administered to address potential adverse reactions (with 6' social distancing measures).  
  7. People to monitor adverse reactions
    I don't know what the right ratios are here, but one person could monitor multiple patients at a time.
  8. People to treat adverse reactions
    This is a different ratio and skill set than for #7 above.  The skill set to treat a problem is likely more advanced than the one needed to detect it, but you can probably only treat one adverse reaction at a time, and might need to plan for two or more just in case.
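A quick, hedged capacity calculation (my assumptions: the 2-3 minute administration time and 20 square feet per person quoted above, plus an assumed 15-minute observation period) suggests the observation area, not the vaccinator, is the pacing constraint:

```python
minutes_per_shot = 2.5     # midpoint of the 2-3 minutes quoted above
observation_minutes = 15   # assumption: a common minimum monitoring time
sq_ft_per_seat = 20        # from the socially distanced estimate above

# Each vaccinator fills a new observation seat every 2.5 minutes, and each
# seat stays occupied for 15 minutes, so seats needed per vaccinator:
seats_per_vaccinator = observation_minutes / minutes_per_shot
space_per_vaccinator = seats_per_vaccinator * sq_ft_per_seat
print(seats_per_vaccinator, space_per_vaccinator)  # 6.0 120.0
```

In other words, every injection station needs roughly six monitored seats (about 120 square feet) behind it just to keep the line moving.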

And then there's scheduling.
Scheduling for this within a single facility with existing systems in ambulatory or other practice environments would be rather difficult.  Most ambulatory appointment scheduling systems are broadly designed to handle a wide variety of cases, not designed for highly repetitive case loads like a mass vaccination campaign.  The closest thing in the ambulatory space for this is laboratory testing, where specimen collection is more "assembly line" oriented.

For tracking laboratory testing, it's less about the appointment, and more about the order in the ambulatory setting. The order is placed, and when the patient shows up, the specimen collection is done.  If we work on mass vaccination more like that, then scheduling could be a bit easier. The order basically grants you a place in line, but you still have to wait in line until your turn. If you've ever tried to get a blood draw done during lunch hour, you may have been in this situation.  This seems like a better way to go for a mass vaccination campaign.

You no longer get an appointment for a 10-25 minute slot, but instead maybe get a day and possibly a 2-3 hour time period assignment that you are asked to show up within, but can use that slot any time after that point in time. The time period assignment is used to maintain the flow throughout the day, but it's more advisory than a typical ambulatory appointment slot is.

Regional Scheduling
The value of this in scheduling is that each facility can estimate a volume of patients to handle on any given day (or broad time period within a day) based on staffing and other resources.  These volumes can be fed into a system to support scheduling on a daily basis, which can then be used to broadly manage scheduling, not just within a facility, but perhaps even across a state.  If the scheduling system is also capturing insurance information, that could get more complex, but I think more realistically, the scheduling system can be used to feed data to vaccination sites, and the vaccination sites can follow up with the patient out-of-band regarding insurance using whatever system they already have for that.  That's a more flexible approach.  This might mean a 24-48 hour delay between scheduling and first appointment slot availability, though perhaps not, because some providers could set up web sites for patients to register their insurance details, and others might already have the necessary details.  I could use the state system to schedule appointments, and still go to my regular provider for the vaccination.  Providers participating in this might reserve some capacity for their regular patients.

If other service industries can handle scheduling in a 2-3 hour block range to manage their workloads, maybe we can apply this technique to scheduling for a mass vaccination program.  We don't need a precise schedule, we need a sustained rate and flow.  In any case, it's worth thinking about.

This is all just a thought experiment, I'm not going to say it will work, or even that I've convinced myself that it has some value.  It just seems to address one of the key problems in getting shots into arms, which is getting the arms to where the shots can be placed.


Friday, January 22, 2021

Situational Awareness and Disease Surveillance

There's broad overlap between Disease Surveillance efforts and Situational Awareness reporting.  If you look back to early March, former National Coordinator Farzad Mostashari illustrated on Twitter the use of ILI reporting systems to support COVID-19 Situational Awareness.  Surveillance efforts abound: Biosurveillance, ELR, ECR, ILI, reportable/notifiable conditions, et cetera.

Surveillance efforts can be classified a couple of different ways: at the very least, a) what you are looking for, and b) how you respond to that event.  You are either looking for a known signal (e.g., ILI, or a reportable/notifiable condition), or simply a deviation from a normal signal (e.g., biosurveillance, and to some degree ILI).  You can (especially for known signals) trigger a predefined intervention or response, or "investigate", or simply communicate the information for decision making at various levels.  COVID-19 dashboards which show hospital / ventilator capacity are often used to support various kinds of decisions (e.g., support and supply), as well as to communicate risk levels to the public.

If you think about such efforts as reporting on bed capacity or medication usage related to COVID-19, you need to be able to a) check lab results and orders, b) evaluate collections of patient conditions (e.g., to detect suspected COVID-19 based on combinations of symptoms) and c) examine medication utilization patterns.  All of this can also be used to support various kinds of surveillance efforts.

Surveillance goes somewhat deeper than situational awareness.  The most common case is turning a positive signal for identifying a case into a case report for follow-up investigation, as in Electronic Case Reporting.  This goes beyond basic Situational Awareness reporting.  Case reporting can get rather involved, going deep into the patient record.  Where SA efforts are more aligned is when the initial data needed (e.g., as for an Initial Case Report) is fairly well defined.  For that, we have been defining mechanisms whereby the supplemental data reported in the measure can also be used to support those sorts of efforts.

The overlap between these efforts points to one thing, which is a general need to address the inefficiencies of multiple reporting efforts to public health.  The various reporting silos exist because of a fractured healthcare system, fractured funding, and the various legal and regulatory challenges for consolidating public health data.  There's NO quick fix for this, it will likely take years to get to a consolidation of methods, standards and policies across public health reporting initiatives, but it's something that's worth looking into.

Thursday, January 21, 2021

Normalizing FHIR Queries using FluentQuery

One of the advantages of declarative specifications is that they tell a system what needs to be done without specifying how to accomplish it.  When I created FluentQuery for the SANER IG, it had a lot more to do with making FHIR queries clearer than with providing a declarative form for query specifications, but it effectively does create a declarative form (because FHIRPath is a declarative language).

My initial implementation of this form is rather stupid: it just builds up the appropriate FHIR query string to accomplish what has been asked for.  After going through Connectathon last week, we learned more about the differences in endpoints, e.g., a FHIR server vs. an EHR system implementing 21st Century Cures APIs, and I wound up having to "rewrite queries" to support the simpler syntaxes supported by the EHR.  This was completely expected, but what I hadn't realized beforehand was that I could actually automate this process in the FluentQuery implementation itself.

What I'd accidentally done by creating FluentQuery was to enable interoperability across varying FHIR implementations, so that a FHIRPath interpreter implementing FluentQuery could allow a user-defined query to be implemented partially or fully by the target FHIR server, with the rest of the filtering being handled by the receiver.  Such an implementation would allow a MeasureComputer actor to adapt to the servers being queried based on their CapabilityStatement resources or other known factors within an implementation.

Let's look at some examples, starting with including().  Here's a query using including:

findAll('Encounter',including('subject','diagnosis','reasonReference'), …).onServers(%Base)

My simplistic implementation simply writes this out as the corresponding FHIR query string and executes it using the resolve() function of FHIRPath.  My implementation of resolve already does some additional work here, collecting all result pages when there is more than one page of results.


If the server supports includes, but does not support specific kinds of includes, the query can be written with _include=Encounter:*, and the resulting included references in the resulting Bundle can be filtered out of the results after the fact by removing those that aren't necessary after collecting all pages.
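A sketch of that post-filtering step (in Python; the entry representation here is my own simplification, not the HAPI Bundle model): after querying with _include=Encounter:*, drop any included resource that wasn't on the original include list.

```python
# Includes the original query asked for (hypothetical example set).
WANTED_INCLUDES = {"subject", "diagnosis", "reasonReference"}

def post_filter(entries, wanted=WANTED_INCLUDES):
    """entries: (search_mode, include_field, resource_ref) tuples.
    Keep all 'match' entries; keep 'include' entries only when they
    were included for a field the original query requested."""
    return [e for e in entries
            if e[0] != "include" or e[1] in wanted]

entries = [("match", None, "Encounter/1"),
           ("include", "subject", "Patient/2"),
           ("include", "serviceProvider", "Organization/3")]
print(post_filter(entries))
# keeps Encounter/1 and Patient/2, drops Organization/3
```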

If the server does not support includes, then the resulting Bundle can be post-processed to call resolve() on the elements that should have been included, as if it was written:

   .select(Encounter | Encounter.subject.resolve() | encounter.diagnosis.condition.resolve() |

I've got several queries that work with ValueSet resources such as this one below:

findAll('Observation', for('patient', $, with('code').in(%Covid19Labs) )

If the server supports value set queries, the query would be written as:


But, if it does not support value set queries, but does support multiple parameters, and the value set is small (e.g., less than 10 values), it can be rewritten as:


And if it does not support multiple values in code, then it can be rewritten as multiple queries, and the resulting Bundles can be merged:


And finally, if it does not support the query parameter at all, it can be post-filtered as if it was written:

findAll('Observation', for('patient', $
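The rewriting ladder above can be sketched as a small planner (in Python; the capability flags and the use of the token `:in` modifier are my own illustration of the idea, not the SANER implementation):

```python
def plan_queries(codes, valueset_url, supports_valueset, supports_multi_or):
    """Return the FHIR query string(s) to issue; when more than one is
    returned, the caller merges the resulting Bundles."""
    if supports_valueset:
        # one query using the token :in modifier against the value set
        return [f"Observation?code:in={valueset_url}"]
    if supports_multi_or:
        # one query with a comma-separated (logical or) list of codes
        return ["Observation?code=" + ",".join(codes)]
    # one query per code; merge the result Bundles afterwards
    return ["Observation?code=" + c for c in codes]
```

If the server supports none of these, the last resort is to issue an unfiltered query and post-filter the results, as described above.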

FluentQuery turns out to be a lot more powerful than I had expected.  A dumb implementation is VERY easy to build, taking about 500 lines of Java.  The findAll and whereExists methods write the query expression by writing out the resource name, a ?, and the remaining arguments to the function separated by & characters.  Each individual function argument writes the appropriate query expression, where with() operates by simply writing out the field name, and the various comparison expressions concatenate that with the appropriate comparison operations.  And onServers simply prepends the server names to the query and calls resolve().
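That string-building strategy is easy to picture.  A minimal sketch (in Python, not the actual Java):

```python
def find_all(resource: str, *params: str) -> str:
    """Build '<resource>?<param1>&<param2>...' -- the 'dumb'
    query-writing strategy described above."""
    return resource + "?" + "&".join(params)

query = find_all("Encounter",
                 "_include=Encounter:subject",
                 "status=in-progress")
print(query)  # Encounter?_include=Encounter:subject&status=in-progress
```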

To change that to support an adaptive implementation, I would instead write out each comparison expression as a collection of Parameters (using the Parameters resource), and then parse the list of parameters in onServers based on the CapabilityStatement resource for the servers given in arguments to it to execute the appropriate FHIRPath expression.

In fact, I expect I might even compile some small FHIRPath expressions in onServers() that are associated with the Parameters resource used to memoize the query operation.  In HAPI, I can probably store the compiled expressions in the Parameters resource as user data.  If I wanted to get fancier, I could even rewrite the compiled expression.

The rest is, as a former colleague of mine liked to say, a simple matter of programming.

Tuesday, January 19, 2021

Automating FHIR IG Builds From Structured Data

This is likely to be a two or three part series.  Over the last year I've worked on four different FHIR Implementation Guides: SANER, the Mobile Health Application Data Exchange Assessment Framework, V2 to FHIR, and the IHE ACDC Profile.  Each of these suffers from the "Lottery Problem": if I win the lottery and decide to retire, I'm going to have to train others how to run the tool chain that builds the content that the IG Publisher uses to generate the IG.

I hate having the lottery problem, because it puts me in the position of being the one who has to both maintain the toolchain and run the builds when something important changes.  I can't get away with not maintaining the toolchain, but what I can do is make it a lot simpler, so that it's obvious to anyone how to run it, and I've been working through that process as I advance the tools I'm using.  These days, my most recent project (SANER) relies only on Saxon's XSLT transformer and PlantUML, which is better than a lot of custom Java code.

But that still leaves me in the Build Meister role, and I'd love to get out of that.  So I spent some time over the holidays and thereafter working towards getting GitHub Actions to take over the work.  I'll work backwards from most recent to least recent to simplify (easy to hard), starting with SANER.

SANER runs four processes to complete a build: two XSLT 2.0-based transforms of SANER.xml, a PlantUML transform of the outputs those produce to generate UML images from the images-source folder of the build, and finally the IG Publisher.  I DON'T need to automate the IG Publisher, because the FHIR AutoBuilder handles that for me using the latest publisher and GitHub webhooks.

But: I DO want to automate the transformations, and I'd also love to have an easy way to run the publisher locally from Eclipse or the command line.

The trick here is to use GitHub Actions.  There are at least three actions I need:

  1. Checkout the Files that have been pushed on a feature branch.
  2. Run a mvn build to perform the translations
  3. Check in the updated files.
The Maven build will handle the translations using two plugins:
  1. The Maven XML Plugin
  2. The PlantUML Plugin
Finally, I'll also enable the FHIR IG Publisher to be run using Maven through the Exec Maven Plugin during the "site" phase of the build.
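The three steps above can be sketched as a workflow file.  This is a hedged example of what it might look like (the workflow name, branch filter, Java version, and commit message are my assumptions, not the actual SANER build):

```yaml
name: run-transforms
on:
  push:
    branches-ignore: [ master ]   # feature branches only
jobs:
  transform:
    runs-on: ubuntu-latest
    steps:
      # 1. Check out the files pushed on the feature branch
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
          java-version: '11'
      # 2. Run the Maven build to perform the translations
      - run: mvn -B package
      # 3. Check in the updated files
      - run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add -A
          git diff --cached --quiet || git commit -m "Regenerate transformed content"
          git push
```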

Wednesday, January 13, 2021

Supplemental Data Part 2

One of the challenges with Supplemental Data is deciding how to communicate it.  One can send it:

  1. in references only, expecting receivers to then retrieve those elements needed,
  2. in references only, having previously communicated the referenced data to the receiver individually,
  3. as contained resources within the MeasureReport,
  4. in a Bundle with the MeasureReport.
There are some challenges with each of these.
The reference-only model requires the sender to provide a back-channel to respond to queries for specific resources, with all of the concomitant security issues that raises.  Sending an HTTP request out from inside your IT infrastructure doesn't require cutting holes through firewalls; receiving requests does.

The references-with-previously-sent-data model runs into problems with resource identity: eventually, two different facilities will produce the same resource identifier.  And if the resources are just forwarded as-is and assigned a new identifier by the receiver, then the measure source has to reconcile those identities.

The collection of supplemental data can lead to large files, which, for some, violates "RESTful" principles.  The idea in FHIR is to use fine-grained resources, but containment of supplemental data could make the MeasureReport rather large.  Let's do some estimating on size here:

One of the largest hospitals in the US is in New York City, and has something like 2000 beds.  Reviewing hospital inpatient capacity metrics (via HHS Protect) shows me inpatient bed utilization ranging from 45% to 85% across the country.  So, if we are building a report on 85% of 2000 possible inpatients, that would be 1700 patients being reported on.  Using a reasonable definition for supplemental data (e.g., what we are using in this Connectathon), let's count the number of contained resources: for each patient, we'd have something like a dozen conditions on average*, one or two encounter resources, the current patient location, maybe a half dozen medications*, two or three lab results, an immunization resource, a DiagnosticReport, and a ServiceRequest.  Call it 30 resources on average for each patient.  Looking at the test resources created for this connectathon, I can see that they average about 3K in size; maybe that would be 4K in real life.  Compressing this data, I can see about a 14:1 compression ratio.

So, 1700 patients * 30 resources * 4 KB/resource ≈ 199 MB.
And compressed, this yields about 14 MB.
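Those numbers check out.  A quick verification in Python (treating 1 MB as 1024 KB):

```python
patients = 1700           # 85% occupancy of 2000 beds
resources_per_patient = 30
kb_per_resource = 4

total_kb = patients * resources_per_patient * kb_per_resource
total_mb = total_kb / 1024
compressed_mb = total_mb / 14   # observed ~14:1 compression ratio

print(round(total_mb), round(compressed_mb, 1))  # 199 14.2
```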

Over typical (residential) network connections, this might be 6 to 30 seconds of data transfer, plus processing at the other end (which can happen while the data is being transferred).  At commercial speeds (1 Gbps or higher), this drops to sub-second.  Parse time for 200 MB of data (in XML) is about 7 seconds on my laptop; JSON is about 5% faster and about 70% of the size.

We’ve got about 6000 hospitals in the US, with an average of 150 beds per hospital, so the average report is going to be around 13 times smaller (and faster): 15 MB files on average.  I'm more interested in sizes for the top decile (90th percentile).  I can estimate those by looking at a state like New York (which happens to have hospitals at both extremes), and see that the 90th percentile is reached at hospitals with ~750 beds.  So 90% of hospitals would be reporting in around 1/3 of the time or less than that faced by the largest facility, and sending files of about 4.5 MB or less compressed, or 67 MB uncompressed.  I am unworried by these numbers.  Yes, it's a little big for the typical RESTful exchange, but not at all awful for data being reported on a daily or every 4-, 8-, or 12-hour basis.

All of the data needs to be parsed and stored, and it could be a lot of transactional overhead, leading to long delays between request and response.  But my 200 MB file parsing test shows that to be < 7s on a development laptop running WinDoze.  My own experience is that these speeds can be much better on a Linux-based system with less UX and operating system overhead.

Posting these in a transaction bundle leads to some serious challenges.  You really don't want to have a several second long transaction that has that much influence on system indices.  This is where ACID vs. BASE is important.  We really don't need these to be handled in a transaction.

Honestly, I like keeping the supplemental data in the MeasureReport as contained resources.  The value in this is that a) these resources are snapshots in time of something being evaluated, and b) the identity of those resources has little to no meaning outside of the scope of the MeasureReport.  In fact, their identity is virtually meaningless outside of that context.  BUT, if we were to instead put them in a bundle with the MeasureReport (a composition-like approach), it could make sense.  Except that then, one would have to make sense of the resource identities across multiple submitters, and that's just a big mess that you really need not invite into your storage infrastructure.

In any case, we haven't made a decision yet about how to deal with this, although for this Connectathon, we will be taking the bundle approach.

* These numbers are based on some research I contributed to about 16 years ago based on NLP analysis of a large corpus of data collected over the course of a mid-sized hospital's year treating patients.