
Wednesday, October 12, 2011

The Effort of Usability Testing

I'm continuing my review of the NIST Technical Evaluation, Testing and Validation of the Usability of Electronic Health Records (NISTIR 7804).  Today I'm going to review Chapters 6 and 7, and Appendices A through E and G.  I'm just going to focus on estimating the costs of the evaluations suggested.

Expert Review
Chapter 6 describes a process using expert review.  Appendix B provides a 22-page sample Expert Review form.  I have lots of quibbles with content on this form, but I'll save that for tomorrow's post (well, OK, it is 2:00am, Thursday's post).  There are 198 items in the sample form; I'm going to round up to 200 for simplicity.  I've also run through the MU Stage 1 criteria and found about 20 "screens" or "pages" that need to be reviewed, twice by each expert.  I'm going to guess that an expert needs about 10 minutes a "screen" to complete the checklist (see caveats below), giving me about 200 minutes, or slightly more than 3 hours.  Preparing the final report from the expert review would need perhaps three hours to review the 200 items together (less than one minute per item), and another six by a single expert to complete the report.  To make it less expensive, I'm going to assume only two expert reviewers.  This is a little over 18 hours; to make my numbers easier, I'm going to round up to 20 hours total.  Now, I'm going to assign a loaded cost of $150/hr for the expert reviewers (based on the skill levels suggested, which describe a pretty senior level staffer).  That's $3000 for the review.  Not too bad, assuming you have the skills on staff.  If you don't, it will probably take longer, require more training (of the reviewers in your system), and cost a great deal more.
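The arithmetic behind this estimate can be sketched in a few lines of Python.  This is only a rough sketch under stated assumptions: the function and parameter names are made up for illustration, the three-hour joint review is counted once per reviewer (which is what gets the total to "a little over 18 hours"), and rounding up to the next multiple of five hours is just one way to land on 20.

```python
import math

def expert_review_hours(screens=20, minutes_per_screen=10, reviewers=2,
                        joint_review_hours=3, report_hours=6):
    """Total person-hours for the expert review (all inputs are guesses)."""
    # Each reviewer works through the checklist on every screen.
    checklist = reviewers * screens * minutes_per_screen / 60
    # Assumption: the joint review of the items costs every reviewer's time.
    joint = reviewers * joint_review_hours
    # The final report is written by a single expert.
    return checklist + joint + report_hours

def round_up(hours, step=5):
    """Round up to the next multiple of `step` hours (assumed rounding rule)."""
    return math.ceil(hours / step) * step

def expert_review_cost(loaded_rate=150, **kwargs):
    """Cost in dollars at a loaded hourly rate, using the rounded hours."""
    return round_up(expert_review_hours(**kwargs)) * loaded_rate

if __name__ == "__main__":
    print(f"raw hours:     {expert_review_hours():.1f}")
    print(f"rounded hours: {round_up(expert_review_hours())}")
    print(f"cost:          ${expert_review_cost()}")
```

Changing `reviewers` or `minutes_per_screen` shows how quickly the number moves if your staff are slower on the first few passes.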

Caveats:  If you are using your own staff, and they are already familiar with their checklists, this will go quickly, because they'll spot stuff right away and know where it goes.  If not, the first few times it will go a lot slower, and subsequently the pace will pick up.  After five or ten times, they'll be able to do it more quickly.

My guesstimates on cost and effort are just that: guesstimates.  Your mileage may vary.

The expert review sheets that NIST provides are very detailed.  This is a tremendous piece of work, but it has several gaps/overlaps/holes in it that require additional expert review before it could be used by an organization.  I'll get into some of the details tomorrow, but just to show some things to watch out for:  Some guidelines (e.g., 7.2 Have dots or underscores been used to indicate field length? 8.6 Are long columnar fields broken up into groups of five, separated by a blank line? 9.3 Do expert users have the option of entering multiple commands in a single string?), go back to green-screen CRT terminal days.  Others deal with questions about "windowing environments" that are more often than not features supplied by the OS, not the application, and could easily be skipped after one assessment.  So, as a template, these forms are good starting points, but would need to be customized.  Even so, customization of existing work is a heck of a lot easier than starting from scratch.

Validation Testing
Chapter 7 describes the protocol for Validation Testing EHR Applications.  Chapter 7 is very detailed, well thought out, and extremely thorough.  But I'm rather afraid that it does not scale.  One of the challenges of this section is that a single EHR is used by a great number of different kinds of users, with regard to specialty, scope of practice and licensure, et cetera.  I don't know if a typical EMR platform needs one, two or ten of these summative evaluations, but I know it probably needs more than one if customized for a variety of different specialties.  The protocol described is not one that many smaller organizations could readily support given the extensive nature of it, the facilities required, and the time and effort needed to be invested in it.

In chapter 7, a test group size of 10-20 is suggested, but no specific recommendation is given.  Members of the test group may need training depending upon how they are selected for the test, taking from 0 - 2 hours. The test scenarios, as described require a participant, the test administrator, and a data logger.  According to the protocol, participants are individually tested (although they may be trained in a group).  Appendices C through E present 22 tasks in three separate sample scenarios to test.  These samples cover scenarios for Ambulatory care, ED (incorrectly described as "inpatient") care, and inpatient care in an ICU, but do not fully cover usability for all requirements found in meaningful use (Stage 1).

Observation by the data logger and test administrator (skilled observers) is a critical component of the protocol, meaning they are present and active during every task and scenario.  In looking over the tasks, and the various steps needed for each, I'm estimating a task takes from 2-5 minutes, with the most likely time being three minutes.

Appendix G provides a sample tester's guide from which I estimate a few other items of interest:

  1. Consent/NDA and other administrivia: 5 minutes
  2. Introduction/Overview: 5 minutes
  3. Preliminary questions: 5 minutes
  4. Setting up a task: 1 minute x # tasks
  5. Task performance: 3 minutes x # tasks
  6. Rating the task: 1 minute x # tasks
  7. Final questions: 5 minutes
  8. Usability ratings: 5 minutes

To be reasonable, I don't expect a participant to perform well if they are engaged for more than about 90 minutes of their time without a break, and given the participants we are talking about (doctors and nurses), getting that much of their time plus travel to the testing site, et cetera, will also be a challenge.  The guide mentions compensation for the participants' time, but offers no guidance on what that compensation should be, or what applicable federal and state regulations regarding such compensation must also be considered.

I'm going to look at a simple scenario in which testing is done at the participants' site, and each participant requires minimal training (30 minutes) and tests two scenarios with a total of 12 tasks, with 10 participants.  Participants may test the same two scenarios, or multiple scenarios may be spread out over the 10 participants; it doesn't really matter for this example.
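As a sanity check on the 90-minute ceiling, here is a minimal Python sketch that adds up the per-session times estimated from Appendix G for a given number of tasks.  The step names and function names are illustrative only, the minute values are the guesses listed above, and training time is left out because it may happen in a group before the session.

```python
# Steps that happen once per session, in minutes (estimates from the list above).
FIXED_STEPS = {
    "consent and administrivia": 5,
    "introduction": 5,
    "preliminary questions": 5,
    "final questions": 5,
    "usability ratings": 5,
}

# Steps repeated for every task, in minutes.
PER_TASK_STEPS = {
    "setup": 1,
    "performance": 3,
    "rating": 1,
}

def session_minutes(n_tasks):
    """Estimated length of one participant session, excluding training."""
    return sum(FIXED_STEPS.values()) + n_tasks * sum(PER_TASK_STEPS.values())

def max_tasks(limit_minutes=90):
    """Largest number of tasks that fits within the time limit."""
    spare = limit_minutes - sum(FIXED_STEPS.values())
    return max(spare // sum(PER_TASK_STEPS.values()), 0)

if __name__ == "__main__":
    print(session_minutes(12))   # the 12-task scenario used in this example
    print(session_minutes(22))   # all 22 sample tasks in one sitting
    print(max_tasks(90))         # tasks that fit before a break is needed
```

With these guesses, 12 tasks comes in just under the 90-minute mark, while running all 22 sample tasks in one sitting would go well past it.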

Recruitment:  Getting a list of potential participants... this could be easy or hard, let's call it easy.  You need to identify more willing participants than you will eventually use, and will need to schedule more than the 10 needed (to deal with no-shows, emergencies, et cetera).  Call it a day of effort by the recruiter to identify the necessary folks, and I think that's being generous.

Each participant is going to require face time for 90 minutes with two other people, so that's 15 hours, or two person-days for the test administrator(s) and data logger(s).  Assuming the training can be done in one or two groups, that's another 30-60 minutes, so call it two person-days total.  Add another day for preparing the summative report based on all inputs from all participants.  So, four person-days total.  I'm going to use a (again, I think generous) loaded rate of $125/hr for both the administrator and the data logger performing the test, and assume that one or both do double duty as trainer.  (Note: That cost could be for a consulting team, or the loaded personnel costs for staff who do the work.)

So, $1000/day x 4 person-days = $4000.  You've covered 12 tasks, which would account for about 25% of the MU Stage 1 requirements if each took one task to perform.  There are 44 Stage 1 Testing Criteria.
I am estimating that each MU Stage 1 requirement could consume anywhere from 1-5 tasks, with more than one being very common because the sample tests address the same MU criteria in more than one way to deal with different variations important to usability (e.g., adding vs. revising a medication).

So, to completely cover MU Stage 1, with just 10 providers, who aren't compensated, would require repeating that test process 7 more times if the average number of tasks was 2 per MU criterion.  So, now we are up to $32,000 to fully cover the Stage 1 criteria for usability.  This, by the way, is about what an organization pays per product to a certifying body under meaningful use (not what it fully costs, just the fees that they must pay).
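The scaling from one round of testing to full Stage 1 coverage can be sketched as follows.  This is a back-of-the-envelope Python sketch, not a costing tool: the names are invented for illustration, an 8-hour day is assumed (which is what makes $125/hr come to $1000/day), and the person-day and task counts are the guesses from this post.

```python
import math

HOURS_PER_DAY = 8            # assumption: 8-hour day, so $125/hr is $1000/day
LOADED_RATE = 125            # dollars per hour, administrator and data logger
PERSON_DAYS_PER_ROUND = 4    # testing, training, and report for one round
TASKS_PER_ROUND = 12         # tasks covered by one round with 10 participants
STAGE_1_CRITERIA = 44        # Stage 1 testing criteria

def round_cost():
    """Cost in dollars of one round of validation testing."""
    return PERSON_DAYS_PER_ROUND * HOURS_PER_DAY * LOADED_RATE

def rounds_needed(tasks_per_criterion):
    """Rounds required to cover every criterion; a partial round counts as one."""
    total_tasks = STAGE_1_CRITERIA * tasks_per_criterion
    return math.ceil(total_tasks / TASKS_PER_ROUND)

def full_coverage_cost(tasks_per_criterion):
    """Cost in dollars to cover all Stage 1 criteria at the given task density."""
    return rounds_needed(tasks_per_criterion) * round_cost()

if __name__ == "__main__":
    for tasks in (1, 2, 5):
        print(tasks, rounds_needed(tasks), full_coverage_cost(tasks))
```

At two tasks per criterion this gives eight rounds (the first plus seven more); the one-to-five range shows how sensitive the total is to how many tasks each criterion really needs.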

More caveats:  My guesstimates are just that, again!

Some Additional Observations
Now, does every item in MU need to be covered?  Can you use a leaner model?  Quite likely.  One of the issues here is figuring out which of the MU items might need to be covered by usability testing, and at what level of detail.  For example, e-Prescribing is a high-complexity, high-risk area that probably needs a lot of testing.  But entry of MOST patient demographic information probably needs less, with the exception of information that does have patient safety risks.  This is one of the places where a risk assessment will help organizations identify what to test for usability.  That's interesting, because the NIST protocols don't explicitly mention the use of "risk assessments" to guide usability testing.  I suspect this is because it is implicit but not explicitly called out in their step one:
"During the design of an EHR, the development team incorporates the users, work settings and common workflow into the design. Two major goals for this step that should be documented to facilitate Steps Two and Three are: (a) a list of possible medical errors associated with the system usability, and (b) a working model of the design with the usability that pertains to potential safety risks."
What makes this kind of testing scalable is the use of risk-assessment-guided review and validation of the safety-critical issues it identifies.

Now, here's the kicker.  Safety-based usability assessment is a somewhat different animal than usability assessments meant to address market adoption barriers.  I easily fell into the trap of assuming the two were the same until I wrote out this summary, and then went back to Monday's post.  From rereading that: safety issues are part of the need for usability testing, but other issues include productivity and morale.

Risk assessment techniques can still be used to make the usability testing more useful, but the risks must address adoption barriers, only one of which is safety issues.  And that is no longer a technical question, but a marketing one.

Which brings me back to Appendix A (which you probably thought I forgot by now).  The human factors experience cited in this appendix focuses principally on high-risk activities and safety, not experience with market adoption barriers.  These include: War (US Army), Nuclear Power (NRC), Flight (FAA), and Medical Devices (FDA).

I don't know how to address market adoption barriers with usability testing, and unfortunately, the one person that immediately comes to mind to ask died last week. God bless you Steve.  Here's hoping we learn from your examples.

-- Keith

P.S.  I promise not to focus the whole week on this topic, you will hear some more about IHE PCC meetings later this week.

