Thursday, September 13, 2018

Sorting Codes and Identifiers for Human Use


These pictures demonstrate one of those little annoyances that has bothered me for some time.

 

Have you ever used a tool to display a list of OIDs, codes or other data?  Were you surprised by the order the codes came back in after sorting?  Most of us who work with these things often are, but only for a moment.  However, that brief moment of cognitive dissonance is just enough to distract me from whatever I'm doing, and so it takes me just a little bit longer to finish.

The first example above came from searching LOINC for blood pressure, then sorting by LOINC code.  Notice that all the 8###-# series codes appear after the 7####-# and 8####-# series (starting at page 2).  The second comes from a search of Trifolia for Immunization, sorting by identifier.  Look at the order of the identifiers used for Immunization entries in the HIV/AIDS Services Report guide.  Note that urn:oid:2.16.840.1.113883.10.20.31.3.36 comes before urn:oid:2.16.840.1.113883.10.20.31.3.4.

The problem is that codes made up of multiple components, some of them numeric, don't sort the way we expect when you apply alpha sorting rules to the whole string.  An alpha sort compares character by character, so "36" lands before "4" because '3' comes before '4'.
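To make the annoyance concrete, here is a minimal sketch in Java of what a plain alpha sort does to the two identifiers mentioned above.  The class and method names (AlphaSortDemo, alphaSorted) are made up for illustration and don't come from any tool discussed here.

```java
import java.util.Arrays;

/** Hypothetical demo class: shows how a plain String sort orders codes. */
final class AlphaSortDemo {

    /** Returns a copy of the codes sorted by natural (character by character) String order. */
    static String[] alphaSorted(String... codes) {
        String[] copy = codes.clone();
        Arrays.sort(copy); // compares char by char, so '3' < '4' decides "36" vs "4"
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = alphaSorted(
            "urn:oid:2.16.840.1.113883.10.20.31.3.4",
            "urn:oid:2.16.840.1.113883.10.20.31.3.36");
        // The identifier ending in .36 is listed before the one ending in .4.
        for (String s : sorted) {
            System.out.println(s);
        }
    }
}
```

The same thing happens with LOINC-shaped strings: "75997-7" sorts before "8480-6" because the comparison is decided by the first character.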

The "right" way to sort codes is to divide them up at their delimiters, and then compare them component-wise, in the right order, using whichever sort (alpha or numeric) is appropriate for each component.

For LOINC, the right sort is numeric for each component, and the delimiter is '-'.  For OIDs, the right sort is numeric for each component, and the delimiter is '.'.  For both of these, the right order is simply sequential.  For HL7 urn: values, it gets a bit weird.  Frankly, I want to sort them first by their OID, then by their date, and finally by the urn:oid: or urn:hl7ii: prefix, with urn:oid: coming first because it is effectively equivalent to a urn:hl7ii: without an extension part.

A comparator for codes that would work for most of the cases I encounter would do the following:

  1. Split each code into an array of strings at any run of consecutive delimiters: code.split("[-/.:]+").
  2. Then, compare the two arrays component by component, stopping at the first pair of components that differs.
  3. If one component is numeric and the other is not, the numeric component sorts first.
  4. If both components are numeric, order them numerically.
  5. Otherwise, order them alphabetically, using a case-sensitive collation sequence.

NOTE: The above does not address my urn:hl7ii: and urn:oid: challenge, but I could live with that.
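The steps listed above can be sketched in Java roughly as follows.  This is only one way to read them, not a finished implementation: the class name CodeComparator is made up, treating "numeric" as "all ASCII digits" is my assumption, and so is the tie-break that puts the shorter code first when one code runs out of components.

```java
import java.math.BigInteger;
import java.util.Comparator;

/** Hypothetical comparator: compares codes component by component. */
final class CodeComparator implements Comparator<String> {

    @Override
    public int compare(String a, String b) {
        // Step 1: split at any run of consecutive delimiters.
        String[] x = a.split("[-/.:]+");
        String[] y = b.split("[-/.:]+");

        // Step 2: walk both arrays until a pair of components differs.
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            boolean xNum = isNumeric(x[i]);
            boolean yNum = isNumeric(y[i]);
            int c;
            if (xNum && yNum) {
                // Step 4: both numeric, compare by value (BigInteger avoids overflow).
                c = new BigInteger(x[i]).compareTo(new BigInteger(y[i]));
            } else if (xNum != yNum) {
                // Step 3: the numeric component sorts before the non-numeric one.
                c = xNum ? -1 : 1;
            } else {
                // Step 5: case-sensitive alphabetic comparison.
                c = x[i].compareTo(y[i]);
            }
            if (c != 0) {
                return c;
            }
        }
        // Assumption: when all shared components are equal, the shorter code sorts first.
        return Integer.compare(x.length, y.length);
    }

    /** Assumption: "numeric" means a non-empty string of ASCII digits. */
    private static boolean isNumeric(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < '0' || ch > '9') {
                return false;
            }
        }
        return true;
    }
}
```

Passing new CodeComparator() to Arrays.sort or List.sort puts urn:oid:2.16.840.1.113883.10.20.31.3.4 ahead of urn:oid:2.16.840.1.113883.10.20.31.3.36, and a LOINC-shaped "8480-6" ahead of "75997-7".  Like the list it follows, it treats urn and oid as ordinary alpha components, so it does nothing special for the urn:hl7ii: case.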

There are more than 50,000 LOINC codes, 68,000 ICD-10 codes, 100,000 SNOMED CT codes, 225,000 RxNorm codes, and 350,000 NDC codes.  That is roughly 793,000 codes in all.  If sorting codes the right way saved us all just 1/2 second per code in each of those vocabularies, that would add up to about four and a half days of extra time in our lives.

I don't know about you, but I could sure use that time.

