Monday, July 17, 2017

The Vocabulary of Vocabularies for the non-Vocabularist

In informatics and health IT standards circles we talk about Ontology in a way that assumes everyone understands what that means. Yet the ontology of ontologies is rather complex and confusing, in part because the definitions are, in some ways, rather "meta".

What follows are my own definitions, written to be intelligible to a software engineer without informatics or health IT standards training.

Term: A word or phrase describing a thing, and associated with a concept (see Concept).
Preferred Term: Different words or phrases can be used to describe the same thing. A heart attack can also be described as a myocardial infarction. The preferred term is the term for that thing that is preferred above all others. Preferred terms are often those with the most "crisp" meaning (e.g., myocardial infarction in the example above).
Concept: The abstract idea associated with a thing. This is the idea that is (intended to be) brought to mind when one uses a term to describe a thing. A good concept has a definition associated with it that allows a user to understand its meaning. In my not so humble opinion, the position of a term in an ontology is an insufficient definition because it is only intelligible to those well versed in the structure of the ontology.
Code: An identifier associated with a concept. It is a string of symbols (usually digits, but it can also include letters and punctuation) that uniquely identifies the concept within the system using that code (see System below). Codes can be meaningless (e.g., as used in LOINC), or can express structure (usually hierarchy), as in ICD or the Healthcare Provider Taxonomy.
Check Digit: A digit or character in the code used to verify that the string used as a code is valid and is not the result of a data entry error. Check digits help detect various kinds of data entry errors, including deletion, insertion, and transposition. LOINC uses the MOD10 algorithm, and SNOMED uses the Verhoeff algorithm, to compute check digits.
Vocabulary: A collection of terms describing things. A good vocabulary has both the terms and associated definitions of what those terms mean, written so that the users of that vocabulary can understand them.
Value Set: A managed collection of terms describing related things, associated with some sort of identifier so that it can be referenced. A value set most often contains terms from the same vocabulary system, but might include terms from multiple vocabulary systems.
Hierarchy: An arrangement of things into a tree structure, such that each item (except the top-most) has exactly one parent.
Poly-Hierarchy: An arrangement of things into a directed graph structure without loops, such that each item in the graph (except the top-most items) has at least one, and can have more than one, "parent". There may be multiple top-most items.
Classification: A vocabulary that describes all the different things that can happen, such that the description of any one thing fits into one and only one bucket.
Taxonomy: A hierarchical classification; a classification that also has a tree structure associated with it.
Terminology: Customarily, a terminology is a collection of vocabularies that can describe multiple related things.
System: A word appended to code, vocabulary, classification, taxonomy, terminology, or sometimes ontology. The word system implies a process or method associated with the work. When this word is used, it implies a degree of formality associated with the development of the work.
Ontology: In this context, an ontology is a classification system that describes not only a collection of things, but can also describe the parts of itself. Under this definition, SNOMED CT is an ontology, and so is LOINC (because some LOINC parts describe LOINC itself), but ICD is not.
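
For the software engineer, the Check Digit entry is probably the easiest of these to make concrete. What follows is a minimal sketch of a mod 10 check digit in its common Luhn form, applied to a code shaped like a LOINC code (digits, a hyphen, then one check digit). The function names are my own, and the assumption that the Luhn formulation matches LOINC's published procedure should be confirmed against the LOINC documentation before relying on it.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn (mod 10) check digit for a string of ASCII digits."""
    total = 0
    # Walk the payload right to left. The rightmost digit is doubled,
    # then every second digit after it.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    # The check digit brings the total up to a multiple of 10.
    return (10 - total % 10) % 10


def is_valid_code(code: str) -> bool:
    """Validate a code of the assumed form '<digits>-<check digit>'."""
    payload, sep, check = code.rpartition("-")
    if sep != "-":
        return False
    if not (payload.isascii() and payload.isdigit()):
        return False
    if not (check.isascii() and check.isdigit() and len(check) == 1):
        return False
    return luhn_check_digit(payload) == int(check)


if __name__ == "__main__":
    print(is_valid_code("2345-7"))  # well-formed code
    print(is_valid_code("2354-7"))  # two adjacent digits transposed
    print(is_valid_code("245-7"))   # one digit deleted
```

The check digit carries no meaning of its own. It only lets software reject most mistyped strings before anyone tries to look them up as a code.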

1 comment:

  1. Thanks, Keith.

    I'd like to add emphasis that these terms will be used by different people to mean different things, and that no clear definitions can exist because of the wide range in meanings to different people. So caveat emptor.

    In particular, I've been in discussions where Ontology can mean everything from roughly 'what exists' to 'a set of concepts and categories in a subject area or domain that shows their properties and the relations between them' (from Wikipedia) on through 'a _formal_ explicit description of concepts in a domain of discourse' (e.g. from the Protege tutorial), where 'formal' means families of description logics with very explicit axiomatic construction. ICD would probably even meet Wikipedia's first definition, but I don't think LOINC would meet the latter.

    Likewise, I would at least change 'Classification' (IMHO a verb) to 'Classification System' (to emphasize that this is a noun), and again note that a wide range of meanings may apply. I think the requirement of 'fits only one bucket' is more a restriction of ICD than it is about classification systems 'in general'. I'm not sure, but I think there is something in statistics that speaks of 'one-way categorization' (or something like that), which does make this kind of one-bucket restriction, and my understanding is that the statistical needs are the real driver. That said, in medical software, billing and statistical needs often add this restriction, and that often bleeds over into something that looks like a terminology, vocabulary, or ontology, but behaves very differently. At issue is that the codes do not represent the natural concept of their category label, but instead represent artificial categories with a lot of hidden meaning associated with drawing lines between two similar things. Thus, IMHO at least, these 'classification systems' should not get a 'simple mapping' to or from other terms the way that reference terms can. This is because the 'mapping' needs to properly account for all the other restrictions that are embedded in the way that items get assigned to categories (aka 'get classified') in that system. A simple mapping operation usually bypasses the other classification rules, and is thus invalid in that 'system'.

    I think the bottom line is that for the software engineer (or anyone, frankly), if the context isn't clear and someone is using a word like 'Ontology' and/or 'Classification' and you don't know what it means, it's very OK to ask. Never be afraid to ask 'What is xx?' or 'What do you mean by xx?'. But also expect an imprecise, long, and potentially incorrect or inconsistent answer. IMHO that's kind of the nature of the beast, because no one really understands this fully yet.