Friday, June 15, 2012

Simple Math - A Language for Expressions

One of the ongoing issues in Query Health and in HL7 in general is what language should be used in expressions.  Should it be JavaScript, GELLO, XPath or perhaps even an XML representation?  My answer is none of the above.

Most of the execution environments that Query Health can work in already support computation.  I don't want to introduce a new language interpreter unnecessarily.  I can imagine the discussion I'd have with a few data center folks explaining that I need them to modify their SQL environment to include a JavaScript interpreter, for example.  And GELLO, while it is a standard, lacks implementations.  Other representations have similar challenges.  Finally, the idea of using XML to represent simple mathematical expressions that people are already familiar with bugs me.  Yes, I do program in XML (XSLT to be specific).  I can tell you personally, that I hate the additional noise provided by that extra syntax.  It makes code really hard to read.

For Query Health, I can see a few cases where we need some fairly simple arithmetic computations.  To move beyond counting and support higher statistical functions we would need addition, subtraction, multiplication, division and exponentiation.  We probably don't need transcendental functions, but there are some operations for which we do need some simple functions over one or two variables.

What I really want is a simple mathematical expression language that I can easily implement in programming environments as diverse as SQL, XQuery, XSLT/XPath, C, C#, C++, Java, JavaScript, Perl, Ruby and any other programming language you can think of.  It needs to be simple enough that one could perform a series of search and replace operations to turn it into executable code.  Fortunately, most programming languages today share quite a bit of common syntax when it comes to expressions.

So, I decided to create a small language that I'm going to call Simple Math.  The point of Simple Math is to make it easy to write math expressions in a way that can be translated into a variety of execution languages.  Simple Math is NOT designed to be computed by itself (although it could be).  It is designed to be transformed into something that can be executed in a variety of programming environments.

From an implementation perspective, the following are my requirements for "Simple Math":
  1. Definition of variables that can be bound to an implementation specific object (a database table, or class instance).  
  2. Definitions of variables that can be bound to an implementation specific object containing fixed (constant) values.
  3. The ability to call a function to perform some operation.
  4. Simple Arithmetic (addition, subtraction, multiplication, division, exponentiation)
  5. Parentheses
  6. No side effects
  7. Easily translatable to executable code using simple search and replace operations.
Having worked in over a dozen programming languages, there are some common capabilities across most that could be simply reused.

Constants:
So far, I'm just doing math, so all I need for literals are the usual representations of numbers.  Simple Math has two basic types: integer and real.  Integers consist of one or more consecutive digits, and may be preceded by a - sign.  Real numbers are represented as integers (minimally a single digit, including 0) followed by an optional decimal point and decimal part (minimally .0) and an optional exponent part, introduced by the letter e (or E), with a required sign and a decimal number indicating the power of 10 by which the number is multiplied.
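
As a rough sketch, the literal rules above can be written as two regular expressions. This is one reading of the paragraph, not a normative grammar; the names INTEGER, REAL and classify_literal are made up for this illustration.

```python
import re

# One interpretation of the literal rules described above (an assumption,
# not an official Simple Math grammar).
INTEGER = re.compile(r"-?[0-9]+")
# Optional decimal part (at least one digit after the point) and optional
# exponent part with a REQUIRED sign.
REAL = re.compile(r"-?[0-9]+(?:\.[0-9]+)?(?:[eE][+-][0-9]+)?")


def classify_literal(text: str) -> str:
    """Return 'integer', 'real', or 'invalid' for a candidate literal."""
    if INTEGER.fullmatch(text):
        return "integer"
    if REAL.fullmatch(text):
        return "real"
    return "invalid"
```

Under this reading, 6.02e+23 is a real, while 1e10 (no sign on the exponent) and .5 (no leading digit) are rejected.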

Implementations must support at least IEEE single precision arithmetic on real numbers, and follow the IEEE rules for computation.

Operators:
The operators for addition +, subtraction -, multiplication *, and division /, as well as parentheses, are pretty commonly used.  Some languages have operators for exponentiation, and others use function calls.  The languages that have these operators also have a function call notation, so I won't try to choose an operator for exponentiation.

The integer division and modulus operators are available in some programming languages, but do not have a consistent syntax.  Since they can be supported by function calls, I'd skip picking operators for them.
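
To illustrate the function-call approach, here is a minimal sketch of div() and mod() as plain functions. The rounding rule is an assumption made for this example (rounding toward negative infinity); languages differ on this for negative operands, which is part of why a named function is easier to pin down than an operator.

```python
def div(a: int, b: int) -> int:
    """Integer division, ASSUMED here to round toward negative infinity."""
    return a // b


def mod(a: int, b: int) -> int:
    """Remainder consistent with div(): a == b * div(a, b) + mod(a, b)."""
    return a - b * div(a, b)
```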

Parentheses () are commonly understood when used to change operator precedence across almost all programming languages.

We don't need array operators for what I'm calling "simple computation".  I could be convinced otherwise, but I think we can skip this for most cases.  If we do get into arrays, we'd also need to understand what the index is for the first item, which varies by language.  Skipping arrays also avoids that mess.

We do, I think, need to have an operator for member access to an object, and most languages already support the . as the member access operator.  We don't need to get into differences between pointers, references and values, because as you recall, one of the requirements is to be side effect free.

I think we can skip the comparators, but again, I could possibly be convinced otherwise with a good use case for them.  They usually have a boolean result (but can be otherwise in the presence of exceptional values like NaN, Infinity or NULL).  And given that they have a boolean result, at least in the QH context, we can use existing boolean and range selection capabilities in the model to achieve the same result as comparison operators in an expression language.

Identifiers:
Identifiers for variables and constants are also not too tricky (until you get to SQL).  Most programming languages require an initial letter, which can be followed by any number of letters and digits.  In addition to the usual letters, most also allow an underscore or other punctuation characters in identifiers, but these characters can also have special meaning in other contexts.  For example, Java allows $ and _, while C allows only _.  XML allows the _ and the : in names, but : has special meaning in many cases (as a namespace delimiter for a qualified name).  In SQL, it gets a bit more complicated, because the _ can have special meaning at the beginning of a name, and case could be significant or not in the usual transformation.  It is possible to "quote" identifiers, so that case significance can be preserved, and it is also possible to create a regular expression to support translation of identifiers into quoted identifiers.
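
A minimal sketch of the regular-expression approach to quoting follows. It assumes a conservative identifier shape (a letter followed by letters, digits or underscores) and standard SQL double-quote quoting; the function name is hypothetical, and the sketch only handles variable references, so function names would need to be translated separately.

```python
import re

# ASSUMPTION: an identifier is a letter followed by letters, digits or
# underscores. The lookbehind keeps the 'e' of an exponent such as 2.5e+1
# from being mistaken for an identifier.
NAME = re.compile(r"(?<![0-9A-Za-z_])[A-Za-z][A-Za-z0-9_]*")


def quote_identifiers(expression: str) -> str:
    """Wrap every identifier in double quotes, as standard SQL allows."""
    return NAME.sub(lambda match: '"' + match.group(0) + '"', expression)
```

For example, patient.weight / 2.5e+1 becomes "patient"."weight" / 2.5e+1, which preserves the case of both names.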

One of the challenges of identifiers is avoiding reserved names in the various programming languages.  In SQL, if we "quote" the identifiers, then this is no longer a problem.  Looking across C, C++, C#, Java, JavaScript, Perl, Ruby, SQL, and VB I came up with a list of about 300 distinct keywords (ignoring case).  Then I realized that making a list of prohibited keywords would likely not work because some language that I hadn't considered wouldn't be supported.  The simple answer then is to ensure that there is some transformation of identifiers to a non-keyword form that works in general.  An example would be the arbitrary rule that no "simple math" identifier be permitted to end in a double underscore.  One could then append a double underscore to every identifier that matched a keyword in your chosen implementation language, and be sure that it would not collide with another identifier.
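
The double-underscore rule can be sketched in a few lines. The keyword set below is a deliberately tiny, hypothetical sample rather than a real list for any language, and escape_keyword is a name invented for this illustration.

```python
# Hypothetical, deliberately incomplete keyword sample for one target language.
SQL_KEYWORDS = frozenset({"select", "from", "where", "order", "group", "table"})


def escape_keyword(name: str, keywords: frozenset) -> str:
    """Append '__' to identifiers that collide with a target-language keyword."""
    if name.endswith("__"):
        # The rule reserves this suffix, so no source identifier may use it.
        raise ValueError("identifiers may not end in a double underscore: " + name)
    # Keywords are compared ignoring case, as in the survey described above.
    return name + "__" if name.lower() in keywords else name
```

Because no source identifier can end in a double underscore, an escaped name such as order__ cannot collide with another identifier in the same expression.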

Functions:
Function calls provide some of the more complicated arithmetic, including min/max functions, floor/ceiling/rounding, computing powers and logarithms, modular arithmetic, et cetera.  I like the JavaScript Math object as a basic starting point, but I think I'd limit min/max to the two-argument forms.  I don't know that we need the trigonometric functions for the kind of things we need to compute, but again, I could be convinced otherwise.  To that, I'd add div() and mod() functions to support integer division and modular arithmetic.

These basic math functions would be preceded by an identifier (Math) and a dot . to indicate that they come from the basic math library.
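
To make the search-and-replace idea concrete, here is a naive sketch that rewrites Math-prefixed calls into SQL-style function names. The target spellings are assumptions about one SQL dialect (they vary between products), and the table and function names are invented for this example.

```python
# ASSUMED target spellings for an illustrative SQL dialect; check the
# actual dialect before relying on any of these.
SQL_FUNCTIONS = {
    "Math.pow(": "POWER(",
    "Math.floor(": "FLOOR(",
    "Math.abs(": "ABS(",
    "Math.sqrt(": "SQRT(",
}


def translate_functions(expression: str, table: dict) -> str:
    """Apply each textual replacement in turn; no parsing is involved."""
    for source, target in table.items():
        expression = expression.replace(source, target)
    return expression
```

Given Math.pow(x, 2) + Math.floor(y / 3), this produces POWER(x, 2) + FLOOR(y / 3). Being purely textual, it would also rewrite an identifier that merely ends in Math, so a real translator would want to guard against that.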

There, now I have the basic idea behind Simple Math written down.  Now to standardize it.  Anyone looking to create a new standard?   I'm going to be speaking at OMG's Healthcare Information Day in Cambridge next week; maybe I can get them to take it on.

17 comments:

  1. So you are going to create a new standard, that will have no implementations, because GELLO does not have implementations??? GELLO does in fact have implementations and this makes no sense.

    Replies
    1. I think you missed the point Andrew.
      1. I said GELLO implementations are lacking, not that they aren't available. The other challenge is that GELLO is also more difficult to implement than JavaScript in many environments.
      2. This would be an expression language that could be implemented in any programming environment, not one that requires an "interpreter" to implement, but rather a translator to the environment -- so, if your environment is GELLO, you could translate to that as well.

    2. GELLO is not quite as hard to implement as OCL, but what it does deliver is a language more tuned to Health IT. Perhaps the main reason for the lack of implementations is that Health IT doesn't draw the best of industry minds (as in computer science) that it should.

      I would certainly challenge the statement that GELLO is more difficult to implement than JavaScript. They are entirely different beasts. JS has weak typing and is a functional language with many challenges, some of those being efficient execution and garbage collection. The weak typing of JS should ring alarm bells for Health IT specialists where the emphasis should be on the accuracy/rigor of what is being evaluated, not on ubiquitous ease of use.

  2. No, I get the point, but we have a standard expression language, i.e. GELLO, that could be translated to other languages. Why would yet another standard be any better? If you just want math (but requirements always creep) then just use a subset of GELLO. Whatever the language, you will need to translate it to the target language, and this is what a compiler does. You require a compiler/interpreter for your proposal in exactly the same way.

    Replies
    1. If it's done right, not only is it a subset of GELLO, but also JavaScript, C++ and other languages. That's what makes it an interesting (and perhaps better) idea. And while the subset technically requires a Compiler, it would need neither a lexical analyzer, nor a complex grammar in order to implement the end result in a target system.

      So, if you were targeting GELLO, it would work, but it would also work if you were targeting JavaScript, C++ or SQL, and you wouldn't have to deal with the complexity of integrating a compiler or interpreter for another language, just a few well targeted line edits.

      Oh, and software engineers would understand it, because it's already a subset of a language they know.

    2. For the simple case it could be so, but alas the simple case is not enough and your requirements are not simple as they talk about interfacing existing data/objects and before you know it you need control structures and set functions etc etc

      GELLO simplifies it when you combine it with ISO datatypes as you have high level functionality to make writing the logic easier, like, for example, the implies function on coded values, which hides the complexity of calling terminology services.

      A simple GELLO subset for simple use cases would at least align it with an existing standard. I suspect it would end up very simple and you probably need real use cases to see what is required, as when we have implemented GELLO against complex systems with real requirements we usually end up exercising a fair portion of the language, especially when terminology is involved. I suspect GELLO started with exactly the same ideas you have, but the requirements tend to grow when the rubber meets the road. One of the issues in using an existing language is that they allow people to do too much, e.g. open a socket and pump out confidential data, and GELLO was designed as read only to guard against this.

    3. Ah, but looking at the specific use cases I'm working with, I don't need control structures, nor set functions. I'm certainly willing to work with a subset of GELLO, but you've yet to convince me that I can solve the problem by using GELLO as a whole. It solves a much bigger problem than I want to address.

      I could use GELLO, JavaScript or another language to deal with the challenge, and it would work (we've already shown that in QH). The problem isn't one of finding a solution for one platform, but rather finding a solution that fits multiple ones.

      The whole point of Simple Math is to allow HQMF to be transformed into a program that can be executed. One of the most challenging platforms is SQL. I want to be able to select and produce results which can compute with objects selected by data criteria. That means being able to put these "Simple Math" expressions in places where other expressions are allowed in SQL, in creating a View, a SELECT statement, or similar condition. Then I want the same kind of facility to work in XQuery. And in Java, C++ and .Net; and Perl, Ruby and JavaScript.

      I cannot see how to fit all of GELLO into that. Or JavaScript. Or any "programming" language that exists.

      What I want is to define a subset of the programming languages (and deployment environments) that do exist so that expressions in HQMF can be used directly without much change, and certainly without having to include something as cumbersome as a parser and lexical analyzer.

    4. GELLO is designed to do that and gives you the same ability as SQL to select records from collections. Translation of an expression in one language into another language (if the other language is machine code, it's executable) is what a compiler does, and you need a tool suitable for the task. Anything else will be a hack that only works for simple cases. Unless you use a lexer/parser you will not have a proper solution, as the problem you describe was solved a long time ago and requires these tools. Hand coding transforms will fall over in all but the simplest cases. An XML language to describe the transform becomes very messy, and that's what a grammar does well.

      You need an object model of the target of the transform so you can reference the values you want to transform and any language has to be able to reference this model. In GELLO you plug that model into the context statement and write the expression. Any other language would need to replicate this functionality.

  3. Keith, have a look at Xpath. I also hate XML syntax, but Xpath includes paths to reference data items, and all the basic first order predicate logic you need, as well as arithmetic and relational operators.

    Replies
    1. It's close, but again, too much. A subset of XPath would work well, but I'd also like that same subset to be a subset of several other languages too.

  4. One approach might be to implement Simple Math as a suite of command-line utilities/shell scripts and invoke these directly from each environment. I realise that you would like complete platform-independence, but I'm not sure if that's actually achievable as you'd need a compiler for each O/S anyway. With shell scripting, you would also remove the DBA Gatekeeper problem.

    As you intimate, SQL might be the show-stopper here; the best you will be able to do from within that environment is to hope that there are RDBMS objects that enable you to invoke command line utilities (e.g. the xp_cmdshell stored procedure in SQL Server).

    Actually, all this just makes me grateful to spend most of my development time these days in an environment where I can use the same dialect to query relational data, XML and application objects and execution still takes place in the most appropriate process.

    Replies
    1. What is that environment? (I'm reminded of .NET 3.5's LINQ.)

    2. Yes, it's LINQ. Just a shame, for many, that it's a Microsoft-only technology; otherwise it would be worth creating domain-specific extensions (LINQ to CDA, LINQ to Archetypes, LINQ to FHIR, etc) that would make other methods of querying health data obsolete.

      ...but, I digress. I'm still stuck with external shell or API (Calculator) calls as a means of implementing Simple Math; language reduction is a hard concept to grasp!

  5. Not sure if you saw this but I'll repost it here for reference:

    http://wiki.siframework.org/JavaScript+Execution+Environments

    I think starting from scratch with a whole new language is the wrong approach; it is much better to start from an existing language that is already fully specified (syntax, precedence rules etc) and subset that. JavaScript is everywhere these days and is the obvious candidate for a starting point: the C/Java-like syntax is familiar to developers and tools (many of them open source) abound for both executing and parsing the language (see link above). Adopting JavaScript as the starting point for a subset would offer a leg-up to implementers rather than introducing an additional hurdle.

    Finally I'd note that a subset of JavaScript would be just as amenable to transformation as something new, arguably more so due to the existence of tools to help.

    Replies
    1. I tend to agree here. I think the focus should be on utilizing existing resources and languages as much as possible. The constant moving of the goal posts in the E-health industry is not the best utilization of resources. It's unbelievable the amount of wheel reinventing that goes on.

    2. The point of Simple Math is to create a subset of expressions that work in Java (and its relations), C (and its relations), SQL, Perl, Ruby, et cetera.
