Healthcare Standards: Simple Math - A Language for Expressions

Friday, June 15, 2012

Simple Math - A Language for Expressions

One of the ongoing issues in Query Health and in HL7 in general is what language should be used in expressions. Should it be JavaScript, GELLO, XPath, or perhaps even an XML representation? My answer is none of the above.

Most of the execution environments that Query Health can work in already support computation. I don't want to introduce a new language interpreter unnecessarily. I can imagine the discussion I'd have with a few data center folks explaining that I need them to modify their SQL environment to include a JavaScript interpreter, for example. And GELLO, while it is a standard, lacks implementations. Other representations have similar challenges. Finally, the idea of using XML to represent simple mathematical expressions that people are already familiar with bugs me. Yes, I do program in XML (XSLT to be specific). I can tell you personally that I hate the additional noise that extra syntax adds. It makes code really hard to read.

For Query Health, I can see a few cases where we need some fairly simple arithmetic computations. To move beyond counting and support higher statistical functions we would need addition, subtraction, multiplication, division and exponentiation. We probably don't need transcendental functions, but there are some operations for which we do need some simple functions over one or two variables.

What I really want is a simple mathematical expression language that I can easily implement in programming environments as diverse as SQL, XQuery, XSLT/XPath, C, C#, C++, Java, JavaScript, Perl, Ruby and any other programming language you can think of. It needs to be simple enough that one could perform a series of search and replace operations to turn it into executable code. Fortunately, most programming languages today share quite a bit of common syntax when it comes to expressions.

So I decided to create a small language, which I'm going to call Simple Math. The point of Simple Math is to make it easy to write math expressions in a way that can be translated into a variety of execution languages. Simple Math is NOT designed to be computed by itself (although it could be). It is designed to be transformed into something that can be executed in a variety of programming environments.

From an implementation perspective, the following are my requirements for "Simple Math":

Definition of variables that can be bound to an implementation specific object (a database table, or class instance).
Definitions of variables that can be bound to an implementation specific object containing fixed (constant) values.
The ability to call a function to perform some operation.
Simple Arithmetic (addition, subtraction, multiplication, division, exponentiation)
Parentheses
No side effects
Easily translatable to executable code using simple search and replace operations.

Having worked in over a dozen programming languages, there are some common capabilities across most that could be simply reused.

Constants:
So far, I'm just doing math, so all I need for literals are the usual representations of numbers. There are two basic types in Simple Math: integer and real. An integer is one or more consecutive digits, optionally preceded by a - sign. A real number is an integer (minimally a single digit, including 0) followed by an optional decimal point and decimal part (minimally .0) and an optional exponent part, which is introduced by the letter e (or E) and consists of a required sign and a decimal number giving the power of 10 by which the number is multiplied.
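The literal rules above can be sketched as two regular expressions. This is only an illustration under a few assumptions that the text does not settle: a real must carry a decimal part, an exponent, or both (otherwise it would be indistinguishable from an integer), and the exponent is a whole number. The name classify_literal is hypothetical.

```python
import re

# Integer: optional minus sign, then one or more digits.
INTEGER = re.compile(r"-?[0-9]+")

# Real (assumption): an integer followed by a decimal part, an exponent part,
# or both. The exponent requires an explicit sign, as described in the text.
REAL = re.compile(r"-?[0-9]+(?:\.[0-9]+(?:[eE][+-][0-9]+)?|[eE][+-][0-9]+)")


def classify_literal(text: str) -> str:
    """Return 'integer', 'real', or 'invalid' for a candidate literal."""
    if INTEGER.fullmatch(text):
        return "integer"
    if REAL.fullmatch(text):
        return "real"
    return "invalid"
```

Under these assumptions, 42 and -7 are integers, 3.0 and 1.5E-3 are reals, and 1e10 is rejected because the exponent has no sign.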

Implementations must support at least IEEE single precision arithmetic on real numbers, and follow the IEEE rules for computation.

Operators:
The operators for addition +, subtraction -, multiplication *, and division /, as well as parentheses, are pretty commonly used. Some languages have operators for exponentiation, and others use function calls. The languages that have these operators also have a function call notation, so I won't try to choose an operator for exponentiation.

The integer division and modulus operators are available in some programming languages, but do not have a consistent syntax. Since they can be supported by function calls, I'd skip picking operators for them.

Parentheses () are commonly understood when used to change operator precedence across almost all programming languages.

We don't need array operators for what I'm calling "simple computation". I could be convinced otherwise, but I think we can skip this for most cases. If we do get into arrays, we'd also need to understand what the index is for the first item, which varies by language. Skipping arrays also avoids that mess.

We do, I think, need to have an operator for member access to an object, and most languages already support the . as the member access operator. We don't need to get into differences between pointers, references and values, because as you recall, one of the requirements is to be side effect free.

I think we can skip the comparators, but again, I could possibly be convinced otherwise with a good use case for them. They usually have a boolean result (but can be otherwise in the presence of exceptional values like NaN, Infinity or NULL). And given that they have a boolean result, at least in the QH context, we can use existing boolean and range selection capabilities in the model to achieve the same result as comparison operators in an expression language.
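One way to check that an expression stays within the operators described above is to walk a parse tree and reject everything else. The sketch below borrows Python's own parser (the ast module) as a stand-in for a real Simple Math grammar. That is an assumption of convenience: Python accepts some literal forms (such as 0x10 or 1_000) that Simple Math does not define. The function name is_simple_math is hypothetical.

```python
import ast

# Only the four arithmetic operators and unary minus are accepted.
ALLOWED_BINARY = (ast.Add, ast.Sub, ast.Mult, ast.Div)
ALLOWED_UNARY = (ast.USub,)


def _is_reference(node: ast.AST) -> bool:
    """True for a plain name or a dotted member access such as a.b.c."""
    if isinstance(node, ast.Name):
        return True
    if isinstance(node, ast.Attribute):
        return _is_reference(node.value)
    return False


def _allowed(node: ast.AST) -> bool:
    if isinstance(node, ast.Constant):
        # bool is a subclass of int in Python, so compare exact types.
        return type(node.value) in (int, float)
    if isinstance(node, (ast.Name, ast.Attribute)):
        return _is_reference(node)
    if isinstance(node, ast.BinOp):
        return (isinstance(node.op, ALLOWED_BINARY)
                and _allowed(node.left) and _allowed(node.right))
    if isinstance(node, ast.UnaryOp):
        return isinstance(node.op, ALLOWED_UNARY) and _allowed(node.operand)
    if isinstance(node, ast.Call):
        # Positional arguments only; the callee must be a name or member.
        return (not node.keywords and _is_reference(node.func)
                and all(_allowed(arg) for arg in node.args))
    # Comparisons, subscripts (arrays), ** and everything else are rejected.
    return False


def is_simple_math(expression: str) -> bool:
    """Report whether the expression uses only the constructs listed above."""
    try:
        tree = ast.parse(expression, mode="eval")
    except (SyntaxError, ValueError):
        return False
    return _allowed(tree.body)
```

With this check, (a.weight + 2.5) * Math.pow(b.height, 2) is accepted, while a < b, a[0] + 1, and 2 ** 3 are rejected, which matches the choices to skip comparators, arrays, and an exponentiation operator.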

Identifiers
Identifiers for variables and constants are also not too tricky (until you get to SQL). Most programming languages require an initial letter, which can be followed by any number of letters and digits. In addition to the usual letters, most also allow an underscore or other punctuation characters in identifiers, but these characters can also have special meaning in other contexts. For example, Java allows $ and _, C only allows _. XML allows the _ and the : in names, but : has special meaning in many cases (as a namespace delimiter for a qualified name). In SQL, it gets a bit more complicated, because the _ can have special meaning at the beginning of a name, and case could be significant or not in the usual transformation. It is possible to "quote" identifiers, so that case significance can be preserved, and it is also possible to create a regular expression to support translation of identifiers into quoted identifiers.

One of the challenges of identifiers is avoiding reserved names in the various programming languages. In SQL, if we "quote" the identifiers, then this is no longer a problem. Looking across C, C++, C#, Java, JavaScript, Perl, Ruby, SQL, and VB I came up with a list of about 300 distinct keywords (ignoring case). Then I realized that making a list of prohibited keywords would likely not work because some language that I hadn't considered wouldn't be supported. The simple answer then is to ensure that there is some transformation of identifiers to a non-keyword form that works in general. An example would be the arbitrary rule that no "simple math" identifier be permitted to end in a double underscore. One could then append a double underscore to every identifier that matched a keyword in your chosen implementation language, and be sure that it would not collide with another identifier.
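The double-underscore rule can be sketched in a few lines. This is an illustration rather than a definitive implementation: the identifier pattern (a letter followed by letters, digits, or underscores), the case-insensitive keyword match, and the names escape_identifier and PYTHON_RESERVED are all assumptions made for the example, with Python standing in as the target language.

```python
import keyword
import re

# Assumed identifier shape: a letter, then letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Reserved words of the chosen target language (Python here), lower-cased so
# the comparison can ignore case.
PYTHON_RESERVED = frozenset(word.lower() for word in keyword.kwlist)


def escape_identifier(name, reserved):
    """Translate a Simple Math identifier into one that is safe in the target."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a valid identifier: {name!r}")
    if name.endswith("__"):
        # The rule that makes escaping collision-free: no source identifier
        # may end in a double underscore.
        raise ValueError(f"identifier may not end in '__': {name!r}")
    if name.lower() in reserved:
        return name + "__"
    return name
```

Because no source identifier can end in a double underscore, an escaped keyword such as class__ can never collide with an identifier that the expression author wrote.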

Functions
Function calls provide some of the more complicated arithmetic, including min/max functions, floor/ceiling/rounding, computing powers and logarithms, modular arithmetic, et cetera. I like the JavaScript Math object as a basic starting point, but I think I'd limit min/max to the two-argument forms. I don't know that we need the trigonometric functions for the kind of things we need to compute, but again, I could be convinced otherwise. To that, I'd add div() and mod() functions to support integer division and modulo arithmetic.

These basic math functions would be preceded by an identifier (Math) and a dot . to indicate that they come from the basic math library.
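Because every library function carries the Math. prefix, translating calls into a target environment can be close to a search and replace. The sketch below rewrites a handful of calls into SQL-style function names. The mapping table and the name to_sql are hypothetical, and SQL function names vary by dialect, so a real table would need to be checked against the specific database.

```python
import re

# Hypothetical mapping from Simple Math library functions to SQL functions.
# These names are common but not universal across SQL dialects.
SQL_FUNCTIONS = {
    "pow": "POWER",
    "floor": "FLOOR",
    "abs": "ABS",
    "sqrt": "SQRT",
}

# Match "Math.<name>(" only when Math is not part of a longer identifier
# or a member of another object.
_MATH_CALL = re.compile(r"(?<![\w.])Math\.([A-Za-z][A-Za-z0-9]*)\s*\(")


def to_sql(expression: str) -> str:
    """Replace Math.<name>( with the mapped SQL function name."""
    def replace(match):
        name = match.group(1)
        if name not in SQL_FUNCTIONS:
            raise KeyError(f"no SQL translation for Math.{name}")
        return SQL_FUNCTIONS[name] + "("
    return _MATH_CALL.sub(replace, expression)
```

For example, Math.pow(t.height, 2) + Math.floor(t.weight) becomes POWER(t.height, 2) + FLOOR(t.weight), leaving the arithmetic operators and member access untouched.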

There, now I have the basic idea behind Simple Math written down. Now to standardize it. Anyone looking to create a new standard? I'm going to be speaking at OMG's Healthcare Information Day in Cambridge next week; maybe I can get them to take it on.

17 comments:

Andrew McIntyre, June 15, 2012 at 11:29 PM
So you are going to create a new standard, that will have no implementations, because GELLO does not have implementations??? GELLO does in fact have implementations and this makes no sense.
Andrew McIntyre, June 16, 2012 at 10:45 PM
No, I get the point, but we have a standard expression language, i.e. GELLO, that could be translated to other languages. Why would yet another standard be any better? If you just want math (but requirements always creep) then just use a subset of GELLO. Whatever the language, you will need to translate it to the target language, and this is what a compiler does. You require a compiler/interpreter for your proposal in exactly the same way.
thomasbeale, June 17, 2012 at 3:26 PM
Keith, have a look at XPath. I also hate XML syntax, but XPath includes paths to reference data items, and all the basic first order predicate logic you need, as well as arithmetic and relational operators.
Peter Jordan, June 18, 2012 at 5:11 AM
One approach might be to implement Simple Math as a suite of command-line utilities/shell scripts and invoke these directly from each environment. I realise that you would like complete platform-independence, but I'm not sure if that's actually achievable as you'd need a compiler for each O/S anyway. With shell scripting, you would also remove the DBA Gatekeeper problem.

As you intimate, SQL might be the show-stopper here; the best you will be able to do from within that environment is to hope that there are RDBMS objects that enable you to invoke command line utilities (e.g. the xp_cmdshell stored procedure in SQL Server).

Actually, all this just makes me grateful to spend most of my development time these days in an environment where I can use the same dialect to query relational data, XML and application objects and execution still takes place in the most appropriate process.
Marc Hadley, June 20, 2012 at 9:57 AM
Not sure if you saw this but I'll repost it here for reference:

http://wiki.siframework.org/JavaScript+Execution+Environments

I think starting from scratch with a whole new language is the wrong approach; it is much better to start from an existing language that is already fully specified (syntax, precedence rules etc) and subset that. JavaScript is everywhere these days and is the obvious candidate for a starting point: the C/Java-like syntax is familiar to developers and tools (many of them open source) abound for both executing and parsing the language (see link above). Adopting JavaScript as the starting point for a subset would offer a leg-up to implementers rather than introducing an additional hurdle.

Finally I'd note that a subset of JavaScript would be just as amenable to transformation as something new, arguably more so due to the existence of tools to help.
