Saturday, December 8, 2018

What's an Error?

It's not me, it's you This question came up for me a while back building my first FHIR Server, and it comes up again now.  We're all fairly familiar with exceptions by this point, and logging.

Whether you use log4j2, lockback, slf4j, Java's native logging, C#/.NET logging, et cetera, we all know that there are about 5-7 different levels of logging, from INFO, to WARN, to ERROR and beyond for engineers digging into the code.

There are three kinds of problems depending on the root cause that can result in an exception.  And you probably need to log that exception.  The first two of these problems is an exception that occurs because, well, somebody screwed up.  The last class of problems aren't really problems because you can (and often do) recover from them if you've developed a robust system, either by trying again, or trying a different way.  In these cases, your exceptions are really just that, exceptional cases that need special attention, but the fact that you got the exception was in fact expected.

Even HL7 Version 2 acknowledges that there are two kinds of errors that a message processor can negatively acknowledge (NAK); problems on the application end (AE - Application Errors ), and problems on the other end of the pipe (AR - Application Reject).  The former are problems that the receiving application encounters in trying to process the content that aren't (at least at first inspection) caused by the sender, and the sender might try again later, the latter are problems where the sender shouldn't bother trying again with until they've fixed the problem on their end.

Similarly, in the RESTful / HTTP world, there are 4XX series errors (your input sucks buddy, fix it), and 5XX errors (something's wrong on my end here, sorry for the hassle).

Distinguishing between the two is also challenging, because folks don't always take proper care with error reporting.  Even good frameworks (like HAPI on FHIR), report 5XX series errors, when in fact, the problem is NOT with the server, but with the sender (as happens when you try to call the server with a query parameter that it doesn't support).

And then, how should these be reported in logs?  Should you use logger.warn() or logger.error()? [By no means should you use logger.info(), that's just plain wrong, but I've seen enough of it).

I've got some thoughts on this:
If the "error" occurs because of receiver inputs not meeting requirements of the receiver, and the receiver a) detects it, and b) reports it, this is NORMAL operation of a well running system.  These kinds of issues shouldn't be reported using logger.error().  BUT, the maintainer of that well-running system will want to know that these problems are occurring at the time of detection, and so they should be reported using logger.warn().  In some cases, a high number of warnings occurring at the very least indicates that something is broken downstream that someone (often not the maintainer) should be looking into because things could be done better.  At the very least, this gives the system maintainer an opportunity to help their customer do that (and perhaps reduce burden on the system as well).

If, on the other hand, the problem was detected, and it was something the service expected to succeed but didn't, then it should use logger.error() (at least).  Which brings me to my next question, how do you decide between logger.error() and logger.fatal().  Well, this is kind of like the difference between the engine light on your car, and (should you have one), the oil lamp.  The former reports an error that you can probably ignore for a bit, the latter reports a condition that likely needs immediate attention (or you could destroy your engine).  A fatal error is one that indicates that continued proper operation of the system is at risk, possibly even extreme risk.  Whereas, an error is one that says, hey, under these circumstances, X doesn't work.  But Y and Z might very well continue to work (I made it home 30 miles with a cylinder misfire the other day, but my daughter barely made it 5 miles with no oil in the car).  So, even though cylinder 5 was down, 1-4 and 6 were working well enough that I didn't need to two the vehicle 30 miles to the service station.  I could get it home and drop it off on Monday.  But yeah, that oil lamp left the car dead on the side of the road, and in need of a tow.  So, if the problem is likely to leave a single transaction uncompleted, its an error.  But if it means every transaction for the next hour is going to fail, it's definitely in the fatal category.

While when you log it for your own use, you differentiate, as far as the end user is concerned, it's still an error when you report it back to them.  But HTTP error response codes come to the rescue in identifying where the ID10T is for the error.  So, when you report Errors (on your end) to the end user, you should probably be using 5XX series HTTP response codes, but when you report errors on their end, you should probably be using 4XX series response codes.

Now, what do you do about the last of the three, where something went wrong, but you managed to recover ... say the database connection was lost, but you got another one solidly.  This again goes into warning territory.  In this case, the warning might even go to another log file, because these warnings are things that are causing your system grief.  Bad inputs that you detect aren't really causing you a ton of problems, but recovery from another system's failure almost surely is.  Just think about all the work you had to do just to get retry logic built in the first place.

You might think differently, but well, that's you, and this is me...

     Keith

0 comments:

Post a Comment