Wednesday, June 5, 2019

Best practices for Logging and Reporting errors in FHIR

Over the years I've developed a number of micro-services implementing and using FHIR APIs.  I've developed a number of best practices for logging and reporting on errors that occur.  Some of these follow.


Logging


  1. If a call to your API is not validly formed, log this as a warning in your service's log.  You detected an error in user input, and handled it properly.  This is NOT an error in your application, it is an error in the calling application.  You DO want to WARN someone that the calling application isn't calling your application correctly.  You don't want to alarm them that your application isn't working right, because in fact, it is working just fine.
  2. If something happened in a downstream API call that prevents the proper functioning of your application (e.g., a database read error), this is improper operation of the system, and is an ERROR preventing your service from operating (even though there's nothing wrong in the service itself), and should be logged as such.  
  3. IF you implement retry logic, then:
    1. Log as warnings any operation that failed but finally succeeded through retry logic.
    2. Log as errors any operation that failed even after retrying.
  4. If an exception was the cause of an error, consider:
    1. If you KNOW the root cause (a value is malformed), say so in the log message, but don't report the stack trace. This will cut unneeded information from your logs, which you will be thankful for later. For example:
      try {
         int value = Integer.parse(fooQueryParameter.getValue());
      } catch (NumberFormatException nfex) {
         LOGGER.warn(
            "Foo query parameter ({}) must be a number.",
            fooQueryParameter.getValue());
      }
    2. If you don't know why the error occurred (there could be multiple reasons), do report the stack trace in the log:
      try (PreparedStatement st = con.prepareStatement(query)) {
         ResultSet result = st.execute();
      } catch (SQLException jex) {
         LOGGER.warn("Unexepected SQL Exception executing {}",
            query, jex);
         throw new InternalErrorException(...);
      }
    3. Consider pruning the stack trace at the top or bottom.  From the bottom because you know your entry points, infrastructure before that probably isn't that useful to you (e.g., tomcat, wildfly).  From the top because details after your code made the call that threw the exception isn't necessarily something you can deal with.  
  5. DO report the query used (and where possible, parameter values in the query) in the log. Consider also reporting the database name when using multiple databases. I have often seen database exceptions like "parameter 1 has invalid type" with no query included, and no values.
  6. Consider how you might implement retry logic in cases of certain kinds of exceptions (e.g., database connection errors).
  7. Use delimiters in your logging output format to make it easier to read them in other tools (e.g., spreadsheets).  I often use tab delimiters between the different items in my logging configuration: e.g.,
    %d{yyyy-MM-dd HH:mm:ss.SSS}\t[%thread]\t%-5level\t%logger{36}\t- %msg%n
  8. Consider reporting times in the log in a timezone that makes sense for your implementation (and more importantly, to your customer).  When your customer reports they had a problem at 9:33am, you want to be able to find issues at that time in the logs without having to compute offsets (e.g., from GMT ... do you know yours).

Reporting Errors in OperationOutcome

  1. Use 400 series errors like 400 or 422 when the fault is on the part of the client (e.g., invalid operation syntax, or unsupported query combination).  NOTE: HAPI on FHIR will report unsupported query combinations as a 500 series error (which I fix using an interceptor or filter).
  2. Use 500 series errors like 500, 503 or 507 when the fault is on the part of your service.
  3. DO tell the calling user what the problem is in easy to understand language, and if possible, include corrective action they can perform to address the issued.
  4. DO NOT include content from Exception.getMessage().
    I sometimes see this:
    catch (Exception e) {
      outcome.addIssue()
        .setSeverity(IssueSeverity.ERROR)
        .setDiagnostics(e.getMessage());
      throw ...
    }
    This is not good behavior.  You often have no clue what is in e.getMessage(), and often no control.  It can leak information about your technology implementation back to the API user, which can expose vulnerabilities (see below).
  5. DO NOT include the stack trace in the OperationOutcome.  This belongs in your logs, but not in the user response.  See the OWASP Error Handling cheat sheet.
  6. For database errors, you might want report the kind of database (e.g., patient chart, provider list), but not the exact name of the database.  Again, you want to be clear, but avoid leaking implementation details.

Use Error Codes

Finally, consider creating error codes (which can be reported to the user).  Report the error code WITH the human readable message.  The value of unique error codes is that:
  1. The error code does tell you where in your code the error occurred, but doesn't expose implementation details.
  2. Error codes can be associated with messages in ways that enable translation to multiple languages
  3. Error codes can also be associated with actions that users can take to correct the error (if it is on their part), or which your operations staff can take to either further diagnose OR correct the error.  For example:  DB001: Cannot access Provider database.
    Then, in your operations guide, you can say things like: "DB001: This message indicates a failure to connect to the Provider database".  Verify that the database services are up and running for the provider database for the customer site ...


0 comments:

Post a Comment