Sunday, October 20, 2024

A minor performance improvement

I work on a spring-boot application using embedded Tomcat (several in fact).  A rewrite in the last year of this application tripled the load capacity (that is NOT what this blog post is about).  That load capacity well more than needed to handle the entire load of the system with a single instance, but good systems grow, and there is also the need for excess surge capacity.

Pushing beyond that 3X limit though, the application was having failures under more stressful loads.  Since I was well over the system design requirements (we run multiple instances of the application for redundancy), but it bugged me.  I had a few extra hours of unscheduled time to play with this, so I started a side project to finally track it down.

Shortly into the fourth load test (the one that goes just beyond that 3X load mark), the application is restarted by AWS after failing 3 consecutive health checks.  I'm supposed to get a log message about this shutdown, but don't.  That turns out to be a signal propagation issue combined with a timing issue (my logger is running in a small separate background process).

The key problem wasn't a memory issue, threading issue, or a CPU load issue.  Heap memory was 1.4Gb, well under the 8Gb max, 4Gb allocated heap size.  Threads were running around 40-50 threads, well below the worker threads limit.  CPU load at application startup might hit 50%, but after that was well below 20% at the time of the health check failure. Even the number of in-use TCP connections was well below any server-imposed limits. 

As it turned out, the issue was not about the number of in-use TCP connections, but rather on the number of available TCP connections.  The root cause was keep-alives in both the application and transport layers.  HTTP 1.1 (the application protocol) supports the keep-alive header, which allows a single connection to be used for multiple requests, saving startup time.  HTTP keep-alive prevents a socket from being closed after the response is sent back, so that it can be reused.  Lower on the network stack, TCP also supports keep-alive, which has a slightly different meaning. This ensures that an open socket that is unused for a while still remains open on both sides (by having the client or server send keep-alive packets).  

The server I'm working on has some long running requests that can take up to 10 minutes before a response is provided (the server is an intermediary talking to other, third-party services).  So long running TCP keep-alives are essential for correct operation, otherwise the poor client drops the connection before ever getting a response. AWS load balancers have a default setting for TCP of 350 seconds for TCP keep-alive, but this can be changed.  I recently had to adjust those settings to support a 10-minute request time (though that wasn't the root cause, the default setting of 350 was bad enough).

What it boiled down to was that I'd run out of available connections, not because connections were "active", but because they were being kept-alive by a combination of Application layer (HTTP) and Transport layer (TCP) configuration.  I could not rely on the HTTP client NOT using application layer keep-alives, and the system needed TCP keep-alives for one of its long-running services.  Given an actual request takes 10s to complete in worst case time, having a socket be unavailable for use for 5 minutes or longer is just too much.  That's 30 times as many connections unavailable than actually necessary.

That system has a security requirement that every request/response to the server goes through TLS negotiation, instead of reusing existing connections.  There was never ANY NEED or use for HTTP keep-alive. However, embedded tomcat supports these by default in spring boot.  The simple solution was to set server.tomcat.max-keep-alive-requests = 1.  

Initial testing confirmed that resolved the capacity problem.  Continued tested shows another 10X improvement in capacity, for a combined 30X load capacity gain with CPU and memory still under 50% utilization. As a result of these efforts, I can cut CPU costs by a good bit in production by reducing the number of redundant tasks and lower memory requirements by 50%.

My next minor annoyance (side project) is task startup time.

     Keith

P.S. I don't claim to be a network engineer, or a crypto expert, but given the amount of time I spend on networking and crypto in my day job, I might as well claim those skills.


Friday, October 18, 2024

Bouncy Castle FIPS (BCFIPS) 2.0 Upgrade

I've mentioned Bouncy Castle a few times in this blog over the past year.

The easiest major version upgrade yet I've ever had to execute was upgrading from BC-FIPS 1.X to 2.X.  New in Bouncy Castle 2.0 is certification under FIPS 140-3 instead of FIPS 140-2 (all new certifications follow NIST 140-3 requirements).  It also includes support for Java 21 as well as Java 17 and prior releases.  You can find the NIST Certificate details here: 4743

Really, all I needed to do was update my pom.xml files.  Smoothest major upgrade ever.

Well, technically, I did have to do a couple of other things.

1. Download bc-fips-2.0.0.jar into my project so that I could use it in local calls to Java's KeyTool (I have to convert a jks store to bcfks format in my build process.

2. Add the jar files to my Docker Image.  BC-FIPS (at in 1.x versions) cannot be rolled up into an Uber-Jar for Spring Boot given changes in the way that jar url handling happens.  This is because the module validation code in BC-FIPS has to be able to access the class data in the JAR file.

These are the file versions you need to change.

Old                                    New
bc-fips-1.0.2.X.jar    bc-fips-2.0.0.jar
bcpkix-fips-1.0.7.jar  bcpkix-fips-2.0.7.jar
bctls-fips-1.0.19.jar  bctls-fips-2.0.19.jar

    Keith


Wednesday, September 11, 2024

Python Requests Timing Out

 Somehow it seems that every edge case for TLS or TCP communications winds up in my queue.  I don't do much Python development personally (almost nothing in fact).  That's a language that fits into the I can read or debug it category, rather than the speak fluently and know the lexicon.  So, when and end user had a problem that they couldn't resolve in Python it took me quite some time to figure it out.

In a nutshell, the problem was that uploads of a file and data using POST and MIME multipart/form-data in Python weren't working for files over a particular size but were for smaller files.  I couldn't reproduce the problem on my local system, or in my development environment, so this was very troubling.  We'd tested this with files of the size this user provides, and so were pretty certain it would work, but in actual practice, they couldn't make it work.

We'd had others upload using other platforms, Java, C#, bash scripts using CURL, and browser-based forms, so we felt that there was nothing specifically wrong in the implementation and seen nothing like this problem.

Symptomatically, we'd see the entire request (which took a while to upload), and it would be processed (which also took a while), and we also had a log of the outbound response, so we knew a 200 OK success response had been sent back, but the user wasn't getting it.  The read was timing out.  This very much looked to me like a firewall problem, and we had the networking folks check it out, but clearly the firewall on the service wasn't closing the socket prematurely.  The request takes about 30 minutes to upload and process, so not a great candidate for debugging via Wireshark, and as I said, the problem did not occur in my development or local environments.

I finally tracked this down yesterday.  It really was a networking issue, but not in the firewall.  Instead, it's a problem in the network gateway.  In the AWS environment, if the network gateway does not have activity on a connection after 350 seconds, the connection is dropped.  Large file processing takes time on the order of file size, and the large files the user was sending took about 10 minutes to process.  While sending the file there was activity, but once it was being processed, there were several minutes of latency, and so the connection was dropped.

Why did this work in Java, Browsers, and with curl?  Well, b/c there's something called TCP Keep Alive packets, which exist entirely to address this sort of issue.  Those application environments have keep alive turned on sufficiently so that it's NOT a problem.  But Python does not.  

The fix is to add this to your imports:

from requests_toolbelt.adapters.socket_options import TCPKeepAliveAdapter

And this to the Session where you are doing long uploads:

    s = grequests.Session()
    keep_alive = TCPKeepAliveAdapter(idle=120, count=50, interval=30)
    s.mount('https://', keep_alive)

This, finally resolve the problem.  You'll note I'm using grequests instead of requests.  The two are generally API compatible, and grequests solves another problem I don't admittedly understand or have much time to debug.  In any case, what I've seen elsewhere indicates that if you've got connection timeout exceptions you cannot otherwise track down, and you are connecting to an AWS endpoint, and you know there's a long delay in processing, this might just be the solution you are looking for.

Tuesday, July 23, 2024

Is MVX a FHIR CodingSystem or an IdentifierSystem

Well, if you look it up, it's a Coding System, and it's published by CDC here.  But in fact, it does both.  It identifies the concepts of Vaccine Manufacturers, but those are also "Entities" in V3 speak, actual things you could put your hands on evidence of (e.g., articles of incorporation), if still conceptual.  So they also serve as identifiers.

When I used to teach CDA (and V3) regularly, I would explain that codes are simply identifiers for concepts.  I ran across this while working on V2 to FHIR because FHIR Immunization resources treat the manufacturer of an immunization as a first-class resource, an organization, but HL7 V2 treats it as a code from a table in a CE or CWE datatype (depending on HL7 version).

What is one to do?  Well, it's actually pretty straightforward.  I downloaded the CVC MVX codes from the link above, write them into a coding system with code being the MVX code, and display being the Manufacturer name.  When I create the Organization, name gets the value of display, and identifier gets the value of code, and system remains the same.  It works like a charm.

     Keith


Monday, July 8, 2024

Automating V2toFHIR conversion and Testing

Yesterday morning morning I talked about automating my V2 to FHIR testing.  Now I'm going to talk about automating conversions.  I had to write a segment converter (more like four) before I finally found the right conversion pattern to base the automation on.

It sort of looks like this:

public class PIDParser extends AbstractSegmentParser {
    @Override
    public void parse(Segment pid) throws HL7Exception {
        if (isEmpty(pid)) {
            return;
        }

        Patient patient = this.createResource(Patient.class);
        // Patient identifier
        addField(pid, 2, Identifier.class, patient::addIdentifier);
        // Patient identifier list
        addFields(pid, 3, Identifier.class, patient::addIdentifier);
        // Patient alternate identifier
        addField(pid, 4, Identifier.class, patient::addIdentifier);
        // Patient name
        addFields(pid, 5, HumanName.class, patient::addName);
        // Mother's maiden name
        addFields(pid, 6, patient, HumanName.class, this::addMothersMaidenName);
        // Patient Birth Time
        addField(pid, 7, patient, DateTimeType.class, this::addBirthDateTime);
    
        /* ... and so on for another 30 fields */
    }

Let met help you interpret that.  The addField() method takes a segment and field number, gets the first occurrence of the field.  If it exists, it converts it to the type specified in the third argument (e.g., Identifier.class, HumanName.class, DateTimeType.class).  If the conversion is successful, it then sends the result to the Consumer<Type> or BiConsumer<Patient, Type> lambda which adds it to the resource.

The BiConsumer lambdas are needed to deal with special cases, such as adding Extensions to a resource as in the case for mother's maiden name, or birthDate when the converted result is more precise than to the day.

I want to make a parser annotation driven, so I've created a little annotation that serves both as documentation on the parser, as well as can drive the operation of the parse function.  So I came up with an annotation that is used like this on the PIDParser:

@ComesFrom(path="Patient.identifier", source = "PID-2")
@ComesFrom(path="Patient.identifier", source = "PID-3")
@ComesFrom(path="Patient.identifier", source = "PID-4")
@ComesFrom(path="Patient.name", source = "PID-5")
@ComesFrom(path=
    "Patient.extension('http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName')",
    source =
"PID-6")
@ComesFrom(path="Patient.birthDate", source = "PID-7")
@ComesFrom(path=
    "Patient.extension('http://hl7.org/fhir/StructureDefinition/patient-birthTime')",
    source =
"PID-7")

The path part of this annotation is written using the Simplified FHIRPath grammar defined in FHIR.  These paths can easily be traversed and set using the getNamedProperty() and setNamedProperty() of Element types.  A simple parse and traverse gets to the property, from which you can determine the type, and then call the appropriate converter on the field, which you can extract from the message by either parsing or using an HAPI V2 Terser on the source property.

I anticipate that ComesFrom will introduce new attributes as I determine patterns for them, for example to set a field to a fixed value, or translate a code from one code system to another before setting it via the destination consumer.  Another attribute may specific the name of a method in the parser to use as a lambda (accessed as a method via reflection on the parser itself).

ComesFrom annotations could obviously be written in CSV format, something like this used by the V2-to-FHIR project team to create the mapping spreadsheets.  In fact, what I'm doing right now for parsers is translating those spreadsheets by hand.  That will change VERY soon.

So, if I have ComesFrom annotations on my parsers, and a bunch of test messages, the next obvious (or maybe not-so-obvious) step is to parse the messages, and create the (almost) appropriate annotations to match those messages.  I did this later that day for the 60+ test messages in my initial test collection (which I expect to grow to by at least one if not two orders of magnitude).  Instead of the 100 assertions I started with that morning, I now have over 2775 that are almost complete.

Here's a small snippet of one:

MSH|^~\&|TEST|MOCK|DEST|DAPP|20240618083101-0400||RSP^K11^RSP_K11|RESULT-01|P|\
  2.5.1|||ER|AL|||||Z32^CDCPHINVS
  @MessageHeader.event.code.exists($this = "K11")
  @MessageHeader.meta.tag.code.exists($this = "RSP")
  @MessageHeader.definition.exists($this.endsWith("RSP_K11"))
  // ComesFrom(path="MessageHeader.source.sender", source = { "MSH-22", "MSH-4", "MSH-3"})
  @MessageHeader.source.sender.exists( \
    /* MSH-22 = null */ \
    /* MSH-4 = MOCK */ \
    /* MSH-3 = TEST */ \
  )
  // ComesFrom(path="MessageHeader.destination.receiver", source = { "MSH-23", "MSH-6", "MSH-5"})
  @MessageHeader.destination.receiver.exists( \
    /* MSH-23 = null */ \
    /* MSH-6 = DAPP */ \
    /* MSH-5 = DEST */ \
  ) 
  // ComesFrom(path="MessageHeader.source.sender.name", source = "MSH-4")
  @MessageHeader.source.sender.name.exists( \
    /* MSH-4 = MOCK */ \
  ) 
  // ComesFrom(path="MessageHeader.source.sender.endpoint", source = "MSH-3") 
  @MessageHeader.source.sender.endpoint.exists( \
    /* MSH-3 = TEST */ \
  )
  // ComesFrom(path="MessageHeader.destination.reciever.name", source = "MSH-6")
  @MessageHeader.destination.reciever.name.exists( \
    /* MSH-6 = DAPP */ \
  )
  // ComesFrom(path="MessageHeader.destination.reciever.endpoint", source = "MSH-5") 
  @MessageHeader.destination.reciever.endpoint.exists( \
    /* MSH-5 = DEST */ \
  )
  // ComesFrom(path="Bundle.timestamp", source = "MSH-7")
  @Bundle.timestamp.exists( 
    /* MSH-7 = 20240618083101-0400 */ 
  )

The stuff in comments (e.g., MSH-7 = 20240618083101-0400) lets the user know what message content is relevant for this assertion test.  Right now they (they = I) would have to manually change those values.  The corrected expression looks like this:

  @Bundle.timestamp.exists( 
    $this = "2024-06-18T08:31:01-04:00" 
  )

I'll get around to fixing that soon.  I have to be careful not to let the converter do the work of writing the test cases, because then they'll just be self-fulfilling prophecies instead of real tests.


Sunday, July 7, 2024

FHIRPath as a testing language

 I'm on a V2-to-FHIR journey writing some code to support translation of V2 messages to FHIR.

Along the way, one of the challenges I have is getting the testing framework set up so that it's easy to write assertions about conversions.  HL7 Messages are full of dense text, which makes it hard to read, and even harder to sprinkle them through with assertions, so that:

  1. The assertions about the FHIR translation are close to the HL7 Message text for which they are made (making it easy to verify that the right assertions are present).
  2. The text is loadable from external resources, rather than written in code.
  3. The assertions are written in a language that's easy to get parsing tools for, and 
  4. The assertion language works well with FHIR.
So, the last two are what leads to FHIRpath, which is obvious when you think about it.
FHIRPath is already in use for validating FHIR Resources, so here we go, just validating a bundle of resources.

The HAPI FHIRPath toolkit isn't quite as extensible as I'd look without digging into some of the guts (which I did some time back for FluentQuery), so about all I can do is write my own resource resolver to provide some customizations, even though I VERY MUCH WANT the ability to add constants and my own functions without having to rewrite some of the native R4 FhirPath code.  But, realistically, I just need to replace a few leading expressions to deal with my worst cases.

As a user, you want to write MessageHeader, or Patient, or Observation[3], or maybe even Bundle at the start of your tests.  For example: OperationOutcome.count() = 2.  But instead, you'd have to write something like %context.entry.resource.ofType(OperationOutcome).count() = 2.  Or in the case of the uniquitous Observation %context.entry.resource.ofType(Observation)[3] instead of just Observation[3].  It's about 30 characters added to each test you need to write, and just ONE message conversion probably needs about 500 assertions to fully validate it.  I'm at PID on my first example, and I've got about 100 assertions, and haven't hit the meat of the message yet.  That would be 3000 characters, or at my typing speed, about 8.5 minutes of extra typing.  I figure I can save about 5 seconds per assertion, and I've written 100 in one day.  According to Randal Monroe, I can invest 10 days into making that time savings happen.  Since it took me about an hour, I have 9 days of time to invest in something else (or I could just get back to work).


The next bit is about making HL7 messages easier to split into smaller pieces.  For now, a segment is small enough.  Segments are detectable by this regular expression pattern: /^[A-Z]{2}[A-Z0-9]/.  Any line that begins with that is a segment.  Great, now we're writing a parser (really a stream filter).

Now what?  Well, I need a way to identify assertions.  Let's use the @ character to introduce an assertion.  Anything followed by @ at the start of a line is an assertion, and it can be followed by a FHIRPath and an optional comment.  I suppose we could use # for comments, but FHIRpath already recognizes // comments, so I'll go with that.  Except that I have urls in my messages, and I don't want to have https:// urls split up after //.  So, I'll make the rule that the comment delimiter is two // followed by at least one space character.

Now, I'll borrow from some older programming languages and say that a terminal \ at the end of a line signals continuation of on the next line.  And it would be nice if \ plus some whitespace was the same as ending the line with a \, so I'll throw that in.  And finally, let's ignore leading and trailing whitespace on a line, because whitespace messes up messyages (yes, HL7 messages in text are messy).

Here's an example of how this might look in a message file:

MSH|^~\&|\
  TEST|MOCK|\
  DEST|DAPP|\
  20240618083101-0400||\
  RSP^K11^RSP_K11|\
  RESULT-01|\
  P|2.5.1|||\
  ER|AL|\
  ||||Z32^CDCPHINVS
@MessageHeader.count() = 1 // It has one message header
@MessageHeader.source.name = "TEST"
@MessageHeader.source.endpoint = "MOCK"
@MessageHeader.destination.name = "DEST"
@MessageHeader.destination.endpoint = "DAPP"

This is about where I'm at now.  Even better would be:

MSH|^~\&|\
  @MessageHeader.count() = 1 // It has one message header
  TEST|MOCK|\
  @MessageHeader.source.name = "TEST"
  @MessageHeader.source.endpoint = "MOCK"
  DEST|DAPP|\
  @MessageHeader.destination.name = "DEST"
  @MessageHeader.destination.endpoint = "DAPP"
  20240618083101-0400||\
  RSP^K11^RSP_K11|\
  RESULT-01|\
  P|2.5.1|||\
  ER|AL|\
  ||||Z32^CDCPHINVS

I'll get to that stage soon enough. I still have 9 more days to burn on automation. You can see though, how FHIRPath makes it easy to write the tests within your sample messages.


Thursday, June 27, 2024

The true edge cases of date/time parsing

 I'm in the process of developing a Java library to implement the V2-to-FHIR datatype conversions.  This is a core component for a V2 to FHIR converter I hope to open source at some point.

I'm using HAPI V2 and HAPI FHIR because Java, and these are the best implementations in Java.

Some interesting learnings:

  1.  HAPI  FHIR and HAPI V2 have overlapping acceptable input ranges.
    In order to provide the broadest support, I'm actually using both.  The V2 parser is easier to manage with the range of acceptable ISO-8601 date time strings, since it's easier to remove any existing punctuation from an ISO-8601 data type (the -, : and T characters that typically show up in XML and JSON time stamps), than it is to insert missing punctuation, so I go with that first with the input string normalized to the 8601 format w/o punctuation.
  2. There are a couple of wonky cases that HAPI FHIR parses but V2 doesn't.  If the hours, minutes or seconds are set to the string "-0", the strings parse in FHIR, but don't in V2.  I'm guessing someone is using parse by punctuation and that Integer.toString("-0") and getting 0 in the FHIR case.
  3. Another interesting case: calling new DateTimeType(s).getValueAsString() doesn't return the same string as the following in all cases (e.g., the "2024-06-25T-0:00:00" case).
         DateTimeType original = DateTimeType(s);
    DateTimeType normalized = new DateTimeType(original.getValue(), 
      original.getPrecision());
         assertEquals(original, normalized);
    This is because HAPI FHIR doesn't renormalize the input string if it is able to be parsed (but it probably should).  I've ensured my converter class does this renormalizing.  
  4. If a time zone isn't present, then it's assumed to be the local time zone in both HAPI V2 and HAPI FHIR.  This presents small problems in testing, but nothing insurmountable.
  5. While the value of 60 is allowed in seconds in FHIR, what HAPI FHIR actually does is role to the next minute.  Again, very edge case, as leap second events are infrequent in real life (only 27 since 1972), and even more so as detected as having impacts during code execution.  GPS aware applications have to deal with them because GPS time doesn't care about leap seconds (although your phone, which synchronizes it's time via GPS may correct for them).

For what it's worth as a side note, issue # 3 above happens with other types such as IntegerType, for example, try this test:
|
assertEquals(new IntegerType("01").getValueAsString(),
             new IntegerType(1).getValueAsString());

What I've done in my converter is after creating a value from a string, ensuring I perform any string-based construction like this:

IntegerType i = new IntegerType("01"); 
i.setValue(i.getValue());

This ensure that I get the expected string representation.  This helps resolve issues with leading zeros or plus signs (e.g., -01, 01, +01, +2).