Sunday, October 20, 2024

A minor performance improvement

I work on a Spring Boot application using embedded Tomcat (several in fact).  A rewrite of this application in the last year tripled the load capacity (that is NOT what this blog post is about).  That load capacity is well more than needed to handle the entire load of the system with a single instance, but good systems grow, and there is also the need for excess surge capacity.

Pushing beyond that 3X limit though, the application was having failures under more stressful loads.  I was well over the system design requirements (we run multiple instances of the application for redundancy), but it bugged me.  I had a few extra hours of unscheduled time to play with this, so I started a side project to finally track it down.

Shortly into the fourth load test (the one that goes just beyond that 3X load mark), the application is restarted by AWS after failing 3 consecutive health checks.  I'm supposed to get a log message about this shutdown, but don't.  That turns out to be a signal propagation issue combined with a timing issue (my logger is running in a small separate background process).

The key problem wasn't a memory issue, threading issue, or a CPU load issue.  Heap memory was 1.4 GB, well under the 8 GB max and the 4 GB allocated heap size.  Thread counts were running around 40-50, well below the worker thread limit.  CPU load at application startup might hit 50%, but after that it was well below 20% at the time of the health check failure.  Even the number of in-use TCP connections was well below any server-imposed limits.

As it turned out, the issue was not about the number of in-use TCP connections, but rather about the number of available TCP connections.  The root cause was keep-alives in both the application and transport layers.  HTTP/1.1 (the application protocol) supports the keep-alive header, which allows a single connection to be used for multiple requests, saving startup time.  HTTP keep-alive prevents a socket from being closed after the response is sent back, so that it can be reused.  Lower on the network stack, TCP also supports keep-alive, which has a slightly different meaning.  It ensures that an open socket that is unused for a while still remains open on both sides (by having the client or server send keep-alive packets).

The server I'm working on has some long running requests that can take up to 10 minutes before a response is provided (the server is an intermediary talking to other, third-party services).  So long running TCP keep-alives are essential for correct operation, otherwise the poor client drops the connection before ever getting a response.  AWS load balancers have a default TCP idle timeout of 350 seconds, but this can be changed.  I recently had to adjust those settings to support a 10-minute request time (though that wasn't the root cause, the default setting of 350 was bad enough).

What it boiled down to was that I'd run out of available connections, not because connections were "active", but because they were being kept alive by a combination of Application layer (HTTP) and Transport layer (TCP) configuration.  I could not rely on the HTTP client NOT using application layer keep-alives, and the system needed TCP keep-alives for one of its long-running services.  Given that an actual request takes 10s to complete in the worst case, having a socket be unavailable for use for 5 minutes or longer is just too much.  That's 30 times as many connections unavailable as actually necessary.
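To put rough numbers on that, here is a minimal sketch using Little's law (sockets held = arrival rate x time each socket is held).  The rate of 10 requests per second is a made-up figure for illustration, and the sketch assumes every request arrives on its own connection; the 10 second and 5 minute values come from the paragraph above.

```java
final class KeepAliveCost {
    /** Little's law: average sockets held = arrival rate x time each socket is held. */
    static long socketsHeld(long requestsPerSecond, long holdSeconds) {
        return requestsPerSecond * holdSeconds;
    }

    public static void main(String[] args) {
        long rate = 10;       // hypothetical load, in requests per second
        long busy = 10;       // worst-case request time, in seconds
        long idle = 5 * 60;   // time a kept-alive socket stays unavailable, in seconds

        long needed = socketsHeld(rate, busy);
        long tiedUp = socketsHeld(rate, idle);
        System.out.println("needed=" + needed + " tiedUp=" + tiedUp
            + " ratio=" + (tiedUp / needed));
    }
}
```

Whatever the real request rate is, it cancels out of the ratio, which is why the 30X figure depends only on the two hold times.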

That system has a security requirement that every request/response to the server goes through TLS negotiation, instead of reusing existing connections.  There was never ANY NEED or use for HTTP keep-alive.  However, embedded Tomcat supports keep-alive by default in Spring Boot.  The simple solution was to set server.tomcat.max-keep-alive-requests = 1.

Initial testing confirmed that this resolved the capacity problem.  Continued testing shows another 10X improvement in capacity, for a combined 30X load capacity gain with CPU and memory still under 50% utilization.  As a result of these efforts, I can cut CPU costs by a good bit in production by reducing the number of redundant tasks, and lower memory requirements by 50%.

My next minor annoyance (side project) is task startup time.

     Keith

P.S. I don't claim to be a network engineer, or a crypto expert, but given the amount of time I spend on networking and crypto in my day job, I might as well claim those skills.


Friday, October 18, 2024

Bouncy Castle FIPS (BCFIPS) 2.0 Upgrade

I've mentioned Bouncy Castle a few times in this blog over the past year.

The easiest major version upgrade I've ever had to execute was upgrading from BC-FIPS 1.X to 2.X.  New in Bouncy Castle 2.0 is certification under FIPS 140-3 instead of FIPS 140-2 (all new certifications follow NIST 140-3 requirements).  It also includes support for Java 21 as well as Java 17 and prior releases.  You can find the NIST Certificate details here: 4743

Really, all I needed to do was update my pom.xml files.  Smoothest major upgrade ever.

Well, technically, I did have to do a couple of other things.

1. Download bc-fips-2.0.0.jar into my project so that I could use it in local calls to Java's KeyTool (I have to convert a jks store to bcfks format in my build process).

2. Add the jar files to my Docker Image.  BC-FIPS (at least in 1.x versions) cannot be rolled up into an Uber-Jar for Spring Boot given changes in the way that jar URL handling happens.  This is because the module validation code in BC-FIPS has to be able to access the class data in the JAR file.

These are the file versions you need to change.

Old                    New
bc-fips-1.0.2.X.jar    bc-fips-2.0.0.jar
bcpkix-fips-1.0.7.jar  bcpkix-fips-2.0.7.jar
bctls-fips-1.0.19.jar  bctls-fips-2.0.19.jar

    Keith


Wednesday, September 11, 2024

Python Requests Timing Out

Somehow it seems that every edge case for TLS or TCP communications winds up in my queue.  I don't do much Python development personally (almost nothing in fact).  That's a language that fits into the "I can read or debug it" category, rather than the "speak fluently and know the lexicon" one.  So, when an end user had a problem that they couldn't resolve in Python, it took me quite some time to figure it out.

In a nutshell, the problem was that uploads of a file and data using POST and MIME multipart/form-data in Python weren't working for files over a particular size but were for smaller files.  I couldn't reproduce the problem on my local system, or in my development environment, so this was very troubling.  We'd tested this with files of the size this user provides, and so were pretty certain it would work, but in actual practice, they couldn't make it work.

We'd had others upload using other platforms (Java, C#, bash scripts using curl, and browser-based forms), so we felt that there was nothing specifically wrong in the implementation, and we had seen nothing like this problem.

Symptomatically, we'd see the entire request (which took a while to upload), and it would be processed (which also took a while), and we also had a log of the outbound response, so we knew a 200 OK success response had been sent back, but the user wasn't getting it.  The read was timing out.  This very much looked to me like a firewall problem, and we had the networking folks check it out, but clearly the firewall on the service wasn't closing the socket prematurely.  The request takes about 30 minutes to upload and process, so not a great candidate for debugging via Wireshark, and as I said, the problem did not occur in my development or local environments.

I finally tracked this down yesterday.  It really was a networking issue, but not in the firewall.  Instead, it's a problem in the network gateway.  In the AWS environment, if the network gateway does not have activity on a connection after 350 seconds, the connection is dropped.  Large file processing takes time on the order of file size, and the large files the user was sending took about 10 minutes to process.  While sending the file there was activity, but once it was being processed, there were several minutes of latency, and so the connection was dropped.

Why did this work in Java, browsers, and with curl?  Well, because there's something called TCP keep-alive packets, which exist entirely to address this sort of issue.  Those application environments have keep-alive turned on sufficiently so that it's NOT a problem.  But Python does not.

The fix is to add this to your imports:

from requests_toolbelt.adapters.socket_options import TCPKeepAliveAdapter

And this to the Session where you are doing long uploads:

    s = grequests.Session()
    keep_alive = TCPKeepAliveAdapter(idle=120, count=50, interval=30)
    s.mount('https://', keep_alive)

This finally resolved the problem.  You'll note I'm using grequests instead of requests.  The two are generally API compatible, and grequests solves another problem I admittedly don't understand or have much time to debug.  In any case, what I've seen elsewhere indicates that if you've got connection timeout exceptions you cannot otherwise track down, and you are connecting to an AWS endpoint, and you know there's a long delay in processing, this might just be the solution you are looking for.

Tuesday, July 23, 2024

Is MVX a FHIR CodingSystem or an IdentifierSystem

Well, if you look it up, it's a Coding System, and it's published by CDC here.  But in fact, it does both.  It identifies the concepts of Vaccine Manufacturers, but those are also "Entities" in V3 speak, actual things you could put your hands on evidence of (e.g., articles of incorporation), if still conceptual.  So they also serve as identifiers.

When I used to teach CDA (and V3) regularly, I would explain that codes are simply identifiers for concepts.  I ran across this while working on V2 to FHIR because FHIR Immunization resources treat the manufacturer of an immunization as a first-class resource, an organization, but HL7 V2 treats it as a code from a table in a CE or CWE datatype (depending on HL7 version).

What is one to do?  Well, it's actually pretty straightforward.  I downloaded the CDC MVX codes from the link above and wrote them into a coding system with code being the MVX code, and display being the Manufacturer name.  When I create the Organization, name gets the value of display, identifier gets the value of code, and system remains the same.  It works like a charm.
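As a minimal sketch of that mapping: the three records below are hypothetical stand-ins for the FHIR Coding, Identifier and Organization types (the real ones live in HAPI FHIR and have far more fields), and the sample MVX entry is only there for illustration.

```java
/** Hypothetical stand-ins for the FHIR Coding, Identifier and Organization types. */
record Coding(String system, String code, String display) {}
record Identifier(String system, String value) {}
record Organization(String name, Identifier identifier) {}

final class MvxMapper {
    /** name comes from display, identifier.value from code, and system carries over. */
    static Organization toOrganization(Coding mvx) {
        return new Organization(mvx.display(), new Identifier(mvx.system(), mvx.code()));
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        Coding pfizer = new Coding("http://hl7.org/fhir/sid/mvx", "PFR", "Pfizer, Inc");
        System.out.println(toOrganization(pfizer));
    }
}
```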

     Keith


Monday, July 8, 2024

Automating V2toFHIR conversion and Testing

Yesterday morning I talked about automating my V2 to FHIR testing.  Now I'm going to talk about automating conversions.  I had to write a segment converter (more like four) before I finally found the right conversion pattern to base the automation on.

It sort of looks like this:

public class PIDParser extends AbstractSegmentParser {
    @Override
    public void parse(Segment pid) throws HL7Exception {
        if (isEmpty(pid)) {
            return;
        }

        Patient patient = this.createResource(Patient.class);
        // Patient identifier
        addField(pid, 2, Identifier.class, patient::addIdentifier);
        // Patient identifier list
        addFields(pid, 3, Identifier.class, patient::addIdentifier);
        // Patient alternate identifier
        addField(pid, 4, Identifier.class, patient::addIdentifier);
        // Patient name
        addFields(pid, 5, HumanName.class, patient::addName);
        // Mother's maiden name
        addFields(pid, 6, patient, HumanName.class, this::addMothersMaidenName);
        // Patient Birth Time
        addField(pid, 7, patient, DateTimeType.class, this::addBirthDateTime);
    
        /* ... and so on for another 30 fields */
    }
}

Let me help you interpret that.  The addField() method takes a segment and field number, and gets the first occurrence of the field.  If it exists, it converts it to the type specified in the third argument (e.g., Identifier.class, HumanName.class, DateTimeType.class).  If the conversion is successful, it then sends the result to the Consumer<Type> or BiConsumer<Patient, Type> lambda which adds it to the resource.

The BiConsumer lambdas are needed to deal with special cases, such as adding Extensions to a resource as in the case for mother's maiden name, or birthDate when the converted result is more precise than to the day.

I want to make the parser annotation-driven, so I've created a little annotation that serves both as documentation on the parser and as a way to drive the operation of the parse function.  It is used like this on the PIDParser:

@ComesFrom(path="Patient.identifier", source = "PID-2")
@ComesFrom(path="Patient.identifier", source = "PID-3")
@ComesFrom(path="Patient.identifier", source = "PID-4")
@ComesFrom(path="Patient.name", source = "PID-5")
@ComesFrom(path="Patient.extension('http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName')",
    source = "PID-6")
@ComesFrom(path="Patient.birthDate", source = "PID-7")
@ComesFrom(path="Patient.extension('http://hl7.org/fhir/StructureDefinition/patient-birthTime')",
    source = "PID-7")

The path part of this annotation is written using the Simplified FHIRPath grammar defined in FHIR.  These paths can easily be traversed and set using the getNamedProperty() and setNamedProperty() of Element types.  A simple parse and traverse gets to the property, from which you can determine the type, and then call the appropriate converter on the field, which you can extract from the message by either parsing or using an HAPI V2 Terser on the source property.
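For readers who have not written a repeatable annotation before, here is a minimal sketch of what a declaration along these lines could look like.  The attribute names path and source follow the usage shown in this post; the Container type, the DemoParser class and the describe() helper are hypothetical names introduced only for this example, and source is assumed to be an array so that one path can list several fallback fields.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shape for the annotation; kept at runtime so a parser can read it. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@Repeatable(ComesFrom.Container.class)
@interface ComesFrom {
    String path();
    String[] source();

    /** Container that Java uses to hold repeated @ComesFrom annotations. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Container {
        ComesFrom[] value();
    }
}

@ComesFrom(path = "Patient.identifier", source = "PID-3")
@ComesFrom(path = "Patient.name", source = "PID-5")
@ComesFrom(path = "MessageHeader.source.sender", source = { "MSH-22", "MSH-4", "MSH-3" })
final class DemoParser {
}

final class ComesFromDemo {
    /** Lists "sources -> path" for every mapping declared on the given class. */
    static List<String> describe(Class<?> parser) {
        List<String> out = new ArrayList<>();
        for (ComesFrom cf : parser.getAnnotationsByType(ComesFrom.class)) {
            out.add(String.join(",", cf.source()) + " -> " + cf.path());
        }
        return out;
    }

    public static void main(String[] args) {
        describe(DemoParser.class).forEach(System.out::println);
    }
}
```

A real driver would go on to resolve each path against the target resource and pull each source field from the message, rather than just printing the mapping.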

I anticipate that ComesFrom will gain new attributes as I determine patterns for them, for example to set a field to a fixed value, or to translate a code from one code system to another before setting it via the destination consumer.  Another attribute may specify the name of a method in the parser to use as a lambda (accessed as a method via reflection on the parser itself).

ComesFrom annotations could obviously be written in CSV format, something like this used by the V2-to-FHIR project team to create the mapping spreadsheets.  In fact, what I'm doing right now for parsers is translating those spreadsheets by hand.  That will change VERY soon.

So, if I have ComesFrom annotations on my parsers, and a bunch of test messages, the next obvious (or maybe not-so-obvious) step is to parse the messages, and create the (almost) appropriate annotations to match those messages.  I did this later that day for the 60+ test messages in my initial test collection (which I expect to grow by at least one if not two orders of magnitude).  Instead of the 100 assertions I started with that morning, I now have over 2775 that are almost complete.

Here's a small snippet of one:

MSH|^~\&|TEST|MOCK|DEST|DAPP|20240618083101-0400||RSP^K11^RSP_K11|RESULT-01|P|\
  2.5.1|||ER|AL|||||Z32^CDCPHINVS
  @MessageHeader.event.code.exists($this = "K11")
  @MessageHeader.meta.tag.code.exists($this = "RSP")
  @MessageHeader.definition.exists($this.endsWith("RSP_K11"))
  // ComesFrom(path="MessageHeader.source.sender", source = { "MSH-22", "MSH-4", "MSH-3"})
  @MessageHeader.source.sender.exists( \
    /* MSH-22 = null */ \
    /* MSH-4 = MOCK */ \
    /* MSH-3 = TEST */ \
  )
  // ComesFrom(path="MessageHeader.destination.receiver", source = { "MSH-23", "MSH-6", "MSH-5"})
  @MessageHeader.destination.receiver.exists( \
    /* MSH-23 = null */ \
    /* MSH-6 = DAPP */ \
    /* MSH-5 = DEST */ \
  ) 
  // ComesFrom(path="MessageHeader.source.sender.name", source = "MSH-4")
  @MessageHeader.source.sender.name.exists( \
    /* MSH-4 = MOCK */ \
  ) 
  // ComesFrom(path="MessageHeader.source.sender.endpoint", source = "MSH-3") 
  @MessageHeader.source.sender.endpoint.exists( \
    /* MSH-3 = TEST */ \
  )
  // ComesFrom(path="MessageHeader.destination.receiver.name", source = "MSH-6")
  @MessageHeader.destination.receiver.name.exists( \
    /* MSH-6 = DAPP */ \
  )
  // ComesFrom(path="MessageHeader.destination.receiver.endpoint", source = "MSH-5")
  @MessageHeader.destination.receiver.endpoint.exists( \
    /* MSH-5 = DEST */ \
  )
  // ComesFrom(path="Bundle.timestamp", source = "MSH-7")
  @Bundle.timestamp.exists( 
    /* MSH-7 = 20240618083101-0400 */ 
  )

The stuff in comments (e.g., MSH-7 = 20240618083101-0400) lets the user know what message content is relevant for this assertion test.  Right now they (they = I) would have to manually change those values.  The corrected expression looks like this:

  @Bundle.timestamp.exists( 
    $this = "2024-06-18T08:31:01-04:00" 
  )

I'll get around to fixing that soon.  I have to be careful not to let the converter do the work of writing the test cases, because then they'll just be self-fulfilling prophecies instead of real tests.


Sunday, July 7, 2024

FHIRPath as a testing language

 I'm on a V2-to-FHIR journey writing some code to support translation of V2 messages to FHIR.

Along the way, one of the challenges I have is getting the testing framework set up so that it's easy to write assertions about conversions.  HL7 Messages are full of dense text, which makes it hard to read, and even harder to sprinkle them through with assertions, so that:

  1. The assertions about the FHIR translation are close to the HL7 Message text for which they are made (making it easy to verify that the right assertions are present).
  2. The text is loadable from external resources, rather than written in code.
  3. The assertions are written in a language that's easy to get parsing tools for, and 
  4. The assertion language works well with FHIR.

So, the last two are what lead to FHIRPath, which is obvious when you think about it.  FHIRPath is already in use for validating FHIR Resources, so here we go, just validating a bundle of resources.

The HAPI FHIRPath toolkit isn't quite as extensible as I'd like without digging into some of the guts (which I did some time back for FluentQuery), so about all I can do is write my own resource resolver to provide some customizations, even though I VERY MUCH WANT the ability to add constants and my own functions without having to rewrite some of the native R4 FHIRPath code.  But, realistically, I just need to replace a few leading expressions to deal with my worst cases.

As a user, you want to write MessageHeader, or Patient, or Observation[3], or maybe even Bundle at the start of your tests.  For example: OperationOutcome.count() = 2.  But instead, you'd have to write something like %context.entry.resource.ofType(OperationOutcome).count() = 2.  Or in the case of the ubiquitous Observation, %context.entry.resource.ofType(Observation)[3] instead of just Observation[3].  It's about 30 characters added to each test you need to write, and just ONE message conversion probably needs about 500 assertions to fully validate it.  I'm at PID on my first example, and I've got about 100 assertions, and haven't hit the meat of the message yet.  That would be 3000 characters, or at my typing speed, about 8.5 minutes of extra typing.  I figure I can save about 5 seconds per assertion, and I've written 100 in one day.  According to Randall Munroe, I can invest 10 days into making that time savings happen.  Since it took me about an hour, I have 9 days of time to invest in something else (or I could just get back to work).
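The leading-expression replacement can be sketched in a few lines.  This is not the resolver code itself, just one plausible way to do the rewrite with a regular expression; the class and method names are made up for the example, and a leading Bundle is left alone on the assumption that the evaluator already resolves it against the Bundle being tested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FhirPathShorthand {
    // A leading capitalized type name, e.g. "Observation" in "Observation[3]".
    private static final Pattern LEADING_TYPE = Pattern.compile("^([A-Z][A-Za-z]*)\\b");

    /** Expands a leading resource type into the long form rooted at %context. */
    static String expand(String expression) {
        Matcher m = LEADING_TYPE.matcher(expression);
        if (!m.find() || "Bundle".equals(m.group(1))) {
            return expression; // nothing to rewrite
        }
        return "%context.entry.resource.ofType(" + m.group(1) + ")"
            + expression.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(expand("OperationOutcome.count() = 2"));
        System.out.println(expand("Observation[3]"));
    }
}
```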


The next bit is about making HL7 messages easier to split into smaller pieces.  For now, a segment is small enough.  Segments are detectable by this regular expression pattern: /^[A-Z]{2}[A-Z0-9]/.  Any line that begins with that is a segment.  Great, now we're writing a parser (really a stream filter).

Now what?  Well, I need a way to identify assertions.  Let's use the @ character to introduce an assertion.  Anything following an @ at the start of a line is an assertion, and it can be a FHIRPath expression followed by an optional comment.  I suppose we could use # for comments, but FHIRPath already recognizes // comments, so I'll go with that.  Except that I have URLs in my messages, and I don't want to have https:// URLs split up after //.  So, I'll make the rule that the comment delimiter is two slashes (//) followed by at least one space character.

Now, I'll borrow from some older programming languages and say that a terminal \ at the end of a line signals continuation on the next line.  And it would be nice if \ plus some whitespace was the same as ending the line with a \, so I'll throw that in.  And finally, let's ignore leading and trailing whitespace on a line, because whitespace messes up messyages (yes, HL7 messages in text are messy).
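Put together, those rules make a small stream filter.  The following is a minimal sketch of one, not the actual test harness: the class name and the SEGMENT/ASSERT/OTHER tags are invented for the example, and it only handles assertions that sit on their own logical lines after a segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MessageFileReader {
    private static final Pattern SEGMENT = Pattern.compile("^[A-Z]{2}[A-Z0-9]");
    // A comment starts at "//" followed by whitespace, so "https://..." survives.
    private static final Pattern COMMENT = Pattern.compile("//\\s.*$");

    /** Joins continuation lines, then tags each logical line. */
    static List<String> read(List<String> rawLines) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String raw : rawLines) {
            String line = raw.strip(); // ignore leading and trailing whitespace
            if (line.endsWith("\\")) {
                // Trailing backslash: drop it and keep reading the same logical line.
                pending.append(line, 0, line.length() - 1);
                continue;
            }
            pending.append(line);
            if (pending.length() > 0) {
                out.add(classify(pending.toString()));
            }
            pending.setLength(0);
        }
        if (pending.length() > 0) {
            out.add(classify(pending.toString())); // file ended on a continuation
        }
        return out;
    }

    /** Tags one logical line as an assertion, a segment, or something else. */
    static String classify(String logical) {
        if (logical.startsWith("@")) {
            String expr = COMMENT.matcher(logical.substring(1)).replaceFirst("").strip();
            return "ASSERT " + expr;
        }
        return (SEGMENT.matcher(logical).find() ? "SEGMENT " : "OTHER ") + logical;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "MSH|^~\\&|\\",
            "  TEST|MOCK|\\  ",
            "  DEST|DAPP",
            "@MessageHeader.count() = 1 // It has one message header");
        read(lines).forEach(System.out::println);
    }
}
```

Stripping each line before checking for the backslash is what makes "\ plus some whitespace" behave the same as a bare trailing backslash.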

Here's an example of how this might look in a message file:

MSH|^~\&|\
  TEST|MOCK|\
  DEST|DAPP|\
  20240618083101-0400||\
  RSP^K11^RSP_K11|\
  RESULT-01|\
  P|2.5.1|||\
  ER|AL|\
  ||||Z32^CDCPHINVS
@MessageHeader.count() = 1 // It has one message header
@MessageHeader.source.name = "TEST"
@MessageHeader.source.endpoint = "MOCK"
@MessageHeader.destination.name = "DEST"
@MessageHeader.destination.endpoint = "DAPP"

This is about where I'm at now.  Even better would be:

MSH|^~\&|\
  @MessageHeader.count() = 1 // It has one message header
  TEST|MOCK|\
  @MessageHeader.source.name = "TEST"
  @MessageHeader.source.endpoint = "MOCK"
  DEST|DAPP|\
  @MessageHeader.destination.name = "DEST"
  @MessageHeader.destination.endpoint = "DAPP"
  20240618083101-0400||\
  RSP^K11^RSP_K11|\
  RESULT-01|\
  P|2.5.1|||\
  ER|AL|\
  ||||Z32^CDCPHINVS

I'll get to that stage soon enough. I still have 9 more days to burn on automation. You can see though, how FHIRPath makes it easy to write the tests within your sample messages.


Thursday, June 27, 2024

The true edge cases of date/time parsing

 I'm in the process of developing a Java library to implement the V2-to-FHIR datatype conversions.  This is a core component for a V2 to FHIR converter I hope to open source at some point.

I'm using HAPI V2 and HAPI FHIR because Java, and these are the best implementations in Java.

Some interesting learnings:

  1.  HAPI  FHIR and HAPI V2 have overlapping acceptable input ranges.
    In order to provide the broadest support, I'm actually using both.  The V2 parser is easier to manage with the range of acceptable ISO-8601 date time strings, since it's easier to remove any existing punctuation from an ISO-8601 data type (the -, : and T characters that typically show up in XML and JSON time stamps), than it is to insert missing punctuation, so I go with that first with the input string normalized to the 8601 format w/o punctuation.
  2. There are a couple of wonky cases that HAPI FHIR parses but V2 doesn't.  If the hours, minutes or seconds are set to the string "-0", the strings parse in FHIR, but don't in V2.  I'm guessing someone is parsing by punctuation, calling Integer.parseInt("-0"), and getting 0 in the FHIR case.
  3. Another interesting case: calling new DateTimeType(s).getValueAsString() doesn't return the same string as the following in all cases (e.g., the "2024-06-25T-0:00:00" case).
    DateTimeType original = new DateTimeType(s);
    DateTimeType normalized = new DateTimeType(original.getValue(),
      original.getPrecision());
    // Fails for inputs such as "2024-06-25T-0:00:00"
    assertEquals(original.getValueAsString(), normalized.getValueAsString());
    This is because HAPI FHIR doesn't renormalize the input string if it is able to be parsed (but it probably should).  I've ensured my converter class does this renormalizing.  
  4. If a time zone isn't present, then it's assumed to be the local time zone in both HAPI V2 and HAPI FHIR.  This presents small problems in testing, but nothing insurmountable.
  5. While the value of 60 is allowed in seconds in FHIR, what HAPI FHIR actually does is roll over to the next minute.  Again, very edge case, as leap second events are infrequent in real life (only 27 since 1972), and even more rarely detected as having an impact during code execution.  GPS aware applications have to deal with them because GPS time doesn't care about leap seconds (although your phone, which synchronizes its time via GPS, may correct for them).
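The "remove the punctuation" normalization from item 1 can be sketched as follows.  This is an illustration rather than the converter's actual code: the class name is made up, rewriting Z as +0000 is an assumption of this sketch, and oddities such as the "-0" cases are simply rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimestampNormalizer {
    // Accepts punctuated ISO-8601 or unpunctuated V2 style, at any precision from year up.
    private static final Pattern ISO_OR_V2 = Pattern.compile(
        "^(\\d{4})(?:-?(\\d{2})(?:-?(\\d{2})(?:T?(\\d{2})(?::?(\\d{2})(?::?(\\d{2})(\\.\\d+)?)?)?)?)?)?"
        + "(Z|[+-]\\d{2}:?\\d{2})?$");

    /** Drops the -, : and T punctuation but keeps the sign of the time zone offset. */
    static String toV2(String input) {
        Matcher m = ISO_OR_V2.matcher(input.strip());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a recognized timestamp: " + input);
        }
        StringBuilder out = new StringBuilder();
        for (int g = 1; g <= 7; g++) { // year, month, day, hour, minute, second, fraction
            if (m.group(g) != null) {
                out.append(m.group(g));
            }
        }
        String zone = m.group(8);
        if (zone != null) {
            // Assumption of this sketch: "Z" is rewritten as a +0000 offset.
            out.append("Z".equals(zone) ? "+0000" : zone.replace(":", ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toV2("2024-06-18T08:31:01-04:00"));
    }
}
```

The offset has to be captured separately because a blanket removal of "-" would also eat the sign of a negative time zone.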

For what it's worth as a side note, issue #3 above happens with other types such as IntegerType.  For example, try this test:
assertEquals(new IntegerType("01").getValueAsString(),
             new IntegerType(1).getValueAsString());

What I've done in my converter, after creating a value from a string, is reset the value from its parsed form, like this:

IntegerType i = new IntegerType("01"); 
i.setValue(i.getValue());

This ensures that I get the expected string representation.  This helps resolve issues with leading zeros or plus signs (e.g., -01, 01, +01, +2).
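The same parse-then-render idea can be seen in plain Java without HAPI.  This is only an analogue of the setValue(getValue()) trick above, with a made-up class name, showing why round-tripping through the parsed value gives one canonical string.

```java
final class Renormalize {
    /** Parses a decimal string and renders it back in canonical form. */
    static String canonical(String raw) {
        return Integer.toString(Integer.parseInt(raw.strip()));
    }

    public static void main(String[] args) {
        for (String raw : new String[] { "-01", "01", "+01", "+2" }) {
            System.out.println(raw + " -> " + canonical(raw));
        }
    }
}
```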


Wednesday, May 1, 2024

Running BCFIPS in SpringBoot 3.2


I'm rewriting wrappers for a Spring Boot 1.5 application using Spring Boot 3.2 to help me eliminate some older dependencies I cannot even get patches for anymore.  So now I have to make my new Spring Boot application work with the latest Bouncy Castle FIPS code.

I've previously mentioned the NIST Certified Bouncy Castle FIPS TLS libraries in other posts.  SOAP is tremendously complicated to configure and manage.  REST is so much easier, and when you don't need all of the power of a SOAP stack, you can certainly send SOAP messages in XML quite readily using a normal REST transport layer.  You may need to write some extra marshalling code, but for standardized interfaces, that doesn't tend to be a big lift and there are plenty of existing tools to help (e.g., FasterXML Jackson).

There are a few challenges. One of these stems from the need for the BC-FIPS modules to validate themselves on startup.  This uses jar validation processes, but the code running the validation needs to be able to find itself.  That gets complicated by a minor change in Spring Boot 3.2 URL format when it repackages the application into an uber-jar. The URL conflicts with what the BC modules expect, and then they won't validate.  While the fix is a one-liner in BC-FIPS, it's a change to a certified software module, which means that it has to go through processes which make it take longer than anyone would like.

Spring Boot uber class loading has the ability to target classes which need to be unpacked from the Uber-jar before loading. The JAR files will then be unpacked from the uber-jar into your TEMP folder before being loaded, and those will be accessible from a different kind of URL, that of the JAR file, which BCFIPS knows how to resolve.

To make this work, you will need to update your configuration for the spring-boot-maven-plugin to tell it which jar files need to be unpacked.  The necessary lines of code are below:

            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <mainClass><!--your-application-main-class--></mainClass>
                    <requiresUnpack>
                        <dependency>
                            <groupId>org.bouncycastle</groupId>
                            <artifactId>bcpkix-fips</artifactId>
                        </dependency>
                        <dependency>
                            <groupId>org.bouncycastle</groupId>
                            <artifactId>bc-fips</artifactId>
                        </dependency>
                        <dependency>
                            <groupId>org.bouncycastle</groupId>
                            <artifactId>bctls-fips</artifactId>
                        </dependency>
                    </requiresUnpack>
                </configuration>
                <executions>
                    <execution>
                        <id>default-jar</id>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

Rewriting my wrapper to ditch CXF and modernize to the latest Spring Boot will save a lot on future maintenance efforts and is already making the application much simpler.  There's plenty of glue code needed to run inside the CXF framework that can just be done away with.

As you now know, I had to go back and rework for SNI again too.  It turns out the same fix was basically needed, calling URLConnectionUtil.openConnection().

import org.bouncycastle.jsse.util.URLConnectionUtil;

  ...

  /**
   * This method obtains an SNI enabled Server Socket Factory
   * @param location The URL to obtain a connection for
   * @return An HttpURLConnection that supports SNI if the URL starts with HTTPS or has no URL Scheme.
   * @throws IOException
   */
  public HttpURLConnection getSNIEnabledConnection(URL location) throws IOException {
    String protocol = location.getProtocol();
    if (protocol == null || !protocol.startsWith("https")) {
      // Not HTTPS
      return (HttpURLConnection) location.openConnection();
    }
    URLConnectionUtil util = new URLConnectionUtil(getSSLContext().getSocketFactory());
    HttpsURLConnection conx = (HttpsURLConnection) util.openConnection(location);
    conx.setHostnameVerifier(ClientTlsSupport::verifyHostname);
    return conx;
  }



More fun with SNI and TLS with Akamai edge servers and Bouncy Castle internal_error(80)

Recently, endpoints that one of the systems I maintain frequently connects to underwent a change in how they hosted their servers.  They were moved into an Akamai edge server framework, which requires use of the Server Name Indication (SNI) extension in TLS communications.  This isn't routinely enabled in a straight Java JSSE client connection, especially when using Bouncy Castle FIPS.  As I previously stated, you have to configure it in your code.

My guess is that when a request is made without the SNI TLS extension, the Akamai edge environment reports a TLS error.  Sadly, Akamai reports this TLS error using the TLS internal_error(80) fatal alert code instead of the more descriptive missing_extension(109).  That's because missing_extension(109) wasn't defined until TLS 1.3 (which is much more expressive of the reason for failure).

On my side, I had half-enabled SNI, where it was used for the main protocol request and response, but wasn't being used for the requests sent to the underlying authorization protocol (OAuth 2.0).  The reason that was missed was simply because it is a completely independent module.

If you've got Wireshark, and you see something like this in your capture, look at the list of extensions.

If server_name extension is missing, as in the above capture, you need to fix your code to use SNI.  Once you've fixed it, it should look like this: 


In any case, if you are using BC FIPS, and get an internal_error(80) while trying to connect to an Akamai Edge Server, and your TLS Client Hello doesn't contain the SNI Extension: Well, there's your problem.



Wednesday, April 10, 2024

The nastiest code I ever wrote

Creating a mock that injects random errors into the expected responses is necessary to ensure that your server resiliency features work as intended. Doing so can be very challenging, especially if your mock has to simulate complex behavior.  FWIW, I call it a mock, because it is, but the endpoint is a test endpoint used by a production environment to verify that the server is operating correctly.

What appears below all-out wins the award for the nastiest, most evil code I've ever written that appears in a production system.

resp = ReflectionUtils.unwrapResponse(resp);
if (resp instanceof ResponseFacade rf) {
  try {
    // Drill through Tomcat internals using reflection to get to the underlying socket and SHUT it down hard.
    NioEndpoint.NioSocketWrapper w =
      ReflectionUtils.getField(rf, "response.coyoteResponse.hook.socketWrapper", NioEndpoint.NioSocketWrapper.class);
    ReflectionUtils.attempt(() -> w.getSocket().getIOChannel().shutdownInput());  
    ReflectionUtils.attempt(() -> w.getSocket().getIOChannel().shutdownOutput());
  } catch (Exception e) {
    log.warn("Cannot access socket", e);
  }
} else {
  throw new RuntimeException("Simulated Socket Reset", new SocketException("Connection Reset [simulated]"));
}

It is used for a mock endpoint developed to performance test an application that sends messages to other systems.  The purpose of this is to force a reset on the server socket being used to respond to a request.  The mock endpoint receives a request, and randomly shuts down the socket to help test the resilience and performance of the main application.

This delves deeply into Tomcat internals and uses a few reflection utility functions that I wrote.

The unwrapResponse() function just gets to the innermost HttpServletResponse, assuming that is the beast that is closest to Tomcat internals (in my case, it is).

public static HttpServletResponse unwrapResponse(HttpServletResponse servletResp) {
  return ((servletResp instanceof HttpServletResponseWrapper w) ?
    unwrapResponse((HttpServletResponse)w.getResponse()) : servletResp);
}

The getField() method breaks the string given as the second argument into parts around the period (.) character, and then recursively extracts the object at that field, returning it.  The last argument is the type of the object.  I'll leave this one as an exercise for the reader.  These lines of code first get the socket from the socket wrapper of the action hook in the CoyoteResponse of the Response object that's backing the outermost HttpServletResponse object given in resp.

The attempt() method just calls the callable (which is allowed to throw an exception), and returns, regardless of whether or not the attempt succeeds.  This is because I don't want this code to fail.
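For anyone who wants a head start on that exercise, here is one possible sketch of the two helpers.  It is not the ReflectionUtils class from this post: it walks the path with a loop instead of recursion, wraps reflection failures in an unchecked exception, and the DemoHook and DemoWrapper classes exist only to give the example something to dig into.  It also assumes the target classes allow setAccessible(), which holds for classes on the classpath but can fail for JDK internals on newer JDKs.

```java
import java.lang.reflect.Field;
import java.util.concurrent.Callable;

final class ReflectionSketch {
    /** Walks a dotted path of field names from root and casts the final value. */
    static <T> T getField(Object root, String path, Class<T> type) {
        Object current = root;
        try {
            for (String name : path.split("\\.")) {
                Field f = findField(current.getClass(), name);
                f.setAccessible(true); // reach private fields
                current = f.get(current);
            }
        } catch (ReflectiveOperationException | NullPointerException e) {
            throw new IllegalStateException("Cannot resolve " + path, e);
        }
        return type.cast(current);
    }

    /** Looks for the field on the class and then up its superclass chain. */
    private static Field findField(Class<?> start, String name) throws NoSuchFieldException {
        for (Class<?> k = start; k != null; k = k.getSuperclass()) {
            try {
                return k.getDeclaredField(name);
            } catch (NoSuchFieldException e) {
                // not declared here, keep looking in the superclass
            }
        }
        throw new NoSuchFieldException(name + " not found on " + start.getName());
    }

    /** Runs the action and swallows any exception it throws. */
    static void attempt(Callable<?> action) {
        try {
            action.call();
        } catch (Exception e) {
            // deliberately ignored: this code must not fail
        }
    }
}

/** Hypothetical nested objects standing in for the Tomcat internals. */
final class DemoHook {
    private final String socketWrapper = "the-socket";
}

final class DemoWrapper {
    private final DemoHook hook = new DemoHook();
}
```

With those stand-ins, ReflectionSketch.getField(new DemoWrapper(), "hook.socketWrapper", String.class) reaches through both private fields in one call.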

What's inside that callable is the critical code.  That uses the socketWrapper  to get the NIOChannel, and force a shutdown on both sides of the socket.  This is the hardest I can crash an inbound connection without messing with operating system calls, and it works in Tomcat 8.5, and I'm pretty certain it will work in Tomcat 10 as well.  I'll know in a couple of days once I get done migrating.

This is truly evil code; however, it wouldn't exist without a purpose.  That is to test how the system responds when a socket goes randomly belly-up during a performance test.  Yes, this code is randomly triggered during performance testing whilst the application is attempting to make requests.  As much as possible, I've basically installed a robot walking through the server room and randomly pulling cables from the endpoint the application is trying to get to, without having any access to that room or the cables.

This is basically the code that is getting invoked by this ugliness (that link is for Java 8, I'm actually using JDK 17 and previously used JDK 11).  You might find this technique useful yourself.  I find it's a lot better than having to pull my wired connection a dozen or more times for my weekly performance test.  Yes, this runs in a scheduled automated test at least once a week.