Friday, December 20, 2024

Flying

Last year I wanted to gift myself some flying lessons.  I finally managed to do that earlier this month, and enrolled in the Top Gun Experience at AeroVenture, a local flying school.  The airport is about 20 miles (as the crow flies) from my house, or about a 35-minute drive.  It's six hours of instruction, 3 on the ground and 3 in the air, so NOT enough for a private pilot license, but enough to get started if you wanted to.

I took my first class on Wednesday.  We spent about 40 minutes in the simulator and talking about flight physics.  This was the simulator setup.  That's a few thousand dollars of equipment and a PC and monitor running the flight simulator application.

On Wednesday, I flew a Piper Warrior, which looks like this from the outside.


And like this on the inside:


Here I am in the student seat with head-gear.

Yesterday I flew a Cessna 172 Skyhawk:

These are the controls:

As part of my ground instruction, we planned the flight to my house, about 15 minutes (flight time) from the airport, and then took the plane out of the hangar.  Here's the overhead view (the instructor was flying while I took the picture).

I have one more flight, the Friday morning after Christmas.  This was an awesome experience, as evidenced by my ear-to-ear grin.


Tuesday, December 17, 2024

A Christmas present from ASTP

#HTI2 Part 2 just dropped. At 54 pages, Health Data, Technology, and Interoperability: Protecting Care Access is one of the shortest rules I've seen recently from ASTP/ONC.  You can find it in the FR at https://federalregister.gov/documents/2024/12/17/2024-29683/health-data-technology-and-interoperability-protecting-care-access or in PDF form here https://govinfo.gov/content/pkg/FR-2024-12-17/pdf/2024-29683.pdf

There are just two pages of regulatory text in the Protecting Care Access part of #HTI2.

A definition for "Reproductive Healthcare" is added to section 171 -- which explains in part the rationale for this rule.

And Information Blocking is updated with a new exception at 171.206 to permit blocking when "persons seeking, obtaining, providing, or facilitating reproductive health care are at risk of being potentially exposed to legal action ..."

The text for it starts here https://federalregister.gov/d/2024-29683/p-amd-5

Protecting Care Access truly is a Christmas present.  Thank you ASTP!



Wednesday, December 11, 2024

HTI2 just dropped ... light reading for the holidays

The #HTI2 Final Rule just dropped. At 156 pages, this is light reading for a December final rule from the Assistant Secretary for Thwarting PTO.

You can read the final text PDF of Health Data, Technology, and Interoperability: Trusted Exchange Framework and Common Agreement (TEFCA) here, and on Monday 12/16, here.

The key question for anyone who read the original war and peace version would be, where did over 1000 pages go? The answer provided by ASTP is rather simple: "Comments received in response to other proposals from the HTI-2 Proposed Rule are beyond the scope of this final rule, are still being reviewed and considered, and may be the subject of subsequent final rules..."

So, next year ... maybe.

The key changes are:

  • Complete EHR and EHR Module terms have been removed from #HTI2
  • They finalized the TEFCA Manner Exception in subpart D of part 171 with no revisions.
  • They added 45 CFR part 172, which codifies provisions related to TEFCA.

Changes to 170.315 (certification criteria) in #HTI2 are minimal, see https://public-inspection.federalregister.gov/2024-29163.pdf#page=126


Changes to Section 171 add a severability clause and reference definitions from Section 172 (new), see https://public-inspection.federalregister.gov/2024-29163.pdf#page=129



What they DID NOT do in HTI-2 was change what they wrote in HTI-1 here:

The final 27 pages of HTI-2 add section 172, TRUSTED EXCHANGE FRAMEWORK AND COMMON AGREEMENT, to regulation.  These regulations apply mainly to QHINs, not to Health IT providers that are NOT QHINs, and so I'm not going to cover them in detail, but I will cover one topic: delegation.  ASTP is permitted under #HTI2 to delegate some of its responsibilities to the RCE (@sequoiaproject).  See https://public-inspection.federalregister.gov/2024-29163.pdf#page=139

These include:

Subpart C—QHIN Onboarding and Designation Processes
172.300 Applicability.
172.301 Submission of QHIN application. 
172.302 Review of QHIN application. 
172.303 QHIN approval and Onboarding.
172.304 QHIN Designation.
172.305 Withdrawal of QHIN application. 
172.306 Denial of QHIN application. 
172.307 Re-application. 

Subpart D—Suspension
172.400 Applicability. 
172.401 QHIN suspensions.
172.402 Selective suspension of exchange between QHINs.

Subpart E—Termination 
172.501 QHIN self-termination.
172.503 Termination by mutual agreement.

So, like I said, a light rulemaking from ASTP this Christmas.  Sounds like they have more work to do to earn the appellation: Assistant Secretary for Thwarting PTO.

Merry Christmas to all!

Sunday, October 20, 2024

A minor performance improvement

I work on a spring-boot application using embedded Tomcat (several, in fact).  A rewrite of this application in the last year tripled its load capacity (that is NOT what this blog post is about).  That capacity is well more than needed to handle the entire load of the system with a single instance, but good systems grow, and there is also the need for excess surge capacity.

Pushing beyond that 3X limit though, the application was having failures under more stressful loads.  I was well over the system design requirements (we run multiple instances of the application for redundancy), but it bugged me.  I had a few extra hours of unscheduled time to play with this, so I started a side project to finally track it down.

Shortly into the fourth load test (the one that goes just beyond that 3X load mark), the application is restarted by AWS after failing 3 consecutive health checks.  I'm supposed to get a log message about this shutdown, but don't.  That turns out to be a signal propagation issue combined with a timing issue (my logger is running in a small separate background process).

The key problem wasn't a memory issue, a threading issue, or a CPU load issue.  Heap memory was at 1.4 GB, well under the 4 GB allocated heap and the 8 GB maximum.  The application was running around 40-50 threads, well below the worker thread limit.  CPU load at application startup might hit 50%, but after that it was well below 20% at the time of the health check failure.  Even the number of in-use TCP connections was well below any server-imposed limits.

As it turned out, the issue was not about the number of in-use TCP connections, but rather the number of available TCP connections.  The root cause was keep-alives at both the application and transport layers.  HTTP/1.1 (the application protocol) supports the keep-alive header, which allows a single connection to be used for multiple requests, saving connection setup time.  HTTP keep-alive prevents a socket from being closed after the response is sent back, so that it can be reused.  Lower on the network stack, TCP also supports keep-alive, which has a slightly different meaning: it ensures that an open socket that is unused for a while still remains open on both sides (by having the client or server send keep-alive packets).

The server I'm working on has some long-running requests that can take up to 10 minutes before a response is provided (the server is an intermediary talking to other, third-party services).  So long-running TCP keep-alives are essential for correct operation, otherwise the poor client drops the connection before ever getting a response.  AWS load balancers have a default setting of 350 seconds for TCP keep-alive, but this can be changed.  I recently had to adjust those settings to support a 10-minute request time (though that wasn't the root cause, the default setting of 350 was bad enough).

What it boiled down to was that I'd run out of available connections, not because connections were "active", but because they were being kept alive by a combination of Application layer (HTTP) and Transport layer (TCP) configuration.  I could not rely on the HTTP client NOT using application layer keep-alives, and the system needed TCP keep-alives for one of its long-running services.  Given that an actual request takes 10 seconds to complete in the worst case, having a socket be unavailable for use for 5 minutes or longer is just too much.  That's 30 times as many connections unavailable as are actually necessary.

That system has a security requirement that every request/response to the server goes through TLS negotiation, instead of reusing existing connections.  There was never ANY NEED or use for HTTP keep-alive.  However, embedded Tomcat enables it by default in Spring Boot.  The simple solution was to set server.tomcat.max-keep-alive-requests = 1.
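In Spring Boot terms, that one-line fix lives in application.properties (the property name is Spring Boot's standard embedded Tomcat setting; a value of 1 allows only one request per connection, which effectively disables HTTP keep-alive):

    # Allow only one request per connection, effectively disabling HTTP keep-alive.
    # TCP-level keep-alive (needed for the long-running requests) is unaffected.
    server.tomcat.max-keep-alive-requests=1

Note this only turns off reuse at the HTTP layer; the TCP keep-alive settings on the load balancer stay as configured.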

Initial testing confirmed that this resolved the capacity problem.  Continued testing shows another 10X improvement in capacity, for a combined 30X load capacity gain with CPU and memory still under 50% utilization.  As a result of these efforts, I can cut CPU costs by a good bit in production by reducing the number of redundant tasks, and lower memory requirements by 50%.

My next minor annoyance (side project) is task startup time.

     Keith

P.S. I don't claim to be a network engineer, or a crypto expert, but given the amount of time I spend on networking and crypto in my day job, I might as well claim those skills.


Friday, October 18, 2024

Bouncy Castle FIPS (BCFIPS) 2.0 Upgrade

I've mentioned Bouncy Castle a few times in this blog over the past year.

The easiest major version upgrade I've ever had to execute was upgrading from BC-FIPS 1.X to 2.X.  New in Bouncy Castle 2.0 is certification under FIPS 140-3 instead of FIPS 140-2 (all new certifications follow NIST 140-3 requirements).  It also includes support for Java 21 as well as Java 17 and prior releases.  You can find the NIST Certificate details here: 4743

Really, all I needed to do was update my pom.xml files.  Smoothest major upgrade ever.

Well, technically, I did have to do a couple of other things.

1. Download bc-fips-2.0.0.jar into my project so that I could use it in local calls to Java's KeyTool (I have to convert a jks store to bcfks format in my build process).

2. Add the jar files to my Docker Image.  BC-FIPS (at least in 1.x versions) cannot be rolled up into an Uber-Jar for Spring Boot given changes in the way that JAR URL handling happens.  This is because the module validation code in BC-FIPS has to be able to access the class data in the JAR file.

These are the file versions you need to change.

Old                    New
bc-fips-1.0.2.X.jar    bc-fips-2.0.0.jar
bcpkix-fips-1.0.7.jar  bcpkix-fips-2.0.7.jar
bctls-fips-1.0.19.jar  bctls-fips-2.0.19.jar
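In pom.xml, the bump looks roughly like this (assuming the standard org.bouncycastle group ID for the FIPS artifacts; verify the versions against your own dependency tree):

    <!-- Bouncy Castle FIPS provider, PKIX, and TLS artifacts at their 2.x versions -->
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bc-fips</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcpkix-fips</artifactId>
      <version>2.0.7</version>
    </dependency>
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bctls-fips</artifactId>
      <version>2.0.19</version>
    </dependency>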

    Keith


Wednesday, September 11, 2024

Python Requests Timing Out

Somehow it seems that every edge case for TLS or TCP communications winds up in my queue.  I don't do much Python development personally (almost nothing, in fact).  That's a language that fits into the "I can read or debug it" category, rather than "speak fluently and know the lexicon".  So, when an end user had a problem that they couldn't resolve in Python, it took me quite some time to figure it out.

In a nutshell, the problem was that uploads of a file and data using POST and MIME multipart/form-data in Python weren't working for files over a particular size but were for smaller files.  I couldn't reproduce the problem on my local system, or in my development environment, so this was very troubling.  We'd tested this with files of the size this user provides, and so were pretty certain it would work, but in actual practice, they couldn't make it work.

We'd had others upload using other platforms (Java, C#, bash scripts using curl, and browser-based forms), so we felt that there was nothing specifically wrong in the implementation, and we had seen nothing like this problem.

Symptomatically, we'd see the entire request (which took a while to upload), and it would be processed (which also took a while), and we also had a log of the outbound response, so we knew a 200 OK success response had been sent back, but the user wasn't getting it.  The read was timing out.  This very much looked to me like a firewall problem, and we had the networking folks check it out, but the firewall on the service side clearly wasn't closing the socket prematurely.  The request takes about 30 minutes to upload and process, so not a great candidate for debugging via Wireshark, and as I said, the problem did not occur in my development or local environments.

I finally tracked this down yesterday.  It really was a networking issue, but not in the firewall.  Instead, it's a problem in the network gateway.  In the AWS environment, if the network gateway sees no activity on a connection for 350 seconds, the connection is dropped.  Large file processing takes time on the order of file size, and the large files the user was sending took about 10 minutes to process.  While sending the file there was activity, but once it was being processed, there were several minutes of latency, and so the connection was dropped.

Why did this work in Java, Browsers, and with curl?  Well, because there's something called TCP keep-alive packets, which exist entirely to address this sort of issue.  Those application environments have keep-alive turned on sufficiently so that it's NOT a problem.  But Python does not.

The fix is to add these imports:

    import grequests
    from requests_toolbelt.adapters.socket_options import TCPKeepAliveAdapter

And this to the Session where you are doing long uploads:

    # Send TCP keep-alive probes after 120 seconds of idle time, then every 30
    # seconds (up to 50 times), so the connection stays visibly active while
    # the server is still processing the upload.
    s = grequests.Session()
    keep_alive = TCPKeepAliveAdapter(idle=120, count=50, interval=30)
    s.mount('https://', keep_alive)

This finally resolved the problem.  You'll note I'm using grequests instead of requests.  The two are generally API compatible, and grequests solves another problem I admittedly don't understand or have much time to debug.  In any case, what I've seen elsewhere indicates that if you've got connection timeout exceptions you cannot otherwise track down, and you are connecting to an AWS endpoint, and you know there's a long delay in processing, this might just be the solution you are looking for.

Tuesday, July 23, 2024

Is MVX a FHIR CodingSystem or an IdentifierSystem

Well, if you look it up, it's a Coding System, and it's published by CDC here.  But in fact, it does both.  It identifies the concepts of Vaccine Manufacturers, but those are also "Entities" in V3 speak, actual things you could put your hands on evidence of (e.g., articles of incorporation), even if still conceptual.  So they also serve as identifiers.

When I used to teach CDA (and V3) regularly, I would explain that codes are simply identifiers for concepts.  I ran across this while working on V2 to FHIR because FHIR Immunization resources treat the manufacturer of an immunization as a first-class resource, an organization, but HL7 V2 treats it as a code from a table in a CE or CWE datatype (depending on HL7 version).

What is one to do?  Well, it's actually pretty straightforward.  I downloaded the CDC MVX codes from the link above and wrote them into a coding system, with code being the MVX code and display being the manufacturer name.  When I create the Organization, name gets the value of display, identifier gets the value of code, and system remains the same.  It works like a charm.
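As a sketch of that mapping (in Python, with a hypothetical helper function; the system URI is the one FHIR publishes for the CDC MVX table, and the sample code/display pair is illustrative):

    # Hypothetical helper: turn one MVX table entry into a FHIR Organization
    # resource, represented here as a plain dict.  The code becomes an identifier,
    # the display becomes the Organization's name, and the system carries over.
    MVX_SYSTEM = "http://hl7.org/fhir/sid/mvx"

    def mvx_to_organization(code, display):
        return {
            "resourceType": "Organization",
            "name": display,                  # display -> Organization.name
            "identifier": [{                  # code -> Organization.identifier
                "system": MVX_SYSTEM,
                "value": code,
            }],
        }

    org = mvx_to_organization("PFR", "Pfizer, Inc")

The same dict can then be serialized as the Immunization.manufacturer target, with the MVX code still findable via the identifier.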

     Keith