Pushing beyond that 3X limit, though, the application started having failures under more stressful loads. I was already well over the system design requirements (we run multiple instances of the application for redundancy), so this wasn't urgent, but it bugged me. I had a few extra hours of unscheduled time to play with it, so I started a side project to finally track it down.
Shortly into the fourth load test (the one that goes just beyond the 3X load mark), the application was restarted by AWS after failing three consecutive health checks. I was supposed to get a log message about the shutdown, but didn't. That turned out to be a signal propagation issue combined with a timing issue (my logger runs in a small separate background process).
The key problem wasn't memory, threading, or CPU load. Heap usage was 1.4GB, well under the 4GB allocated heap (8GB max). The thread count hovered around 40-50, well below the worker thread limit. CPU load might hit 50% at application startup, but was well below 20% at the time of the health check failures. Even the number of in-use TCP connections was well below any server-imposed limit.
As it turned out, the issue wasn't the number of in-use TCP connections, but rather the number of available TCP connections. The root cause was keep-alives in both the application and transport layers. HTTP/1.1 (the application protocol) supports the keep-alive header, which allows a single connection to be used for multiple requests, saving connection setup time. HTTP keep-alive prevents a socket from being closed after the response is sent back, so that it can be reused. Lower on the network stack, TCP also supports keep-alive, which has a slightly different meaning: it ensures that an open socket that goes unused for a while still remains open on both sides (by having the client or server send keep-alive packets).
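To make the two layers concrete, here's a minimal Java sketch (the host and request are illustrative only, not the actual system): the socket option is the transport-layer keep-alive, while the Connection header is the application-layer one.

```java
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class KeepAliveLayers {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            // Transport layer: ask the OS to send TCP keep-alive probes on an
            // idle connection so it isn't silently dropped by either side.
            socket.setKeepAlive(true);

            // Application layer: the HTTP/1.1 Connection header asks the server
            // to leave the connection open for more requests after this one.
            String request = "GET / HTTP/1.1\r\n"
                    + "Host: example.com\r\n"
                    + "Connection: keep-alive\r\n"
                    + "\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // ... read the response here; the same socket could then carry another request.
        }
    }
}
```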
The server I'm working on has some long-running requests that can take up to 10 minutes before a response is provided (the server is an intermediary talking to other, third-party services). Long-running TCP keep-alives are therefore essential for correct operation; otherwise the poor client's connection is dropped before it ever gets a response. AWS load balancers have a default idle timeout of 350 seconds for TCP connections, but this can be changed. I recently had to adjust those settings to support the 10-minute request time (that wasn't the root cause here, but the 350-second default was bad enough).
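As a client-side sketch of what those long-running TCP keep-alives look like in Java (assuming JDK 11+ on an OS that supports the extended socket options; the hostname and timings below are placeholders), the idea is simply to send probes well inside the load balancer's idle timeout so the connection survives the wait:

```java
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class LongRequestSocket {
    public static void main(String[] args) throws Exception {
        // "backend.internal" is a placeholder, not the real service.
        try (Socket socket = new Socket("backend.internal", 443)) {
            socket.setKeepAlive(true);                                    // enable TCP keep-alive
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);     // first probe after 60s idle
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // then probe every 60s
            // ... send the long-running request and wait (possibly minutes) for the response.
        }
    }
}
```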
What it boiled down to was that I'd run out of available connections, not because connections were "active", but because they were being kept alive by a combination of application layer (HTTP) and transport layer (TCP) configuration. I could not rely on the HTTP client NOT using application-layer keep-alives, and the system needed TCP keep-alives for one of its long-running services. Given that an actual request takes 10 seconds to complete in the worst case, having a socket be unavailable for 5 minutes or longer is just too much: that's 30 times as many connections tied up as actually necessary (300 seconds held versus 10 seconds used).
That system has a security requirement that every request/response to the server goes through its own TLS negotiation instead of reusing an existing connection, so there was never any need for HTTP keep-alive. However, embedded Tomcat enables it by default in Spring Boot. The simple solution was to set server.tomcat.max-keep-alive-requests=1.
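For reference, here's a minimal sketch of the programmatic equivalent of that property (assuming Spring Boot with embedded Tomcat and Java 17+; the class and bean names are mine): a value of 1 means each connection serves exactly one request and is then closed, which effectively disables HTTP keep-alive.

```java
import org.apache.coyote.http11.AbstractHttp11Protocol;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TomcatKeepAliveConfig {

    // Programmatic equivalent of server.tomcat.max-keep-alive-requests=1:
    // one request per connection, then the connection is closed.
    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> disableHttpKeepAlive() {
        return factory -> factory.addConnectorCustomizers(connector -> {
            if (connector.getProtocolHandler() instanceof AbstractHttp11Protocol<?> protocol) {
                protocol.setMaxKeepAliveRequests(1);
            }
        });
    }
}
```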
Initial testing confirmed that this resolved the capacity problem. Continued testing showed another 10X improvement in capacity, for a combined 30X load capacity gain with CPU and memory still under 50% utilization. As a result, I can cut production CPU costs substantially by reducing the number of redundant tasks, and lower memory requirements by 50%.
My next minor annoyance (side project) is task startup time.
Keith
P.S. I don't claim to be a network engineer, or a crypto expert, but given the amount of time I spend on networking and crypto in my day job, I might as well claim those skills.