Time and time again I’ve seen information systems fail under real load. It happened again today in my home state, when the state’s vaccine scheduling system failed shortly after it was opened up for scheduling. I tried using it at least five different times today to schedule myself and my wife. I finally gave up, realising that the system was simply under way too much strain. Thirty-second page loads, timeouts, cryptic error messages never meant to be exposed to an end user: these are all classic symptoms of a system under far more load than it can handle.
The solution is very simple: you have to test systems under the load they are designed to handle.
The problem is, that’s one of the most expensive tests software developers perform. It can take a month or more just to prepare. A big part of that is simply getting enough synthetic data to test with. The data has to be good enough to pass your validation checks. The system has to be designed so that you can run such a test at scale without committing other systems to act on the test data. It’s hard, it’s expensive, and often by the time you are ready to load test, the project is already late and the product needs to ship, or at least that’s what management says.
And so, against better judgement, the product ships without being tested. And it fails. Badly.
Somewhere, a resource gets locked for longer than it should, and that causes contention, and the system slows to an unusable crawl. It’s an unnecessary table lock in a database. A mutex on a singleton in the runtime library. A semaphore wrapped inappropriately around a long-running cleanup operation. Diagnosis takes days, and then weeks. Sometimes a critical component needs to be rearchitected, or worse, replaced. At other times, the fix is simple, once you finally find the error. And sometimes there’s a host of optimizations that are needed. What would have delayed the delivery of the product by weeks now delays it by months... or even years. Some never recover.
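To make the third of those failure modes concrete, here is a minimal sketch of a lock wrapped around a long-running cleanup. Everything in it is hypothetical: the function names, the eight workers, and the 20 ms sleep standing in for the cleanup are my assumptions for illustration, not taken from any real system. It runs the same workload twice, once holding the lock for the whole cleanup and once holding it only around the shared update.

```python
import threading
import time


def run_workers(hold_lock_during_cleanup, workers=8, cleanup_seconds=0.02):
    """Run `workers` threads; return (wall-clock seconds, items processed)."""
    lock = threading.Lock()
    processed = []  # shared state that genuinely needs the lock

    def slow_cleanup():
        # Hypothetical stand-in for a long-running cleanup operation.
        time.sleep(cleanup_seconds)

    def worker(item):
        if hold_lock_during_cleanup:
            # Contended version: every worker queues behind the cleanup.
            with lock:
                slow_cleanup()
                processed.append(item)
        else:
            # Narrow version: only the shared update is serialized.
            slow_cleanup()
            with lock:
                processed.append(item)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, len(processed)


if __name__ == "__main__":
    wide, _ = run_workers(hold_lock_during_cleanup=True)
    narrow, _ = run_workers(hold_lock_during_cleanup=False)
    print(f"lock held during cleanup: {wide:.3f}s")
    print(f"lock held only for update: {narrow:.3f}s")
```

With the lock held across the cleanup, total time grows roughly in proportion to the number of workers. With one user on a developer’s machine the two versions look identical, which is why this kind of defect tends to surface only under load.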
Often, the skills necessary to engineer the system for a sustained load are simply absent, to the point that the expected loads and response times were never actually provided as design inputs. Nobody computed them, because the system is new, and nobody knew how to without any real-world experience.
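Even without real-world experience, a first estimate of those design inputs is mostly arithmetic. The sketch below uses Little’s Law (average requests in flight equals arrival rate times time in the system). The user count, requests per user, and time window are invented for illustration; they are not figures from any actual scheduling system.

```python
def peak_arrival_rate(users, requests_per_user, window_seconds):
    """Average requests per second if `users` each send
    `requests_per_user` requests within `window_seconds`."""
    return users * requests_per_user / window_seconds


def requests_in_flight(arrivals_per_second, response_seconds):
    """Little's Law: average concurrency = arrival rate x response time."""
    return arrivals_per_second * response_seconds


if __name__ == "__main__":
    # Hypothetical scenario: 100,000 people each make 10 requests
    # in the first 10 minutes after scheduling opens.
    rate = peak_arrival_rate(users=100_000, requests_per_user=10, window_seconds=600)

    # Target response time versus a degraded one.
    healthy = requests_in_flight(rate, response_seconds=0.5)
    degraded = requests_in_flight(rate, response_seconds=30.0)

    print(f"arrival rate: {rate:.0f} requests/second")
    print(f"in flight at 0.5 s responses: {healthy:.0f}")
    print(f"in flight at 30 s responses: {degraded:.0f}")
```

The point of the second figure is that slow responses feed on themselves: at the same arrival rate, a sixty-fold increase in response time means sixty times as many requests being held open at once. Numbers like these are crude, but they are enough to serve as a design input and as a target for the load test.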
Load testing is the most engineering-intensive part of “software engineering”. It’s the kind of effort that distinguishes a true software engineer or system architect from a “computer programmer”.
Discovering that your system won’t operate under the expected load is not the worst thing that can happen to you. Discovering it after you’ve “gone live” is. Now you have three expensive efforts to address simultaneously:
- The political fallout of the disaster to manage.
- The recovery effort on the broken data streams created by the failed system.
- The load testing you should have done in the first place, including the accompanying remediation of issues found from that effort.