Source: Dmitry Ronkonkama
I had been assigned the task of porting a fairly large (about 400 KSLOC) missile launch command and control system to an upgraded OS version and new compiler and language version. Specifically, from Solaris 2.5.1 to Solaris 7, and from the Verdix Ada Development System (VADS), which was Ada 83, to Rational Apex Ada, which was Ada 95. VADS had been bought out by Rational, and its product obsoleted, although Rational did a pretty good job implementing compatible versions of VADS-specific packages to ease the transition to the Apex compiler.
Three other guys helped with the initial compilations, just to get clean compiles of the code, which took about two weeks, and then I was on my own to actually make the whole system work. Long story short, it was the worst design and implementation of a software system I'd ever seen, and so it took about two more months to successfully complete the port. It was then handed over for formal testing, which took several months as well. I fairly steadily fixed the bugs that were found as testing got going, but that rate quickly declined as it progressed (the original code was a production system after all, so its functionality was pretty solid; I just had to kill the bugs that came about from adapting to the new compiler). Eventually I was reassigned to another project once everything appeared to be working as well as the original.
Then came the phone call on the Friday before Thanksgiving.
There was a missile test scheduled in about three weeks, and during a lab countdown test the command sequencing had locked up. In real life this would cause a test abort, and if this lock-up occurred within seconds of ignition, a number of irreversible actions would have taken place in support systems, causing a lengthy--and expensive--delay for reprocessing the missile. The missile would not have launched, but there would have been many, many very unhappy people seriously distressed over issues of time and much, much money. (Don't let anyone ever tell you that the Defense Department is cavalier about spending money--I've yet to meet a contract manager for whom budget wasn't their number 1 or 2 priority, with schedule being the other.)
Now this countdown test and many variations of it had been run hundreds of times in the preceding months, with only a handful of minor glitches. So this problem had a very low probability of occurrence, but unfortunately possessed a very high cost of occurrence. Multiply those together and the product was a bad Thanksgiving week for me and dozens of other engineers and managers.
As the guy who had done the port, I was the one the spotlight landed on.
As with most safety-critical defense systems, a lot of logging is captured, so it was fairly easy to locate the handful of lines of code that had been most recently executed when the system froze. And of course there was absolutely nothing questionable in those lines of code, and these same statements had already successfully executed literally thousands of times during that same run.
We put the Apex guys at Rational on notice, since it was their compiler and some of their vendor-supplied routines were being called in this area, and it was impressed on them (and everyone) that this was a problem of literally national importance that had to be tracked down. So they got their Thanksgiving week trashed as well.
Since the logs could only tell us so much, we needed to try to repeat the problem in the local lab. For something that pops up in only 1 in 1,000 test runs, that's not going to be easy. Amongst the conjectures as to root cause was that a call into the Unlock function of a vendor-supplied mutex (part of a VADS migration package) was not actually unlocking. The processing thread that made this call was handling a heartbeat message that nominally arrived every few seconds. So we tried upping the rate on that heartbeat to 10 Hz, i.e., 10 per second, and kicked it off. About an hour later the system locked up. And when we reviewed the logs, we saw that the same sequence of logged messages was occurring as had taken place in the failed run. Several runs were made, and it would consistently lock up sometime between 45 and 90 minutes after starting, each time with the same log trace. So even though we were not now technically running the same code--because of the increased heartbeat rate--the behavior was consistent, and so we had high confidence that this stressing scenario was triggering the same problem.
The trick now was to figure out exactly where in the sequence of candidate statements the lock-up was occurring.
The implementation of this system used Ada tasking, and used it extraordinarily poorly. Tasking is Ada's high-level concurrency construct, sorta like threads, only built into the language itself. When two tasks communicate, they do it by "rendezvousing", at which time they should exchange any data of interest, and then break the rendezvous and resume their independent executions. This system wasn't implemented that way. Instead, once rendezvous had been made with a target task, that target task would then rendezvous with another task, which in turn would rendezvous with another task, and so on, until eventually some processing would get done, after which all the rendezvous would be broken and each of the tasks would go on their merry way. So what you ended up with was the world's most expensive function calls, bringing an entire, "multi-tasking" process to a halt while it processed a piece of incoming data. It was only because the normal throughput was so low that this hadn't caused performance problems in the past.
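For anyone who hasn't used Ada tasking, here's a minimal sketch of a rendezvous--toy code, not anything from the actual system--just to show the mechanism: the caller blocks on the entry call until the accepting task reaches its accept statement, the two are joined while the accept block runs, and then each goes its own way.

    with Ada.Text_IO;

    procedure Rendezvous_Demo is

       task Receiver is
          entry Deliver (Msg : in String);
       end Receiver;

       task body Receiver is
       begin
          accept Deliver (Msg : in String) do
             -- Both tasks are joined here; the caller stays suspended
             -- until this block ends.  Chaining further entry calls from
             -- inside blocks like this is what turned the ported system
             -- into "the world's most expensive function calls".
             Ada.Text_IO.Put_Line ("got " & Msg);
          end Deliver;
       end Receiver;

    begin
       Receiver.Deliver ("heartbeat");  -- rendezvous: blocks until accepted
    end Rendezvous_Demo;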
The point of this digression about tasking, though, is that when a rendezvous is requested or awaited upon, a "task switch" can occur. This means that the CPU can start processing a different task that's ready to run. So when one task becomes ready to rendezvous with another, a different task may jump in line and get executed, with control eventually getting passed back around to the rendezvousing tasks. Now there are other events that can also cause a task switch, one of which is calling an OS function, like what happens with printing or performing a mutex operation.
So in tracking down exactly which line was causing the problem, I had to find a way to record the progress through the sequence of statements--while not triggering a task switch, which could prevent the problem from occurring. So doing a Put_Line() was not an option; no system I/O of any sort could be done. I could set a counter variable or something like that, but how would I see its value, to tell me how far it got, since I couldn't print it out?
Now one thing that had been observed in the log files about this executable was that while this heartbeat processing froze--which ultimately led to the process' I/O getting all blocked up, and preventing other necessary processing from occurring--other independent tasks within the executable continued to run. So the process as a whole wasn't getting blocked, just a (critical) task chain within it.
This was the wedge needed to get at locating the offending statement.
I created an Ada package containing an enumeration type, a global variable of that type, and a task. The enumeration literals were keyed to the specific statements in the problematic code sequence (like "Incrementing_Buffer_Index", "Locking_Mutex", "Mutex_Unlocked", etc.), and assignment statements that set the global variable to the corresponding enumeration were inserted into that sequence. Because the object code for this was nothing more than storing a constant into a memory location, it was extremely unlikely that a task switch could occur by executing such a statement. In fact, our primary suspicions centered on those statements that involved task switches, since the locking-up behavior was consistent with execution not resuming (for some reason) after a task switch back.
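In outline, the package spec looked something like this--the names here, and details like the Volatile pragma, are illustrative rather than the original source:

    package Progress_Tracker is

       -- One literal per statement of interest in the suspect sequence.
       type Progress_Point is
         (Startup,
          Incrementing_Buffer_Index,
          Locking_Mutex,
          Mutex_Locked,
          Unlocking_Mutex,
          Mutex_Unlocked);

       -- Written by the suspect code, read by the monitor task.  The
       -- assignment is just the store of a constant into memory, so it
       -- can't trigger a task switch.  (Volatile keeps the compiler from
       -- caching the value in a register.)
       Current : Progress_Point := Startup;
       pragma Volatile (Current);

       -- Runs only when the scheduler has switched away from the
       -- heartbeat task chain, so its I/O can't perturb the code
       -- being watched.
       task Monitor;

    end Progress_Tracker;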
The monitoring task then itself did nothing more than loop and periodically check to see if the global variable had changed value. Every time it did, it printed out the value to a file. It then delayed for a small interval, and made its next check. Now the reason I could write to a file from this task was that this task only ran when a task switch had occurred back in the problem area and this task had been selected to run. Whatever was done in this task should have no effect on other, unrelated, blocked tasks.
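The body was little more than that polling loop; again this is a sketch, with the polling interval and file name being made up:

    with Ada.Text_IO;

    package body Progress_Tracker is

       task body Monitor is
          Log  : Ada.Text_IO.File_Type;
          Last : Progress_Point := Current;
       begin
          Ada.Text_IO.Create (Log, Ada.Text_IO.Out_File, "progress.log");
          loop
             if Current /= Last then
                Last := Current;
                -- I/O is safe here: this task is only running because a
                -- task switch already occurred away from the suspect code.
                Ada.Text_IO.Put_Line (Log, Progress_Point'Image (Last));
                Ada.Text_IO.Flush (Log);
             end if;
             delay 0.01;  -- poll well faster than the 10 Hz heartbeat
          end loop;
       end Monitor;

    end Progress_Tracker;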
The behavior that was anticipated here, then, was that when the problem code area was entered it would do its thing and keep resetting the global variable as it progressed past each statement. It would then do something that caused a task switch, and because its execution rate (10 Hz) was slower than the monitoring task's, the monitor could grab the value of the global variable and write it out. So under normal behavior I would expect to see a repeating sequence of a subset of the enumerations, specifically each of those values that the variable last held before a task switch occurred. And when the freeze happened, that global variable value should no longer change, and the last one recorded would indicate exactly which statement execution never resumed from.
Ran the instrumented executable. It froze up. And the monitoring worked like a charm.
The logging of the progress-monitoring variable displayed exactly the anticipated sequence, and it eventually ceased with a value corresponding to the call into the mutex Unlock function. The value that should have come next, signaling the resumption of the task, never showed up--though it had in the thousands of previous invocations.
So over to you, Rational. The Apex engineers during this time had been feverishly analyzing their code and had found a place in the mutex code where it could theoretically block for good, but the odds of that happening were very remote because of everything that had to happen with the right sequencing and timing. Murphy's Law, guys, Murphy's Law.
What I did to work around this was to replace the calls to the vendor's mutex functions (which were built atop the OS' mutex functionality) protecting this particular sequence of code with a quick little native Ada mutex package, and use that to control access to the relevant area.
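In Ada 95 the natural building block for that kind of thing is a protected object; the replacement was roughly of this shape (a sketch, not the actual 20 lines that later got inspected, with spec and body shown together for brevity):

    package Simple_Mutex is

       protected type Mutex is
          entry Lock;        -- blocks while another task holds the lock
          procedure Unlock;  -- releases it unconditionally
       private
          Held : Boolean := False;
       end Mutex;

    end Simple_Mutex;

    package body Simple_Mutex is

       protected body Mutex is

          entry Lock when not Held is
          begin
             Held := True;
          end Lock;

          procedure Unlock is
          begin
             Held := False;
          end Unlock;

       end Mutex;

    end Simple_Mutex;

A caller in the problem area just declares a Mutex object and brackets the critical section with Lock and Unlock calls.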
I put this into the code and reran the test. Seven hours later it was still running.
My mutex package code was given to Rational who compiled and disassembled it and verified that it was not using the same approach that the problematic mutex functions were using.
I then had the most well-attended code inspection of my career :-) There were nearly a dozen engineers and managers in the room with me, and at least another dozen dialed in from all over the country, all to inspect about 20 lines of code.
It passed, the new executables were formally built, and it was handed over to the test organization for formal regression testing. A couple weeks later the missile countdown proceeded flawlessly and away it went.
It's a good thing I like cold turkey.
Okay, this is all well and fine, but what's really the point of a coding war story?
This was a nasty, nasty problem. There was concurrency, over a dozen communicating processes, hundreds of KSLOCs, poor design, poor implementation, interfaces to embedded systems, and millions of dollars riding on the effort. No pressure, eh?
I wasn't the only developer working on this problem, though having done the original port I was of course the primary focus. But even though I did the porting, that doesn't mean I had intimate knowledge of hundreds of thousands of lines of code--or even a decent overview of it. Other engineers around the country were looking through the code and the logs as well, but I found that when they proposed a hypothesis about a root cause, it never took more than 30 seconds on my part to dismiss it; likewise, when I was asked to provide various analyses, I would shove the work off onto someone else because it was clear to me they were on the wrong track. Sound like arrogance on my part? Well, yeah, it does, but that's not why I dismissed these hypotheses and requests.
It was because I knew what the nature of the problem was. I didn't know exactly where it was occurring, nor why it was occurring, but I did know what was happening.
I've built up a lot of experience and knowledge over the years--I was an early adopter of Ada, I understand concurrency and its pitfalls, I know how Ada runtime libraries handle tasking and concurrency, and I understand low-level programming at the level of raw memory, registers, and assembly language. In other words, I have deep knowledge of my niche of the industry. All of that was brought to bear in successfully tracking down this problem--not just working around the bug, but understanding how to put together an approach to finding the bug in a very sensitive execution environment.
The specifics of a coding war story probably aren't all that interesting to those who aren't familiar with the particulars of its nature and environment, but they are useful for gleaning an understanding of what it takes to solve really difficult problems.
To solve the really difficult problems you need to be more than a coder, you have to understand the "fate" of that code, how it interacts with its environment, and how its environment itself operates.
Then you too can get your Thanksgiving holiday all messed up.