Monday, February 8, 2010

Why we can't finish stuff: Bottlenecks and Pareto

Why does the law-making process take much longer to finish than expected? Why does a project schedule consistently overrun? Why does it take so long to write a book?

In any individual case, you can often pinpoint the specific reason for an unwanted delay. But you will find (not surprisingly) that no single reason universally explains all delays across a population. And we don't seem to learn too quickly. If we did learn what we did wrong in one case, ideally we should no longer do such a poor job the next time we try to get something done. Yet, inevitably, we face a different set of bottlenecks the next time around. I will demonstrate that these invariably derive from a basic uncertainty in our estimates of the rate at which we can do some task.

The explanation that follows belongs under the category of fat-tail and gray swan statistics. As far as I can tell, no one has treated the analysis quite in this fashion, even though it comes from some very basic concepts. In my own opinion, this has huge implications for how we look at bottom-line human productivity and how effectively we can manage uncertainty.


The classical (old-fashioned) development cycle follows a sequential process. In the bureaucratic sense it starts from a customer's specification of a desired product. Thereafter, the basic stages include requirements decomposition, preliminary design, detailed design, development, integration, and test. Iterations and spirals can exist on individual cycles, but the general staging remains -- the completion of each stage creates a natural succession to the next stage, thus leading to an overall sequential process. For now, I won't suggest either that this works well or that it falls flat. This just describes the way things typically get done under some managed regimen. You find this sequence in many projects, and it gets a very thorough analysis by Eggers and O'Leary in their book "If We Can Put a Man on the Moon ... Getting Big Things Done in Government".


Figure 1: Eggers and O'Leary's project roadmap consists of five stages.
The stage called Stargate refers to the transition between virtual and real.

If done methodically, one can't really find too much to criticize about this process. It fosters a thorough approach to information gathering and careful review of the system design as it gains momentum. Given that the cycle depends on planning up front, someone needs to critically estimate each stage's duration and lay these into the project's master schedule. Only then can the project's team leader generate a bottom-line estimate for the final product delivery. Unfortunately, for any project of some size or complexity, the timed development cycle routinely overshoots the original estimate by a significant amount. Many of us have ideas about why this happens, and Eggers and O'Leary describe some of the specific problems, but no one has put a quantitative spin on the analysis.

The premise of this analysis is to put the cycle into a probabilistic perspective. We thus interpret each stage in the journey in stochastic terms and see exactly how the process evolves. We don't have to know the precise reasons for delays, just that we have uncertainty in the range of the delays.

Uncertainty in Rates

In Figure 1, we assume that we don't have extensive knowledge about how long a specific stage will take to complete. At the very least, we can estimate an average time for each stage's length. If only this average gets considered, then we can estimate the aggregated duration as the sum of the individual stages. That becomes the bottom-line number that management and the customer take a deep interest in, but it does not tell the entire story. In fact, we need to consider a range around the average, and more importantly, we have to pick the right measure (or metric) to take as the average. As the initial premise, let's build the analysis around these two points:
  1. We have limited knowledge of the spread around the average.
  2. We should use something other than time as the metric to evaluate.
Specifically, I suggest we use speed or rate, rather than time, as the variate in the probability density function (PDF) for each stage.

But what happens if we lack an estimate for a realistic range in rates? Development projects do not share the relative predictability of, say, marathon finishing times, so we have to deal with the added uncertainty of project completion. We do this by deriving from the principle of maximum entropy (MaxEnt). This postulates that if we only have knowledge of some constraint (say, an average value), then the most likely distribution of values consistent with that constraint is the one that maximizes the entropy. This amounts to maximizing uncertainty, and it works out as a completely non-speculative procedure in that we introduce as preconditions only the information that we know. Fortunately, the maximum entropy distribution parametrized solely by a single mean value is well known -- the exponential distribution; I have used that same principle elsewhere to solve some sticky physics problems.
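
To make the rate-based picture concrete, here is a minimal sketch of the single-stage math, assuming each stage consists of a fixed amount of work D completed at an uncertain rate r whose only known property is its mean (D, r, the mean rate, and τ are my own notation for the sketch):

```latex
% Maximum-entropy (exponential) PDF for the rate r, given only its mean \bar{r}:
%   p(r) = (1/\bar{r}) \exp(-r/\bar{r})
% Stage completion time for a fixed amount of work D is t = D/r.
% Changing variables from r to t gives the time PDF and CDF:
\[
p(t) = p(r)\left|\frac{dr}{dt}\right|
     = \frac{D}{\bar{r}\,t^{2}}\, e^{-D/(\bar{r}\,t)}
     = \frac{\tau}{t^{2}}\, e^{-\tau/t},
\qquad
P(T \le t) = e^{-\tau/t},
\qquad
\tau \equiv \frac{D}{\bar{r}}.
\]
% For t much larger than \tau, the density falls off only as \tau/t^2 --
% a fat power-law tail, unlike the exponentially thin tail of the time-based model.
```

The τ here plays the role of the characteristic stage duration used in the figures below.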

If we string several of these PDFs together to emulate a real staged process, then we can understand how the rate-based spread causes the estimates to diverge from the expected value. In the following graphs, we chain 5 stochastic stages together (see Figure 1 for the schematic), with each stage parameterized by the average value T = τ = 10 time units. The time-based convolution is simply a Gamma distribution of order 5, which has a mean of 5*10 = 50 and a mode (peak value) of (5-1)*10 = 40. Thus, the expected value sits very close to the sum of the individual expected stage times. However, the convolution of the rate-based distributions does not sharpen much (if at all), and the majority of the completed efforts extend well beyond the expected values. This turns the highly predictable critical path into a process that routinely suffers severe bottlenecks.
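
As a sanity check on this behavior, here is a minimal Monte Carlo sketch (not the code behind the original figures) that chains five stages under both assumptions: exponentially distributed stage times with mean 10 for the time-based model, and a fixed amount of work finished at a maximum-entropy (exponentially distributed) rate, with characteristic stage time τ = 10, for the rate-based model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_stages = 200_000, 5
tau = 10.0                     # characteristic stage duration (time units)

# Time-based model: each stage time is exponential with mean tau,
# so the five-stage total follows a Gamma distribution of order 5.
time_based = rng.exponential(tau, size=(n_trials, n_stages)).sum(axis=1)

# Rate-based model: fixed work per stage, completed at an exponentially
# distributed (maximum-entropy) rate; stage time = tau / (rate / mean_rate).
rate_based = (tau / rng.exponential(1.0, size=(n_trials, n_stages))).sum(axis=1)

schedule = n_stages * tau      # the naive bottom-line estimate: 50 time units
for name, totals in [("time-based", time_based), ("rate-based", rate_based)]:
    on_time = np.mean(totals <= schedule)
    t60 = np.quantile(totals, 0.60)
    print(f"{name}: P(finish within {schedule:.0f} units) = {on_time:.2f}, "
          f"time for 60% confidence = {t60:.0f}")
```

The exact numbers wobble from run to run, but the qualitative gap matches Figure 2: the time-based total clusters near 50 time units, while the rate-based total routinely blows far past it.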


Figure 2: Convolution of five PDFs assuming a mean time and mean rate

The time integration of the PDFs gives the cumulative distribution function (CDF); this becomes equivalent to the MSCP for the project as a function of the scheduled completion date. You can see that the time-based estimate has a narrower envelope and reaches a 60% success rate for meeting the original scheduled goal of 50 time units. On the other hand, the rate-based model has a comparatively very poor MSCP, achieving only a 9% success rate for the original goal of 50 time units. By the same token, it takes about 150 time units to reach the same 60% confidence level that we had for the time-based model, roughly 3 times as long as originally desired.
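
For the time-based model these numbers can be checked in closed form; a quick check, assuming the Gamma-of-order-5 form stated above (the rate-based figures still need the Monte Carlo sketch earlier, since that convolution has no simple closed form):

```python
from scipy.stats import gamma

five_stage_time = gamma(a=5, scale=10)        # convolution of five exponential(10) stages
print("P(finish within 50 units):", round(five_stage_time.cdf(50), 2))   # ~0.56
print("time for 60% confidence:  ", round(five_stage_time.ppf(0.60), 1)) # ~52 time units
```

The closed-form value lands in the mid-fifties, in line with the roughly 60% success rate read off Figure 2.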

The reason for the divergence lies in the fat power-law tail of the rate-based PDF. The cumulative success probability already suffers for a single stage, clearly diverges from the time-based estimate as we chain stages together, and only gets worse with each stage we add.

Bad News for Management

This gives us an explanation of why scheduled projects never meet their deadlines. Now, what do we do about it?

Not much. We can either (1) get our act together, remove all uncertainties, and perhaps obtain funding for a zipper manufacturing plant, or (2) try to do tasks in parallel.

For case 1, we do our best to avoid piling on extra stages in the sequential process unless we can characterize each stage to the utmost degree. The successful chip companies in Silicon Valley, with their emphasis on formal design and empirical material-science processes, have the characterization nailed. They can schedule each stage with near certainty and trust that the resultant outcome stays within their confidence limits.

For case 2, let us take the example of writing a book as a set of chapters. We have all heard about the author who spends 10 years trying to complete a manuscript. In many of these cases, the author simply got stuck at a certain stage. This reflects the uncertainty in setting a baseline pages-per-day writing pace. If you want to write a long manuscript, slowing down your writing pace will absolutely kill your ability to meet deadlines. You can get bursts of creativity where you briefly maintain a good pace, but the long interludes of slow writing totally outweigh this effect. As a way around this, you can also try to write stream-of-consciousness, meet your deadlines, and leave it in that state when done. This explains why we see many bad movies as well; works of art don't have to "work as advertised" as an exit criterion. On the other hand, a successful and prolific author will never let his output stall when he lacks motivation to write a specific chapter. Instead, the author will jump around and multitask on various chapters to keep his productivity up.

I don't lay complete blame on just the variances in productivity, as the peculiar properties of rates also play a factor. A similar rate fate awaits you if you ever happen to bike in hilly country. You would think that the fast pace going down hills would counterbalance the slower pace going up hills, and that you could keep up the same rate as on flat terrain. Not even close: the slow rates going uphill absolutely drag you down in the long run. That results from the mathematical properties of working with rates. Say you had a 10 mile uphill stretch followed by a 10 mile downhill stretch, both equally steep. If you could achieve a rate of 20 MPH on flat terrain, you would feel good if you could maintain 12 MPH going up the hill. Over 10 miles, the climb alone would take you 50 minutes. But then you would have to go down the hill at 60 MPH (!) to match the 60 minutes that the same 20 miles would take on flat terrain. This might seem non-intuitive until you do the math and realize how much slow rates can slow your overall time down.
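
The arithmetic behind this is the harmonic mean of the leg speeds; a quick sketch of the hill example:

```latex
% Average speed over two equal-length legs of distance d each:
\[
\bar{v} \;=\; \frac{2d}{\frac{d}{v_{up}} + \frac{d}{v_{down}}}
        \;=\; \frac{2\, v_{up}\, v_{down}}{v_{up} + v_{down}}
\]
% Requiring \bar{v} = 20 MPH with v_{up} = 12 MPH:
%   2(12) v_{down} / (12 + v_{down}) = 20
%   24 v_{down} = 240 + 20 v_{down}
%   v_{down} = 60 MPH
% Elapsed time adds as d/v rather than v, so the slow leg dominates the average.
```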

But then you look at a more rigidly designed transportation system such as bus schedules. Even though a bus line consists of many segments, most buses routinely arrive on schedule. The piling on of stages actually improves the statistics, just as the Central Limit Theorem would predict. The scheduling turns out highly predictable because the schedulers understand the routine delays, such as traffic lights, and have the characteristics nailed.
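
To see the Central Limit Theorem contrast numerically, here is a small illustration of my own (the segment parameters are made up for the sketch): a trip built from well-characterized, thin-tailed segment delays versus one built from the fat-tailed rate-based stages discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_segments = 100_000, 20

def relative_spread(totals):
    """Interquartile range divided by the median (robust even for fat tails)."""
    q25, q50, q75 = np.percentile(totals, [25, 50, 75])
    return (q75 - q25) / q50

# Well-characterized bus segments: mean 5 minutes, tight spread (Gamma, shape 25).
bus = rng.gamma(shape=25.0, scale=0.2, size=(n_trials, n_segments))

# Poorly characterized stages: fixed work at a maximum-entropy (exponential) rate,
# characteristic time 5 minutes -- same nominal "average", but fat-tailed times.
fat = 5.0 / rng.exponential(1.0, size=(n_trials, n_segments))

print("1 bus segment:       ", round(relative_spread(bus[:, 0]), 2))
print("20 bus segments:     ", round(relative_spread(bus.sum(axis=1)), 2))
print("1 fat-tailed stage:  ", round(relative_spread(fat[:, 0]), 2))
print("20 fat-tailed stages:", round(relative_spread(fat.sum(axis=1)), 2))
```

Piling on well-characterized segments tightens the relative spread dramatically, just as the Central Limit Theorem promises; piling on fat-tailed stages leaves the total comparatively wide.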

On the other hand, software development efforts and other complex designs do not have the process nailed. At best, we can guess at programmer productivity (a rate-based metric, i.e. lines of code per day) with a high degree of uncertainty, and then we wonder why we don't meet schedules. For software, we can use some of the same tricks as book writing -- for example, skipping around the class hierarchy when we get stuck. But pesky debugging can really kill the progress, as it effectively slows down a programmer's productivity.

The legislative process also has little by way of alternatives. Since most laws follow a sequential process, they become very prone to delays. Consider just the fact that no one can ascertain the potential for filibuster or the page count of the proposed bill itself. Actually reading the contents of a bill could add so much uncertainty to the stage that the estimate for completion never matches the actual time. No wonder that no legislator actually reads the bills that get pushed so quickly through the system. We get marginal laws full of loopholes as a result. Only a total reworking of the process into more concurrent activities will speed things up.

And then the Pareto Principle comes in

If you look again at Figure 2, you can see another implication of the fat over-run tail. Many software developers have heard of the 80/20 rule, aka the Pareto Principle, where 80% of the time gets spent on 20% of the overall scheduled effort. In the figure below, I have placed a few duration bars to show how the 80/20 rule manifests itself in rate-driven scheduling.

Because this curve shows probability, we need to consider the 80/20 law in probabilistic terms. For a single stage, 80% of the effort routinely completes in the predicted time, but the last 20% of the effort, depending on how you set the exit criteria, can easily consume over 80% of the time. Although this describes only a single stage of development, and most people would ascribe the 80/20 rule to variations within the stage, the same general class of behavior holds.

Figure 3: Normalizing time ratios, we on average spend less than 20% of our time on 80% of the phase effort, and at least 80% of our time on the rest of the effort. This is the famous Pareto Principle or the 80-20 rule known to management.
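
As a rough check on this split, we can use the single-stage rate-based model from earlier (CDF exp(-τ/t)), reading the cumulative completion probability as the fraction of the effort done and treating 99% done as the practical exit point (both readings are my own simplifications for the sketch):

```python
import numpy as np

tau = 10.0                                   # characteristic stage time
t_of = lambda f: tau / np.log(1.0 / f)       # time at which a fraction f of the effort is "done"

t80, t99 = t_of(0.80), t_of(0.99)            # 99% done taken as the practical exit point
print(f"time to reach 80% done:  {t80:.1f}")
print(f"time to reach 99% done:  {t99:.1f}")
print(f"share of time on the first 80% of the effort: {t80 / t99:.0%}")
print(f"share of time on the last stretch:            {1 - t80 / t99:.0%}")
```

Under these assumptions, the first 80% of the effort takes only a few percent of the total time and the remaining slice soaks up the rest, which is the asymmetry the figure tries to convey.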

In terms of software, the long poles in the 80/20 tent relate to cases of prolonged debugging efforts, and the shorter durations to where a steady development pace occurs.

These rules essentially show the asymmetry in effort versus time, and the exact split depends quantitatively on how far you extend the tail. A variation of the Pareto Principle gives the 90-9-1 rule (a quick numerical check follows the list):

  • 100 time units to get 90% done
  • 1000 time units to get the next 9% done
  • 10000 time units to get the “final” 1% done
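
If we apply the same single-stage quantile formula t = τ/ln(1/f) with τ = 10, each extra "nine" of completeness costs roughly ten times more calendar time, which lines up with the 100/1000/10000 pattern above to within round numbers:

```python
import numpy as np

tau = 10.0
t_of = lambda f: tau / np.log(1.0 / f)   # time at which a fraction f of the effort is "done"

previous = 0.0
for f in (0.90, 0.99, 0.999):
    t = t_of(f)
    print(f"{f:.1%} done: cumulative time ~{t:,.0f}, cost of this slice ~{t - previous:,.0f}")
    previous = t
```

The factor-of-ten cost per additional "nine" comes straight from the logarithm in the quantile formula.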

In strict probability terms, nothing ever quite finishes, and business decisions will determine the exit criteria. This truly only happens with fat-tail statistics, and again it only gets worse as we add stages. The possibility exists that the stochastic uncertainty in these schedule estimates doesn't turn out as bad as I suggest. If we can improve our production process to eliminate potentially slow productivity paths through the system, this analysis will become moot. That may indeed occur, but our model describes quite well the real development world as we currently practice it. Empirically, the construction of complex projects often takes five times as long as originally desired, and projects do get canceled because of this delay. Bills don't get turned into laws, and the next great novel never gets finished.

The Bottom Line

In this model, lengthy delays arise entirely from the uncertainty we apply to our estimates. In other words, without further information about the actual productivity rates we can ultimately attain, an average rate becomes our best estimate. A rate-derived application of the maximum entropy principle thus helps guide our intuition, and to best solve the problem, we need to characterize and understand the entropic nature of the fundamental process. For now, we can only harness the beast of entropy; we cannot tame it.

Will our society ever get in gear and accommodate change effectively? History tells us that just about everything happens in stages. Watch how long it will take the ideas that I present here to eventually get accepted. I will observe and report back in several years.