No risk, no fun
[Cost of Delay series — Part VII: “Risk management”]
“The risk of a wrong decision is preferable to the terror of indecision.”
(Maimonides)
In previous articles, we discussed two types of financial benefit on the cash income side (which we called Maintenance and Growth) as well as one on the cash outflow side (Cost savings). That leaves us with one category to tackle, Risk management: how do we capture the benefit of avoiding a future cash outlay in Cost of Delay terms?
In Part IV, we saw how all Cost of Delay modeling comes down to understanding how a delay in the execution of a project affects the project’s payoff function. We’ve already reviewed several mechanisms: a delay can impact the payoff by pushing it out in time, by reducing the peak level… In this article, we’ll examine how a delay in execution can affect the probability of realizing the payoff curve. Such is the nature of cost avoidance projects: we know what we want to do to avoid a certain future cost, but what we don’t know for sure is when we would incur that cost, how likely we are to incur it if we take action today, and how much likelier if we wait to take action. In our risk management category, the modeling job is to look at the uncertainty around execution time as well as the uncertainty in the payoff function.
A first case we can think of is uncertainty around execution timing combined with a payoff function with known timing. This is a setup we are all familiar with: deadlines with financial consequences. Think about a situation like this. We have a custom project to deliver to a customer, and for every month late we are contractually committed to pay a penalty. This gives us a very easy base payoff function: it is literally already expressed as an amount per unit of execution delay. Say our delivery deadline is 6 months from now, and we think the project will take about 4 months. So it’s pretty safe to say that the Cost of Delay of waiting a week is currently close to 0, and waiting another week would probably be fine. Having said that, if we wait 8 weeks and only start the work 4 months before delivery, that feels… risky. HOW risky depends of course on the uncertainty around our execution time estimate. Is it 4 months, plus or minus 1 week, or plus or minus 3 months?
Starting from some probability curve around our estimate of execution time, we can also estimate the Cost of Delay. If the penalty for late delivery is 1,000 per month of delay, and we estimate the probability of being one month late with six months to go at 5%, our Cost of Delay estimate is 50 per month. If we don’t start the work and re-evaluate one week later, the probability of being late will have gone up a bit, and our estimated Cost of Delay will creep up week after week. This isn’t rocket science, and keeping our estimates simple is usually fine. This is a case where being approximately right is good enough. Our Cost of Delay models are not competing with the cutting edge in modeling and decision making theory. They are competing with doing nothing and taking decisions without any basis at all.
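As a minimal sketch of that week-by-week re-evaluation (in Python), the 1,000-per-month penalty matches the example above, while the assumed weekly probabilities of ending up a month late are purely illustrative:

```
# Sketch: estimated CoD of postponing the start of a contractual project.
# The penalty matches the example in the text; the weekly probability
# estimates of being one month late are assumptions for illustration only.
PENALTY_PER_MONTH_LATE = 1_000

# Assumed probability of being one month late, re-estimated for each week we wait.
prob_late_by_week = {0: 0.05, 1: 0.06, 2: 0.08, 3: 0.11, 4: 0.15}

for week, p_late in prob_late_by_week.items():
    cod_per_month = PENALTY_PER_MONTH_LATE * p_late  # expected penalty rate
    print(f"start delayed by {week} week(s): estimated CoD ~ {cod_per_month:.0f} per month")
```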
If the decision is important enough to justify a better CoD model, that implies we need a better probability estimate. We know from the real world that projects are far more likely to be late than to complete early. And we also know that IF they are late, they can get very late indeed… Poking around a little on the web[1] teaches us that this is consistent with the probability profile of the lognormal distribution, which looks something like this.
On the left is the probability distribution of project execution times, and on the right the cumulative distribution. The probability curve shows a rather short left tail of early completion times, followed by a range of completion times where the majority of projects end up, and a longish tail of projects with much longer completion times. If we can make a reasonable estimate of the average completion time and the variability around this average, we can express those in a cumulative distribution profile like the one in the graph. This is a function with time on the x-axis and probability on the y-axis — exactly what we need for our Cost of Delay model. Even better, of course, is when we can collect historical data on predicted and actual completion times for our projects. But no matter how simple or sophisticated our model is, the way to deal with uncertainty around execution time is to construct a cumulative distribution profile: the longer we postpone the start, the further that cumulative probability climbs, and if we wait long enough we are guaranteed to incur the penalty for late delivery.
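To make that concrete, here is one way a lognormal execution-time model could be set up, assuming scipy is available. The 4-month median matches the example above; the spread parameter is an arbitrary assumption, and the CoD shortcut (penalty times probability of missing the deadline) is a deliberate simplification:

```
# Sketch: lognormal model of execution time, used to see how the chance of
# missing the deadline creeps up as we postpone the start.
from scipy.stats import lognorm

MEDIAN_MONTHS = 4.0   # assumed median execution time
SIGMA = 0.25          # assumed spread of log(execution time)
exec_time = lognorm(s=SIGMA, scale=MEDIAN_MONTHS)  # scale = exp(mu) = median

PENALTY_PER_MONTH_LATE = 1_000

for months_left in (6.0, 5.5, 5.0, 4.5, 4.0):
    p_miss = exec_time.sf(months_left)   # P(execution time > time remaining)
    print(f"{months_left} months to go: P(miss deadline) = {p_miss:.1%}, "
          f"rough CoD ~ {PENALTY_PER_MONTH_LATE * p_miss:.0f} per month")
```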
Let’s now have a look at the uncertainty around the payoff timing, which is a lot harder to wrap our heads around. The real-world examples here are the ones we typically think of when we think of risk management: preventive maintenance actions to avoid some form of failure in the future, work to pay down technical debt in our systems, and so on. For this type of project, we don’t know the timing of the payoff (which is a negative payoff, to be clear). Actually, we don’t even know if there will be any payoff moment at all. We may invest team time and effort in eliminating a risk that never materializes. These risk and technical debt reduction projects are often the hardest to quantify. That inevitably also means it is difficult to judge how we should prioritize them relative to the rest of the project portfolio. We never get around to executing these risk reduction projects, because it always feels like they are not urgent. Until it is too late… So how can we put together a reasonable Cost of Delay estimate for a risk reduction project, such that it gets a fair chance competing for team time?
We can break that problem down into two parts: we need to estimate the amount of financial damage (the payoff) if the incident occurs, and we need to deal with the uncertainty of the payoff timing. Our estimates of financial damage will be rather crude, but that’s OK. We probably have a fairly good idea of the one-off costs we will incur if the incident materializes. The cost of equipment we’ll need to bring in, the cost of consultants or lawyers, fines, third-party compensation… even if these will happen somewhat spread out in time, we can assume for all intents and purposes they are all incurred at the moment of the incident itself. And once we have an estimated aggregate lump sum amount, we know (from Part VI) how to convert it to a recurring rate equivalent. To this one-off cost, we need to add an estimate of what the incident will cost us in staff time. We should err on the side of pessimism, because problems tend to proliferate or cascade once things go south, and we truly don’t know what we don’t know. With our estimate of impacted staff time, we shouldn’t make the mistake of quantifying the cost in salary terms; we quantify it in terms of Cost of Delay impact. Handling the incident will be a distraction, but we’d presumably still execute the highest value projects in the queue and sacrifice a less valuable chunk of the portfolio. Whatever percentage that is, we can add that CoD rate to the one derived from the one-off costs. And presto, we have a financial damage estimate expressed as a recurring rate. Not easy, but it’s possible.
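A rough sketch of that aggregation could look as follows. All the cost figures and percentages are invented for illustration, and lump_sum_to_rate() is only a stand-in for whichever lump-sum-to-recurring-rate conversion from Part VI we actually use:

```
# Sketch: financial damage of an incident, expressed as a recurring rate.
def lump_sum_to_rate(lump_sum, horizon_months=12):
    # Placeholder for the Part VI conversion: here we simply spread the
    # one-off amount over an assumed planning horizon.
    return lump_sum / horizon_months

one_off_costs = {                      # assumed one-off costs if the incident hits
    "replacement equipment": 120_000,
    "consultants and lawyers": 60_000,
    "fines and compensation": 90_000,
}
one_off_rate = lump_sum_to_rate(sum(one_off_costs.values()))

portfolio_cod_rate = 150_000   # assumed CoD rate of the work we'd have to displace
sacrificed_share = 0.30        # assumed share of the portfolio we'd put on hold
staff_time_rate = portfolio_cod_rate * sacrificed_share

damage_rate = one_off_rate + staff_time_rate
print(f"estimated financial damage: ~{damage_rate:,.0f} per month")
```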
How can we then incorporate the uncertainty around incident timing? Just like we did with the uncertainty around execution timing, this comes down to modeling a cumulative probability distribution. For our purpose, we’re not interested in predicting exactly when an incident will occur — which is what the probability distribution tries to tell us. Instead, we want to predict the chance that some form of unpleasantness will have happened at or before a certain moment in time — which is what the cumulative distribution is for. As an example, assume the failure of some system component would have catastrophic consequences. And suppose we have a pretty good idea of the typical lifespan of that component, say 5 years. The probability of failure will rise over time: the probability it fails tomorrow is close to zero, but by the time 5 years have gone by the thing will almost certainly have failed at some point: a probability close to one. The cumulative distribution describes the likely path this probability takes between 0 and 1.
If the incident is equally likely to strike at any moment, regardless of how long we’ve already gone without trouble, the function reflecting that is the exponential distribution. It describes a “memoryless” process, in which events occur continuously and independently at a constant average rate. The cumulative distribution is given by F(t) = 1 - e^(-λt). Here’s what it looks like.
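A small sketch of that cumulative probability, calibrating λ from the 5-year typical lifespan as one failure per 5 years; this particular calibration is an assumption, and it implies roughly 63% (rather than near-certain) failure probability by year 5:

```
# Sketch: cumulative probability that the incident has occurred by time t,
# under a memoryless (exponential) model.
import math

LAMBDA = 1 / 5.0   # assumed average failure rate: one failure per 5 years

def prob_failed_by(t_years, rate=LAMBDA):
    """F(t) = 1 - exp(-lambda * t)."""
    return 1 - math.exp(-rate * t_years)

for t in (0.1, 1, 2, 5, 10):
    print(f"by year {t}: P(failed) = {prob_failed_by(t):.1%}")
```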
Combining this probability with the financial damage produces our Cost of Delay estimate. The increasing probability will push CoD up as time goes by, which will increase our risk reduction project’s priority relative to the other projects. To illustrate that, assume in the next example we’ve identified a risk quantified at a CoD of about 83K per month; once we overlay the probability, that is the level the grey curve in the graph below approaches in the fullness of time.
The colored lines in the graph reflect projects with a CoD of 20K, 80K and 130K per month respectively (with the 80K regime illustrating some variability over time in the CoD estimate). Compared to these alternative uses of team time, it doesn’t take too long for the risk reduction project to become higher priority than the 20K / month project. Assuming the x-axis in this hypothetical example is in years, after only 2–3 months it becomes higher priority for the team to remove the financial overhang of the incident. Compared to 80K / month as the marginal project, the risk reduction project becomes roughly equal priority after 18–24 months. Compared to 130K / month, it never does. As the wiggly 80K line illustrates, it can be useful to keep a list of risk reduction projects nearby, and take advantage of a temporary slowdown in high value projects to put them onto the team schedule.
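Here is a sketch of how those crossover points could be computed. The failure rate is an assumption, picked so that the numbers roughly reproduce the graph described above; the 83K per month damage rate comes from the example:

```
# Sketch: the effective CoD of the risk reduction project is the damage rate
# weighted by the cumulative probability that the incident has already happened.
import math

DAMAGE_RATE = 83_000   # per month, from the example
LAMBDA = 1.7           # assumed incident rate per year (chosen to match the graph)

def risk_cod(t_years):
    return DAMAGE_RATE * (1 - math.exp(-LAMBDA * t_years))

for name, cod in (("20K project", 20_000), ("80K project", 80_000),
                  ("130K project", 130_000)):
    crossover = next((m for m in range(1, 121) if risk_cod(m / 12) >= cod), None)
    if crossover is None:
        print(f"{name}: risk reduction never becomes higher priority")
    else:
        print(f"{name}: risk reduction overtakes it after ~{crossover} month(s)")
```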
Actually, the way we’ve looked at the CoD of various projects so far is not ideal for the purpose of relative prioritization. The reason is we haven’t taken into account how long the execution of each project will take. A better way to compare projects is dividing Cost of Delay by the execution time (duration), producing the so-called “CD3” (Cost of Delay Divided by Duration). Suppose the risk reduction project in our example takes 2 weeks to execute, versus 2, 4 and 8 weeks for the 20K, 80K and 130K regimes respectively. That produces the following comparison, which paints quite a different picture. Driven by its short execution time, taking out the risk now looks more attractive much earlier relative to the other opportunities. Viewed through this lens, the risk reduction project even takes priority over the 130K one as early as 6 months in.
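A sketch of the CD3 comparison at the 6-month mark, using the durations from the example and the same assumed failure rate as in the previous sketch:

```
# Sketch: CD3 = Cost of Delay divided by duration, evaluated 6 months out.
import math

DAMAGE_RATE = 83_000   # per month, from the example
LAMBDA = 1.7           # assumed incident rate per year, as before

def risk_cod(t_years):
    return DAMAGE_RATE * (1 - math.exp(-LAMBDA * t_years))

projects = {                       # (CoD per month, duration in weeks)
    "risk reduction": (risk_cod(0.5), 2),
    "20K project":    (20_000, 2),
    "80K project":    (80_000, 4),
    "130K project":   (130_000, 8),
}

for name, (cod, weeks) in projects.items():
    print(f"{name}: CD3 = {cod / weeks:,.0f} per week of duration")
```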
To model the probability path, we aren’t limited to the exponential distribution, which assumes a constant incident rate over time. For some incidents, the likelihood may indeed be the same at any moment. But we also know that many types of incidents don’t behave like that: their likelihood may decrease or increase over time. The source of potential trouble may manifest early (teething problems, infant mortality…) or may be skewed towards longer timeframes (aging, wear and tear…). If we want to reflect this dynamic in our CoD model, we can use the Weibull distribution, with a shape parameter <1 or >1 respectively. (The exponential distribution is a special case of the Weibull distribution, with a shape parameter equal to 1.)
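A sketch of the Weibull alternative; the shape values are illustrative, and the scale is kept equal to the characteristic time of the exponential sketches above so the curves are comparable:

```
# Sketch: Weibull cumulative failure probability. shape < 1 front-loads the risk
# ("infant mortality"), shape > 1 back-loads it ("wear-out"), and shape = 1
# recovers the exponential case.
import math

SCALE = 1 / 1.7   # assumed characteristic time, matching the exponential sketches

def weibull_cdf(t_years, shape, scale=SCALE):
    """F(t) = 1 - exp(-(t / scale)^shape)."""
    return 1 - math.exp(-((t_years / scale) ** shape))

for shape, label in ((0.5, "infant mortality"), (1.0, "exponential"), (3.0, "wear-out")):
    probs = ", ".join(f"{weibull_cdf(t, shape):.0%} by year {t}" for t in (0.25, 1, 2))
    print(f"shape {shape} ({label}): {probs}")
```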
In our example, altering the cumulative distribution to reflect an “infant mortality” case now looks like this. Reducing the risk takes priority earlier than with the constant-rate exponential distribution. After about a year, the estimated CoD is roughly the same as that of the 80K project. That makes sense: for this type of risk, we’ll either address it early on, or not at all.
The aging or wear-out case looks like the graph below. Here the opposite is true: because failure is not very likely early on, it takes well over a year before we’d give risk reduction priority over the 20K project. Compared to the 80K one, it takes even longer.
As with any other Cost of Delay estimate, our goal isn’t carrying out highly precise calculations to produce some mathematically correct number. We’re interested in better decision making, and it may not be necessary to be overly worried about Weibull distributions and whatnot. What matters most is that we have the right conversations, make the assumptions in our models visible and open to debate, and ensure there is consistency between our analysis and our decisions. (“If this is what we believe, then that is what we should do.”) If we lay out the CD3 graph, for example, we’re essentially telling ourselves that the risk reduction project eventually becomes an order of magnitude more important than the other things on our To Do list. So even allowing for a lot of imprecision, we’re not consistent with our own analysis if we keep postponing it. In the final article, we’ll reflect some more on such real world considerations, and wrap up with a summary overview of the entire series.
[1] Erik Bernhardsson has a wonderful article on this at https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model