Monte Python Simulation: misunderstanding Monte Carlo

I recently found myself in yet another circular Twitter discussion of estimation, in which the One True Way to scope work under uncertainty ranged from abandoning estimation entirely to applying formal Cost Accounting methods, with nothing less sufficing. I’ve talked about this at length and I will happily excise any comments that get into #noestimates.

One topic that came up from the Cost Accounting camp was the use of numerical methods as estimation tools, in particular Monte Carlo Simulation. I questioned the method’s applicability in this case and ended up in a side conversation with the lovely Troy Magennis, who builds open source statistical modelling tools as a hobby. He asked me to elaborate on this, and in particular on where I think Monte Carlo is useful and where it is misapplied.

Here is the relevant part of the Twitter conversation for context, with emphasis added:

Him: Yes, and you need to define the precision and accuracy “needed” for the estimate to be useful before starting the effort to produce the estimate. A ROM may be all that is needed ±10X. An 80% confidence of “on or before” may be needed before starting work. Credible estimating is a continuous process of refining the estimate with actual data, updates to the model that produced the estimate, and corrective and preventive actions from Continuous Risk Management.

Me: Credible estimating is doing the bare minimum you can get away with to materially support the decisions you are trying to make. Any more than that is just work creation for project managers. “80% confidence of on-or-before” is a meaningless term for a single project. It means “If we carry out this exact project a statistically significant number of times (>20, say) then 80% of those will be within this date.” But we will carry it out exactly once. Ever. Statistics matters.

Him: Most certainly not. This is “proposal submittal” criteria @ NASA. Number is risk informed (reducible & irreducible) from Monte Carlo simulation using Reference Class Forecasting of previous programs described in the “Past Performance Section” of the proposal. Yes, informed by “value at risk”. You’ve got $10k at risk, bare minimum is much different than [when] you’ve got $10B at risk. Context and domain needed before any platitudes are useful.

Me: and any Monte Carlo simulation has parameters that are guesses, with probability distributions that are more guesses. The more times you do something the more likely you are to choose good parameters and curves, but it takes time and is expensive.

So here goes, but first a story.

Statistics for Microbiologists

I did a degree in Mathematics with Computer Science about 100 years ago, or more accurately in the late ’80s, and I remember my surprise at how badly statistics was taught in non-mathematical subjects. I had a friend doing Microbiology, and the rule was that they had to pass the Statistics module at some point during their degree course. If they failed the module in the first year they would repeat it in the second year, and so on until they passed.

My friend Mary was a smart girl and a good student, and she kept failing this stats module. She was convincing herself she was useless at maths and would never make it as a microbiologist. I asked to see the past papers and they didn’t look that difficult, so I asked her to show me her stats textbook. Its title was “Statistics for Microbiologists” and my immediate thought was “Why would microbiologists need their own statistics book? Statistics is statistics!” Unsurprisingly the author was one of the microbiology lecturers, who would make a few royalty bucks each academic year by inflicting their book on all their students.

And it was shockingly bad! It didn’t make any sense, and was clearly written by someone who didn’t understand statistics. To misquote Pauli, it wasn’t just bad, it was not even wrong. To add insult to injury, the faculty considered teaching statistics a short straw, so some junior lecturer or other would, reluctantly, drone through the content week after week, and would dread seeing the same faces back the next year after they had once again flunked both the module and the retakes.

So I offered to teach Mary statistics in return for coffee and chocolate—they were simpler times—and I remember her response as she sliced her way through one past paper after another. “Is that it? Is that all there is to it?” Part incredulous, part furious at how much of her time and energy she had wasted in these pointless lectures.

Pretty soon I was hosting a stats tutorial for all her microbiology pals, and sure enough they all stormed the stats module too. I’m not telling this story to show you what a great stats teacher I am, but to suggest that the standard of stats teaching outside the Maths faculty was so poor that even I could do a better job (based on a sample of one college, oh the hypocrisy), which may go some way to explaining why statistics is so often misapplied. I’m sure you’ve seen the many articles about how we get Bayesian statistics just as wrong.

So then, on to Monte Carlo. I’ve seen an increase recently in people saying they want to use a Monte Carlo simulation in order to estimate likely project length, or more specifically to “define a 90% confidence level” for a project length.

We are using the wrong tool

Monte Carlo predicts a probability distribution for a number of future trials. We are using it to estimate the result of a single trial.

Monte Carlo is a group of methods for modelling a probability distribution for a given type of event, where that event is controlled by a number of independent parameters. Say you want to decide the location for a new distribution warehouse. You want to site it such that you can be confident 90% of deliveries will be on time. You might use a Monte Carlo simulation to model the distribution of delivery times based on a number of parameters, and use this model to assess which of a number of possible locations you should choose. Once the warehouse is up and running you can build a distribution curve based on real delivery times, and replace your theoretical model with an empirical one.

Monte Carlo modelling allows us to build a theoretical model of the distribution of a set of similar events where it would be impractical to try to build an empirical model. You can use this to build models like the example above, where each event is a delivery that may or may not be on time.

You define a function of several parameters, each of which has its own probability distribution, and use this to carry out a number of simulations. For each simulation you take a random value of each parameter based on its probability distribution, and use that set of values in the Monte Carlo function to derive a sample result. You then build a histogram of these results, and this histogram represents the probability distribution of the event you are modelling.
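
As a minimal sketch of that loop, in Python with NumPy, using the warehouse example above; the parameters, their distributions, and the combining function are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000  # number of simulated deliveries

# Hypothetical parameters for a delivery-time model, each with an
# assumed, independent probability distribution.
loading_time  = rng.normal(loc=30, scale=5, size=N)        # minutes
driving_time  = rng.lognormal(mean=4.0, sigma=0.3, size=N) # minutes
traffic_delay = rng.exponential(scale=10, size=N)          # minutes

# The "Monte Carlo function" combines one sample of each parameter
# into one sample result.
delivery_time = loading_time + driving_time + traffic_delay

# The histogram of results approximates the event's distribution.
counts, bin_edges = np.histogram(delivery_time, bins=50)
print(f"90% of simulated deliveries take "
      f"{np.percentile(delivery_time, 90):.0f} minutes or less")
```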

90% confident, or confident 90% of the time?

For a single event the interpretation is a bit different. A single sample is what it is, or rather what it will be, and for any event with uncertainty we can’t know beforehand what the answer will be on that occasion. You can’t know whether any one delivery will be late.

You could use the probability distribution to price an insurance policy or financial option for that event, which is the principle behind Black-Scholes. In other words you could bet against yourself and use that to hedge potential failure. With Black-Scholes you have a model of how you think the value of a financial instrument behaves over time, and you use this to price an option, which is the right but not the obligation to transact at some agreed point in the future. The value of the option varies over time based on the observed values of the parameters, including how much time remains.
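
To make that concrete, here is a toy Monte Carlo pricing of a European call option (the right to buy at a strike price at expiry), using the lognormal price model that Black-Scholes assumes; all the market numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

S0, K = 100.0, 105.0           # current price and strike price (invented)
r, sigma, T = 0.02, 0.25, 1.0  # risk-free rate, volatility, years to expiry
N = 200_000                    # number of simulated price paths

# Simulate terminal prices under geometric Brownian motion,
# the price model underlying Black-Scholes.
Z = rng.standard_normal(N)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)

# The option pays out only when the price ends above the strike.
payoff = np.maximum(ST - K, 0.0)

# The discounted average payoff estimates a fair price for the option.
price = np.exp(-r * T) * payoff.mean()
print(f"Estimated call option price: {price:.2f}")
```

The estimate converges on the closed-form Black-Scholes price as N grows, and the price is meaningful precisely because the seller writes many such options over time, not because any single one is guaranteed.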

The choice to exercise an option is like a single bet: you will make or lose money on it. Options trading works because you make lots of these bets, and your wins and losses balance out over time, ideally in your favour if your models are any good. Likewise, if you can produce a model of the likelihood of success of software projects, and you were to bet on lots of these projects over time, and they conformed sufficiently well to your model, you could be reasonably confident of success across the entire portfolio of projects over time. But this offers no guarantees about any single project.

As an example, look at surgery success rates. A surgeon will have an outcome histogram over time for a particular procedure. They can tell you the likelihood of various outcomes based on observed results, and ideally based on their own personal success rates. For the surgeon this is a distribution. For you, all you care about is this one procedure. Your outcome is binary—you live or you die—and the data can only tell you what the odds look like over a statistically significant sample size. I see this as a false sense of security, because as humans we can’t make a value judgement between 90% and 95%, or between 95% and 97%. We can probably decide between say 65% and 75%, but then how do you interpret your own appetite for risk if someone says there is a 2-in-3 chance of success vs a 3-in-4 chance?

A single software project will run exactly once. Even if you run the same project again with the same people, things will be different. The people will have changed, the organisation will have moved on, the context will be different. “You never cross the same river twice.”

We are using the tool wrong

Choosing valid inputs for a Monte Carlo simulation is hard!

The Monte Carlo function embodies a number of assumptions, which might be empirically derived or might just be guesses, depending on the information available. Choosing significant independent parameters, understanding their distributions, and defining a function to predict the behaviour of a complex adaptive system are all hard, and all differently hard, and all differently prone to errors and biases.

There are three types of assumptions:

  1. We assume there is a function of a number of independent parameters that can represent values of the event.
  2. We assume:
  • we know the set of parameters that completely describes this function,
  • that these parameters are independent, and
  • we know how to calculate the values (i.e. that we can define the function).
  3. We assume we know the probability distribution of each of these parameters, so we can pick random values for each one and be confident that each value represents a realistic sample for that parameter.

Given all these assumptions, you can carry out a number of simulations. For each simulation you take a random set of values and plug them into your Monte Carlo function to derive a sample result, and you do this again and again to build your histogram.

This approach breaks down if any of the assumptions are incorrect. Specifically:

  1. if there isn’t a function of several parameters, or if there is a relationship between any of the parameters that we don’t understand, such that they aren’t independent.
  2. if we mis-identify the parameters that describe the event, by either over- or under-specifying them.
  3. if we choose the wrong probability distribution for any of the parameters, so we choose “random” values that don’t represent that parameter.
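
The third failure is cheap to demonstrate. In this toy sketch (every number is invented), two candidate distributions for a single parameter have the same mean, yet produce different tails, and so different “90% confidence” answers:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
other_work = rng.normal(40, 5, size=N)  # days; the rest of the model, held fixed

# The same parameter with a mean of 20 days, under two assumed shapes.
dep_normal    = rng.normal(20, 4, size=N)
dep_lognormal = rng.lognormal(mean=np.log(20) - 0.5 * 0.4**2, sigma=0.4, size=N)

for name, dep in [("normal", dep_normal), ("lognormal", dep_lognormal)]:
    total = other_work + dep
    print(f"{name:9}: parameter mean {dep.mean():.1f} days, "
          f"90th percentile of total {np.percentile(total, 90):.1f} days")
```

Both runs agree on the average, but the lognormal’s heavier tail pushes the 90th percentile out, so the “confidence level” you report depends directly on a shape you guessed.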

We are asking the wrong question

“On time and on budget” is meaningless without an indication of value.

Incidental to all this, although just as important, is what question we are modelling when we use a Monte Carlo simulation. We typically care about whether the project will be on time and on budget, and the PMO and other steering folk track this ruthlessly. Although these numbers are necessary in a cost-accounting world, there is more value in knowing time to initial business impact, and modelling the business impact curve. Ideally we should be modelling Cost of Delay and Risk-Adjusted Return on Capital as the genuine economic indicators of the impact of our work, but these often lag too far behind to be useful indicators.

For instance we could hypothesise that by doing some work we could simplify a business process (this is the business impact) and that the simpler process would need fewer people, which would then lower our operating costs (the consequent business value). We can’t directly deliver the business value, but we can deliver business impact and track our hypothesis about business value over time.

Legitimate use of Monte Carlo Simulation

This doesn’t mean you can’t or shouldn’t use Monte Carlo in software development. I have seen a number of situations where people are benefiting from the method.

Example 1: Predicting delivery of features

Teams often work from a backlog of features or other work items and track them in a tool like Jira. Some of these teams have been using Monte Carlo simulations to predict throughput of features based on historical data.

A number of conditions are necessary for this to be valid (a minimal sketch of the simulation itself follows the list):

  • The past work should be representative of the future work. If the last few months have been about adding new features and the next few months are about integrating with other systems, the data is unlikely to be representative.
  • The future delivery context will not change significantly compared to the recent context. If the team is changing or you have yet another transformation initiative or re-org rolling out, this is likely to affect the delivery histogram.
  • The Monte Carlo function should be a reasonable indicator of the past work. It is easy to define a Monte Carlo function. Identifying the parameters that genuinely capture the behaviour, and using them to define a function that reasonably represents the historical and future data, is hard.
  • The team understands the probability distribution of each of the Monte Carlo function’s input parameters. Even if you can identify the factors that affect the rate of delivery, their respective probability distributions might not be obvious, which means each “random” value you draw for a Monte Carlo trial may not be representative, so the histogram won’t model reality.
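
As a minimal sketch, assuming the only history you have is a weekly count of completed items (the numbers are invented, and real tools are considerably more sophisticated), you can resample past weeks to simulate possible futures:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented history: features completed in each of the last 12 weeks.
weekly_throughput = np.array([3, 5, 2, 4, 4, 6, 3, 5, 4, 2, 5, 4])
backlog = 40    # features still to deliver
N = 10_000      # number of simulated futures

weeks_needed = np.empty(N, dtype=int)
for i in range(N):
    done, weeks = 0, 0
    # Resample past weeks at random until the backlog is cleared.
    while done < backlog:
        done += rng.choice(weekly_throughput)
        weeks += 1
    weeks_needed[i] = weeks

# The histogram of outcomes gives the forecast and its "confidence" levels.
for p in (50, 85, 95):
    print(f"{p}% of simulated futures finish within "
          f"{np.percentile(weeks_needed, p):.0f} weeks")
```

Note that this approach quietly assumes each week’s throughput is an independent draw from the same distribution, which is exactly the kind of assumption the list above asks you to check.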

On one programme of a dozen or so teams, several of the teams were using this kind of Monte Carlo analysis to model their anticipated delivery rate for their Product Management team. This allowed the product managers to plan a product roadmap and know when to engage external stakeholders. Over time the teams and the product managers grew to trust these models and integrated them into their day-to-day product strategy.

Example 2: Using Monte Carlo to explore alternatives

For a one-off project you can use Monte Carlo simulation to explore how different assumptions about the input parameters affect the bigger picture, an activity called Sensitivity Analysis. You can ask questions like “If we can constrain the likely values of this parameter in this way, what impact will that have on the likely result?” This kind of analysis can also uncover unexpected correlations between variables you thought were independent.

Say one of the parameters was the delivery date of a key dependency. This is bounded below by the earliest possible delivery date, but theoretically unbounded above (those pesky vendors!). You might model the parameter as a skewed normal distribution, or an exponentially decaying curve, or some other function that fits the general description. You plug in the probability distribution function for the parameter, run the simulation, and get a histogram plot for the project’s likely end date.

What if you could do something with the vendor to change that curve, by say imposing a penalty or reward? This might make the bell much narrower and the long tail much flatter, which would represent a higher likelihood of the vendor delivering close to the promised date (technically a smaller standard deviation). You would then re-run the Monte Carlo simulation and see how this affects the result.

You can experiment with the controlling functions for the parameters in this way to see the impact on the wider model of changing the assumptions for particular parameters. This tells you which changes might give the highest leverage, and conversely which things aren’t worth going after, even if they are tempting. One set of assumptions might give you a smaller standard deviation in the Monte Carlo histogram for the project delivery date, so you would have more confidence the project would be within a given window under those assumptions. Another might bring forward the earliest date, but fatten the tail as well, meaning you have made it possible to deliver sooner but in a way that increases the likelihood of being late.
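
As a sketch of that experiment (everything here is invented: the model of the work we control, both vendor curves, and the assumption that the project finishes when both streams are done), you can run the same simulation under each assumption and compare the resulting percentiles:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
our_work = rng.normal(120, 10, size=N)  # days for the work we control

# The vendor dependency: bounded below at 90 days, long tail above,
# modelled as a shifted lognormal. The penalty scenario assumes the
# same median delivery but a much tighter spread.
scenarios = {"as-is": 0.6, "penalty clause": 0.2}

for name, sigma in scenarios.items():
    vendor = 90 + rng.lognormal(mean=np.log(30), sigma=sigma, size=N)
    total = np.maximum(our_work, vendor)  # done when both streams are done
    print(f"{name:14}: median {np.percentile(total, 50):5.1f} days, "
          f"90th percentile {np.percentile(total, 90):5.1f} days")
```

Comparing the two sets of percentiles tells you how much the penalty clause is actually worth to the delivery date, before you spend any negotiating capital on it.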

Conclusion

Monte Carlo modelling can be a powerful tool in situations where it is impractical or uneconomical to build an empirical model. It builds a histogram that represents a likely distribution of future samples or trials.

It doesn’t make sense to build a Monte Carlo model for a single trial, such as a software project or a single feature, unless you have a valid reason to want a specific confidence level, and the ability to discern between, say, 84% and 93%, which in most cases you don’t. There are still cases where Monte Carlo simulation is useful in software development, such as predicting feature throughput for a statistically significant number of future features, or exploring how changes to the assumptions about the controlling variables affect the resulting distribution.

So don’t be seduced by statistical simulations, manipulated by mathematical models, or otherwise blinded by science theatre, and learn to identify the Monte Python circus.