McKinsey recently published an article claiming they can measure developer productivity. This has provoked something of a backlash from some prominent software people, but I have not seen anyone engage with the content of the article itself, so I thought it would be useful to do so.
I am writing this as though the authors have approached me for a technical review of their article. You can think of it as an open letter.
Dear Chandra and team,
Thank you for inviting me to review your article. You are right to assume that this is an area of interest to me; it is a topic I have spent much time thinking and speaking about over the years.
I have read your article and made some notes, which I detail below. This is an important topic and McKinsey is a high-profile company that many senior technical people listen to, so I want to help you write an article that they will find useful and accurate. Apologies that the feedback is so long and detailed; there is a lot I wanted to cover and I do want you to write a great article.
I have identified several areas of weakness, factual errors, and topics for further study. I have also taken the liberty of highlighting some stylistic or editorial issues which you may want to address. Unfortunately, you will notice that I take issue with your core arguments, so at the end I offer some alternative suggestions for effectively measuring and developing software people.
Before we begin, I could not help noticing that all five contributing authors are male. For a company the size and reach of McKinsey, I am surprised that you were unable to find any women to collaborate on an article that addresses the entire developer population. You may want to think about that for next time.
Software development is under-measured? ¶
You lead with the assertion that software development is under-measured. I am not sure how you conclude this, because you make this statement with no supporting evidence—something that happens numerous times throughout the article—so let me offer a counterpoint.
Software development is one of the most ruthlessly scrutinised and measured of all business activities. Historically, from the 1960s and 70s onwards, software development has been an expensive and risky endeavour. Initially the expense lay in buying and operating the machinery; as hardware costs came down and programmer salaries soared, it shifted to remuneration.
It makes sense to scrutinise any risky and/or expensive undertaking, and unfortunately for the software industry, we applied the same reductionist models that had been working for industry—civil engineering and factory production—such as Gantt charts, resource-levelling, and other tools from the industrial era of scientific management. This assumption that software development has the same reducible complexity as assembling a car lies at the heart of the developer productivity challenge.
The agile movement grew in the 1990s out of frustration with the poor outcomes of multi-year programmes of work which, even though they were planned, measured and scrutinised to the finest detail, still managed to overrun their budgets and deliver the wrong thing. The one thing you cannot call this era is under-measured. Mis-measured, for sure—and there are ways to measure software development that do track productivity and highlight areas of concern—but never under-measured!
To cut to the chase, I see two main planks to your thesis, which I will return to as we go through the content, and which are both erroneous:
- Software development is a reducible activity, and can be measured with reductionist tools.
- Software development is primarily about coding, and anything other than typing code into a computer terminal is waste which we should seek to eliminate.
I hope to explain why both of these are incorrect as we go through.
Gen-AI tools make developers twice as fast? ¶
You make several assertions in your introduction. I want to highlight two of them, because I think it will help you understand where there are other weaknesses in the article.
Generative AI tools may enable developers to produce code twice as quickly. I do not know, I have not measured this or looked at the research, but I will take your word for it. However, after three decades of hands-on development experience in many large and small organisations, using many programming languages, paradigms, and technologies, and across many business domains, and having consulted in countless others, I can confidently assert that most of programming is not typing code.
Most of programming is learning the business domain; understanding the problem; assessing possible solutions; validating assumptions through feedback and experimentation; identifying, assessing and meeting cross-functional needs such as compliance, security, availability, resilience, accessibility, usability, which can render a product anything from annoying to illegal; ensuring consistency at scale; anticipating likely vectors of change without over-engineering; assessing and choosing suitable technologies; identifying and leveraging pre-existing solutions, whether internally or from third parties.
I am only scratching the surface here. As you can see, the typing-the-code part is only a modest part of all of this, so using Gen-AI to speed that up, while useful, is hardly “making developers twice as fast”.
Initial results are promising? ¶
In his book Thinking, Fast and Slow, Professor Daniel Kahneman introduces the Law of Small Numbers: the tendency of researchers to generalise from small data sets. He refers to this as erroneous intuition.
You have 20 case studies, which may seem like a lot of engagements for a consulting business, but it is still far too small a sample to generalise from. There are other methodological issues too:
- There does not seem to be any control basis: similar teams where you did not apply this method, to act as a baseline, or where you tried something else as an alternative hypothesis.
- You fail to consider the Hawthorne Effect, whereby teams behave differently when they are aware they are being observed, although this is disputed.
- You do not seem to isolate for these metrics. What else may have been going on at the time that could explain these improvements? Is this approach causal, just correlated, or coincidental?
I appreciate that this is an informal article rather than a peer-reviewed paper, but presenting a newly minted methodology as showing “promising results” with what is effectively anecdotal evidence could be misleading.
DORA and SPACE have authors! ¶
Speaking of academic rigour, you are clearly fans of the DORA Accelerate research and the later SPACE model, as am I, but you fail to cite Dr. Nicole Forsgren, the lead researcher on both initiatives, or any of her coauthors. Dr. Forsgren applied significant academic rigour to the original State of DevOps Report surveys run by Puppet Labs, which led to the emergence of this defensible research and the creation of DORA over several years.
This research has been analysed, critiqued, and studied by industry analysts, and has stood up to scrutiny. It is practically the only independent and academically robust research we have on agile methods, and our industry has a history of erasing female contributors to software engineering, so I presume you will rectify this in the next revision.
Contribution analysis? ¶
This whole segment, which appears to be a cornerstone of the paper, is unfortunately simply misguided. It falls into the fallacy that software development is reducible, like building a wall, where you can measure who lays the most bricks, or whose work is the neatest or needs the fewest corrections. I wrote an article recently that illustrates how misleading individual contributor metrics can be, especially for more senior developers. I expand on this below.
The second fallacy is that programming is mostly about writing code. In my experience, coding is generally the easier part. This is not to trivialise it; writing good code takes experience. In particular, writing less code takes many years: staying focused on the immediate goal, and knowing the language, its core libraries, and its supporting ecosystem well enough that you can make choices that keep the code small and lean.
The temptation is to write more than you need “just in case”, or because you do not know that a simpler algorithm exists, or that this has already been solved in the core libraries. This is an easy way to assess the level of experience of a developer: those who contribute less code to solve the same problem, or even who remove unnecessary or redundant code, are often your highest-value developers.
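As a contrived sketch of my own (not an example from your article), here is the same word-counting problem solved twice in Python. The second version leans on the core library, and is both shorter and harder to get wrong:

```python
from collections import Counter

# A hand-rolled version: more code, more state, more places for bugs.
def word_counts_verbose(text):
    counts = {}
    for word in text.lower().split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

# The experienced version: this problem is already solved in the core library.
def word_counts(text):
    return Counter(text.lower().split())
```

Both produce the same result; the second developer "contributed less code" to solve the same problem, which is exactly the signal a contribution-counting metric would penalise.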
Enable those highest-value developers to code? ¶
And this is the nub of the whole article. Your highest-value developers are 10x developers because they enable other developers! Often the least useful thing they can be doing is producing code themselves.
Goldman Sachs notably assesses people based on impact and influence, and the balance of these shifts as they progress through the company. Early in someone’s career, the focus is on impact: what they produce and how they produce it. Later on, it is about how well they influence those around them, not just above and below but laterally across the organisation. Senior staff are rewarded for nurturing a rich network and being able to get things done.
You appear to be adopting a simplistic model where senior people just do what junior people do but more so, which means the goal is to keep their hands on the keyboard. The whole thrust of the article is that the best developers should be doing rather than thinking or supporting.
Next you make the assumption that the backlog is the product. Sure, we can use Jira or Azure DevOps for measuring flow metrics like lead time or throughput, and for building cumulative flow diagrams, but it makes no sense to use them for policing individuals.
The reality is easy to describe but difficult to measure:
- Think more, do less, slash time-to-value.
- Keep things small and simple so you can move fast and adapt.
Let me offer some examples of the power of keeping things small.
When Facebook acquired WhatsApp for its 500 million active users, WhatsApp had 13 engineers. (This is also a testament to the power of Erlang.)
The relational database SQLite runs in pretty much every compute device on the planet: phones, tablets, browsers, servers, laptops. It has millions of automated tests and only three core developers.
In 1999, two Swedish students wrote a fully compliant J2EE Application Server called Orion from scratch, using new features of Java 1.3. The compiled code was ~6 MB, vs IBM WebSphere at a whopping 500+ MB download. Orion’s performance wiped the floor with the competitors from IBM, Oracle, BEA and others. It would do in seconds what took minutes on other servers. Eventually Oracle bought a copy of the code for an enormous, undisclosed amount, and it replaced Oracle App Server.
Inner-outer loop? ¶
I am finding your model of inner and outer loops confusing. They certainly do not reflect my experience of enterprise software development, which I presume is your target audience.
The idea that security or compliance audits and requirements are a separate review activity in some outer loop is in direct conflict with the principle of shifting left that McKinsey promotes. Security and compliance should be an intrinsic part of software development.
In his book Turn the Ship Around!, former US submarine commander David Marquet introduces the idea of inviting inspection. Historically, the crew of a submarine would be suspicious and hostile towards inspectors. Marquet turned this around and had them piped aboard with full ceremony, then asked to see the complete list of issues and suggestions, promising to have them all resolved by the next inspection.
This shift in culture led to a more collaborative relationship with the inspectors, and both they and the crew started looking forward to these inspections. In the same way, high-performing teams welcome experts in security, compliance, accessibility, resilience, and similar cross-functional factors as consultants into their development activities rather than as gatekeepers at the end, and ensure that their architecture, design, even their choice of technology reflect these important business needs.
Meetings, too, are not an outer loop activity. Any collaboration is a meeting! Pair-programming is a continual meeting, during which the pair discusses various design ideas and approaches, tries some of them, and co-creates a solution; likewise ensemble programming. Again, “code-test-build” is the easy part; what about solve? That is what these in-the-moment whiteboard sessions, or more structured deep dives, are for, and they should occur inline with other development activities to maximise flow of value creation.
I recognise that there should be a regular cadence for reviews and other structured engagement with external stakeholders, to demonstrate progress and elicit feedback, but that is not what your article is suggesting.
Managing test data is “low value”? ¶
You suggest that managing test data is a low-value activity. As well as the reputational risk of accidentally using live customer data, many territories have legal constraints around use of customer data for anything other than its stated purpose.
I was working at a major ISP that accidentally emailed 16,000 active subscribers to tell them their broadband was being terminated, because a test run had used real customer addresses. As you can imagine, this took considerable time and effort to resolve, as well as causing public embarrassment.
Data cleansing and lifecycle management of test data should be a continual and deliberate activity. I think of it in the same category as security or compliance, and “invite inspection” in a similar way.
Infrastructure work is “low value”? ¶
Optimising the path to live is a critical function of high-performing teams. A fast, automated release pipeline is a key prerequisite for frequent releases, and skill in defining, provisioning and evolving the infrastructure to support this is a differentiator.
I have seen cloud operating expenses differ by 10x or even 100x due to poor choices or a lack of understanding in infrastructure provisioning and management; or, to put it more positively, a 100x reduction in op-ex costs from a focus on getting the provisioning right.
Your categorisation of “low-value” work again seems subject to the idea that any non-coding activities are of low value, which is demonstrably wrong.
Talent capability score? ¶
I have seen many attempts to catalogue skills and capabilities, and I have run several such exercises myself. Working off a generic list of skills and capabilities does not make sense through the lens of the Theory of Constraints: the only capabilities worth investing in are those that relieve an actual constraint on delivery. Instead you should develop a skills catalogue based on the client’s organisation: their technical estate, their domain, their constraints.
When you are putting a team together, key success factors include familiarity with the systems being worked on, with the business domains and processes involved, with upstream and downstream dependencies, with relationships with stakeholders, and with the primary technologies involved. These are the kinds of thing I like to model on a capability map: skills, capabilities and knowledge that are directly relevant to the demand rather than generic and vague.
Your example where a client has “more developers in ‘novice’ capability than was ideal” unfortunately makes no sense. Each developer will have a unique skills profile, being more experienced in some areas and less in others. If you are talking about a lack of capability in a specific area, the example should show this. If you are coming up with some linear function based on all the capabilities, then this is neither actionable nor useful.
Next, how are you or they defining “ideal” in this case, or is this based on a generic curve like a normal distribution? This is the same error that we find in stack-ranking of teams. In a high-performing team, all the players are on one end of the curve! There is no normal distribution and there is no “weakest link”.
In terms of resourcing teams, a lack of capability in an area is irrelevant if the team’s ability to deliver is not constrained by this lack. If it is, then of course the team or organisation should be looking to bridge this gap, but this can often be achieved in other ways, by changing the approach or the solution, say, as much as scrabbling to skill up in an area.
Too complex to measure? ¶
Finally, you make the assertion that engineering is too complex to measure¹, which I disagree with. Applying reductionist thinking and tooling is a category error. “Better” reductionist tooling is just doubling down on doing the wrong thing with the illusion of rigour.
Software development is a collaborative, generative enterprise, and we can easily measure the effectiveness of a team, and more importantly how this is trending over time, using Theory of Constraints and flow-based metrics like lead time and throughput. Take a look at the work of Eliyahu Goldratt and Donald Reinertsen for a deeper understanding of these. An organisation like McKinsey endorsing these approaches would be a great help to the industry.
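To make this concrete, here is a minimal sketch of my own (with illustrative dates rather than real data) of computing two flow metrics, lead time and throughput, from ticket start and delivery dates. Note that these measure the team as a system, with no per-individual attribution:

```python
from datetime import date
from statistics import median

# Illustrative ticket records: (started, delivered) dates for one team.
tickets = [
    (date(2023, 9, 1), date(2023, 9, 4)),
    (date(2023, 9, 2), date(2023, 9, 12)),
    (date(2023, 9, 5), date(2023, 9, 8)),
    (date(2023, 9, 6), date(2023, 9, 7)),
]

# Lead time: elapsed days from starting a ticket to delivering it.
lead_times = [(done - start).days for start, done in tickets]

# Median is more robust than the mean for skewed lead-time distributions.
median_lead_time = median(lead_times)

# Throughput: tickets delivered per week over the observed window.
window_days = (max(d for _, d in tickets) - min(s for s, _ in tickets)).days
throughput_per_week = len(tickets) / (window_days / 7)
```

Tracked over time, the trend in these numbers tells you whether the team's system of work is improving, without pretending to isolate any individual's contribution.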
However, attempting to measure the individual contribution of a person is like trying to measure the individual contribution of a piston in an engine. The question itself makes no sense.
Your conclusion is strong but seems muddled. I love your call to action to learn the basics and to assess your systems, although code coverage is a well-documented red herring, since it is neither necessary nor sufficient to demonstrate quality. You suggest building a plan, which makes sense. Then you conclude that measuring productivity is contextual, while the entire article is promoting your context-free assessment tooling.
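To illustrate the coverage point with a contrived Python example of my own: this “test” executes every line of the function under test, so a coverage tool would report 100% line coverage, yet it would happily pass despite an obvious bug:

```python
# A buggy function: it subtracts instead of adding.
def add(a, b):
    return a - b

# This test exercises every line of add(), giving 100% line coverage,
# but it asserts nothing, so the bug goes undetected and the test passes.
def test_add_runs_without_checking():
    add(2, 3)

# A test that actually checks behaviour would catch the bug immediately:
# assert add(2, 3) == 5  # fails, because add(2, 3) returns -1
```

Coverage tells you which code your tests executed, not whether they verified anything about it, which is why it demonstrates neither the presence nor the absence of quality.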
Some suggestions ¶
Thank you once again for inviting me to review your article. You are tackling an important and meaty topic, and I hope my feedback helps you towards a more suitable and well-founded article. I am all about accountability, and I find it useful to assess individuals to help them with their own professional development as well as for the company’s benefit.
However, rather than measuring the person in isolation as your article suggests, I take a holistic view of poor performance. If we assume we are hiring smart and motivated people—and if we are not then we need to look at our recruiting operation—then if someone is identified as a “weak” performer, I take that as signal that we are letting that person down.
What is it about the system of work—be it the type of work, the suitability of that person for this work, the tooling or support, organisational constraints, or other confounding factors—that is making it hard for this person to succeed? It is rare, although not unheard of, to just “have a bad performer” in my experience. More likely, the person is responding to the context they are in, and if we remove them, then whoever we replace them with will likely fall prey to the same constraints.
In terms of personal development, I like to coach individuals to market themselves, for example by keeping an achievements diary that they can look back on when they have periodic reviews, and to work on their own personal brand in the organisation. I prefer in-the-moment feedback and frequent 1-1 meetings rather than batching up to once or twice a year, which can feel like something of an ambush.
If you are going to assess individuals in a team, then use peer feedback to understand who the real contributors are. I ask questions like “Who do you most want to be in this team, and why?”, or “What advice would you give to X to help them to grow?”, or “What do you want to most learn from Y?” An astute manager will quickly see patterns and trends in this peer feedback, both positive and negative, which are then actionable.
I guess a lot of this comes down to one’s motivation for individual assessment in the first place. If the mindset is “weeding out poor performers” then this suggests an abdication of responsibility; the idea that the flower is at fault for failing to grow rather than the soil or climate. In this context it does not really matter what tool we use to “assess” people, because we are clearly solving the wrong problem.
I wish you the best of luck with the next revision of this article. As I said, it is a meaty and complex topic and I am glad that an organisation with the visibility and reputation of McKinsey is choosing to tackle it in a white paper.
Kind regards, Daniel Terhorst-North
¹ Correction: I misread the section Avoiding metrics missteps, which in fact says to “move past the outdated notion that […] engineering is too complex to measure.” However, my point stands: the article still proposes a set of meaningless reductionist metrics, which makes the title of this section doubly ironic.