Halving the time to get a new server

The average time for setting up a new server used to be 30 days. Now the worst case is 12 days, with figures regularly in the region of 2 hours!

A successful trading firm has new ideas all the time. It might be an idea for a new type of trade, or for making an existing trade faster, or for scaling it out to make it bigger. One of the key components of this is technology. Trading software runs on servers, and these servers need to be specified, purchased, built, installed, configured and attached to a network. This means hand-offs between the various teams that do the work. And all this takes time.

For one trading firm the end-to-end time for getting a new server used to be around 30 working days (six weeks) from receiving the request to having the new server ready for use. In many companies this would be seen as quite good, but at the rate this particular firm consumes servers it was a major bottleneck.

We started by mapping the process from request to resolution, using a technique called Value Stream Mapping. This is a straightforward exercise where the team describes each stage of a process and says how long the stage takes and how much of that time is spent doing useful work. For example, a one-hour meeting might only contain 10 minutes of discussion relevant to this process, so you capture 1 hour of elapsed time and 10 minutes of value-adding time. We also capture the time spent between stages as waiting time. We ran this as a low-tech exercise with sticky notes on a wall, because all we wanted was a rough overview to show us where the hot spots were.

We looked at the ratio of the total elapsed time to total value-adding time, and learned that 94% of the end-to-end time was spent doing nothing. This is not unusual when you first start mapping value streams: we often encounter dead time well over 90%. That doesn’t mean people aren’t working, just that if you follow the journey of a server request, rather than looking at the work individual people do, it spends most of its time waiting to have something done to it. This is because we tend to focus on people’s activity rather than the flow of work. This story illustrates why.
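As a rough sketch of the arithmetic behind a figure like this, the snippet below totals elapsed and value-adding time across stages and the waits between them. The `Stage` class, the `waiting_share` function and all of the numbers are illustrative assumptions, not the firm's actual data; only the one-hour meeting with 10 minutes of relevant discussion comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    elapsed_hours: float       # wall-clock time the request spends in this stage
    value_adding_hours: float  # portion of that time spent on useful work


def waiting_share(stages: list[Stage], waits_between_hours: list[float]) -> float:
    """Fraction of end-to-end time in which nothing useful happens to the request."""
    total_elapsed = sum(s.elapsed_hours for s in stages) + sum(waits_between_hours)
    total_value = sum(s.value_adding_hours for s in stages)
    return 1 - total_value / total_elapsed


# Hypothetical numbers for illustration only.
stages = [
    Stage("review meeting", elapsed_hours=1.0, value_adding_hours=10 / 60),
    Stage("build and configure", elapsed_hours=8.0, value_adding_hours=3.0),
]
waits = [40.0]  # hours the request sat in a queue between the two stages

share = waiting_share(stages, waits)
print(f"{share:.0%} of the elapsed time is waiting")
```

The point of counting from the request's perspective is that queue time between stages shows up in the total, even though nobody's timesheet records it.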

As soon as he saw the end-to-end process laid out like this, the manager who owns it identified a number of immediate actions he could take to speed up the turnaround. He established distributed stand-ups so all the people involved knew the status of all the requests in progress. He introduced a single owner for each request, where before the request just moved from work queue to work queue, getting worked on whenever the next person got to it. This cut one third off the end-to-end time! (It also improved the team’s image because the requester could easily find out where their server was.) Then he identified another change, moving a decision point to earlier in the process, which shaved a third off the remaining time. Now the process was taking about half as long and the people were less busy! Once they had visibility of the whole request lifecycle and they knew who it was for, they were better able to prioritise and sequence their work.
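To see how two one-third cuts add up to "about half", here is the arithmetic as a small sketch. It assumes the two reductions compound and takes the 30-day figure as the baseline; the variable names are made up for this illustration.

```python
original_days = 30  # starting end-to-end time quoted in the article

# First change: one third off the end-to-end time.
after_single_owner = original_days * (1 - 1 / 3)
# Second change: a third off whatever time remained.
after_early_decision = after_single_owner * (1 - 1 / 3)

fraction_remaining = after_early_decision / original_days
print(f"{after_early_decision:.1f} days, {fraction_remaining:.0%} of the original")
```

Under those assumptions the process lands at roughly 13 days, or about 44% of where it started, which is in line with the 12 or so days quoted elsewhere in this story.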

The final piece was to introduce a small buffer of servers built to a standard specification in each of the data centres. This meant that if someone requested a new server, one could be available within a couple of hours. Of course the buffer needed to be restocked which still took around 12 days, but this was now happening outside of the provisioning process so the end users weren’t affected. We spent some time tuning the buffer sizes to balance the sunk cost of servers against demand.
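One way to reason about buffer sizes is sketched below. This is not how the team actually tuned theirs, which the story does not describe. It assumes requests arrive independently at a steady average rate (a Poisson model) and asks for the smallest buffer that covers demand over the restocking period with a chosen probability. The function name, the demand rate and the 95% target are all hypothetical; only the 12-day restocking time comes from the text.

```python
import math


def buffer_size(requests_per_day: float, restock_days: float,
                service_level: float = 0.95) -> int:
    """Smallest stock level that covers demand during the restock period
    with probability >= service_level, assuming Poisson-distributed demand."""
    if not 0 < service_level < 1:
        raise ValueError("service_level must be strictly between 0 and 1")
    mean_demand = requests_per_day * restock_days
    term = math.exp(-mean_demand)  # P(demand == 0)
    cumulative = term
    k = 0
    while cumulative < service_level:
        k += 1
        term *= mean_demand / k    # P(demand == k) derived from P(demand == k - 1)
        cumulative += term
    return k


# Hypothetical demand: one request every two days in a given data centre,
# with the 12-day restocking time from the article.
servers = buffer_size(requests_per_day=0.5, restock_days=12)
print(f"Hold {servers} standard servers in this data centre")
```

Raising the service level or the demand rate increases the buffer, which is the trade-off against the cost of idle servers described above.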

Of course this isn’t the end; in fact it’s only the start. The team has since made a number of incremental improvements to the process, like talking to the traders so they can anticipate demand, and publishing graphs of their improved delivery record over time.
