How work works, and why yours doesn’t
[Cost of Delay series — Part II: “Queues and Work In Progress”]
“An Englishman, even if he is alone, forms an orderly queue of one.”
(George Mikes)
In the first article in this series, we looked at the shortcomings of typical approaches to setting budgets for knowledge worker teams. We saw how the budget process usually puts the responsible team leader and their financial controller on opposite sides of the negotiation table. Stereotypically, the team leader inflates the real business need while the financial controller finds (real or imagined) reasons to squeeze the budget request. They haggle about cost because they typically lack a framework to reason about budgets on the basis of business value. This, in turn, is closely related to the difficult real-world question of how to measure a knowledge worker team’s performance. So defining and measuring performance is where we’ll begin the journey to give our team leader and financial controller better tools to work with.
Let’s first be clear about what “team” means. In this context, the term refers to an organizational unit deliberately grouped together to collectively take care of (part of) a value creation process. The organizing principle to consider is a common purpose or activity. What we don’t mean is a bunch of people who share the same boss but are otherwise thrown together somewhat randomly. Think of “the finance department”, where everyone ultimately reports to the CFO but people take care of a variety of activities ranging from filing tax declarations to paying invoices to doing accounting to producing financial reports. Contrast that with e.g. a software engineering team, working together on the maintenance and new feature delivery of a software application. For organizational units managing a wide variety of activities, this simply means we’ll need to break them down into lower-level sub-teams.
If we thus keep our team scope sufficiently small and simple, we should be able to identify or define a countable “unit of work” that is a good proxy for our main business activity. There is of course enormous variety in types of knowledge work, and in the nature and complexity of various knowledge worker teams’ activities. Furthermore, the individual team members may have very different roles, skills and tasks. Nevertheless we should be able to come up with a unit of work reflecting our team’s main reason for being. All we want to know at this point is what that unit of work is. For a team in an insurance company, the countable unit could be “claims processed”. In software engineering, it could be application features or story points; for a team of testers, the number of tests executed. The basic idea is: “If this was a production line, what would we be running on the line every day?” The unit we pick should represent the team’s value-add, though. A simple activity measure, such as “emails responded to”, is a bad choice.
To identify our work unit, we briefly compared our creative knowledge work with physical manufacturing. In many cases it is of course very different, mostly because of the higher variability in creative knowledge work processes. Software engineering is a good example, as it is highly variable in every way. For starters, the arrival rate of new work units can be very irregular, e.g. because of a sudden influx of bug reports following a major product launch. Second, while the production process itself is typically fixed and repeatable in manufacturing, it can be completely bespoke for knowledge work. And even if it isn’t completely custom every time, the completion time to produce a unit can be orders of magnitude different from one unit to the next. Some software features are simple to code, while others are very time consuming and difficult (or even impossible). Finally, the ‘assets’ in the knowledge work process are human beings. They will combine their production work with other activities, such as training, administrative tasks and status reviews with their bosses, and they will take vacations, be on medical leave, etc. This variability is very much the name of the game in creative knowledge work, in contrast with the standardized, predictable and repeatable world of physical manufacturing[1]. For now, all we need to do is identify our unit of work. We’ll deal with all that variability at a later stage.
One way not to deal with the variability is to try and significantly reduce it! It’s fine to standardize and simplify the repeatable parts of the process, and manage the team’s activity following some form of planning. Especially our financial controller will be very tempted to vigorously pursue standardization. But for creative knowledge work, taking standardization and attempts at planning too far is fundamentally wrong. The reason is that the variability is closely related to creativity, innovation, learning — and ultimately value. Imagine we ran an advertising agency or videogame developer… Kill the variation, and we also kill the value. For creative work, we need to embrace and even foster variability. But then what should we optimize for? How do we define “performance” for the team? What do we look at and measure, to determine how we can improve?
Because we have determined our work unit, we can define our performance goal as: “Delivering as many units of work as possible, per unit of time.” This is another way of saying we want to deliver as much customer value as we can, as fast as we can. This is no different from physical manufacturing, where this is referred to as “Throughput”[2]. So our team goal is to maximize throughput, and performance is improving if we succeed at increasing throughput. The question of what our team leader should measure and manage therefore comes down to figuring out what will help or hurt throughput. In the rest of this article, let’s step into the shoes of our team leader and give her the metrics she needs.
The first thing our team leader should analyze is how long it takes to process one work item. The total time from the moment the team starts working on the item until they stop working on it is called “Cycle Time”. This is easy to record; she can start any day, at no cost, using a simple spreadsheet. She doesn’t necessarily need to track every single work unit. Even after only a handful of samples she’ll have data on how consistent or variable cycle time is. The next step, requiring a bit more tracking work with the team, is to record how much of cycle time is actual processing time, when the team is actively doing value-add work on the unit, and how much is non-value-add waiting time. Ideally she also records the activity during every burst of processing time, as well as the reason for the waiting time (“no team member available to do the work”, “waiting for the code to build”, “need review approval to continue”). But that’s even more work, of course. Regardless of the level of detail, the team leader is likely to discover how low processing time actually is. As an illustration, think about an analyst tasked to write a report. Even a fairly simple analysis job will involve some requests for data which take a few days to arrive, then time spent organizing the data (which won’t happen immediately; the analyst will weave it in between other tasks), then some analysis work in a few sittings of 1–2 hours each, and finally a few hours more writing the actual report. From start to end, turning this around in 2 weeks wouldn’t strike us as especially slow. So cycle time is 2 weeks, but the actual productive working time spent by the analyst adds up to something like 8 hours, a full work day. Processing time of 8 hours out of 336 hours total cycle time is 2.4%. Even if you only count office hours, actual processing time is only about 10%. These numbers are by no means exceptional; they’re only shocking the first time we are confronted with the facts.
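To make the arithmetic tangible, here is a minimal sketch of that spreadsheet bookkeeping in Python (the timestamps and burst descriptions are hypothetical, chosen to reproduce the analyst example above):

```python
from datetime import datetime

# Hypothetical tracking record for one work unit: every burst of actual
# value-add processing time, logged as (start, end) pairs.
bursts = [
    ("2024-03-04 09:00", "2024-03-04 10:00"),  # request the data
    ("2024-03-07 14:00", "2024-03-07 16:00"),  # organize the data
    ("2024-03-11 10:00", "2024-03-11 12:00"),  # analysis, sitting 1
    ("2024-03-13 09:00", "2024-03-13 10:00"),  # analysis, sitting 2
    ("2024-03-18 07:00", "2024-03-18 09:00"),  # write the report
]

fmt = "%Y-%m-%d %H:%M"
start = datetime.strptime(bursts[0][0], fmt)   # first touch
end = datetime.strptime(bursts[-1][1], fmt)    # last touch

cycle_time_h = (end - start).total_seconds() / 3600
processing_h = sum(
    (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 3600
    for a, b in bursts
)

print(f"cycle time:      {cycle_time_h:.0f} h")              # 336 h (2 weeks)
print(f"processing time: {processing_h:.0f} h "
      f"({processing_h / cycle_time_h:.1%} of cycle time)")  # 8 h (2.4%)
```

Everything not inside a burst is waiting time; that is the >90% the next paragraph zooms in on.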
Tracking and analyzing cycle time helps the team leader find ways to reduce it and speed up throughput. She’ll see that there is the most to be gained by looking at the >90% waiting time. Contrast this with the reflex of most people — not least her financial controller colleague — to scrutinize the <10% processing time. If actual processing time can be reduced by working more efficiently, there’s of course nothing wrong with that. And obviously, not all waiting time is wasted: in our example, the analyst can do other work while waiting for e.g. the data to come in. But as we’ll see a little later, reducing waiting and cycle time will yield net performance gains, typically bigger than those from tinkering with processing time. So the mental shift the team leader needs to make is to stop looking at the time spent by the team members, and instead focus on the time spent by a work unit. And she should take her financial controller along on the analytical journey.
The team leader’s next metric, again not that hard to start measuring, is the time between the arrival of a new work unit request and its delivery. Arrival could be a customer order, or some other trigger to request work from the team. Delivery often (but not always) coincides with work completion. This metric is called “Lead Time”. It’s different from cycle time, because the team will not immediately start work on a new item the moment it comes in. Likewise, a completed work unit may not immediately be delivered, e.g. because it needs to be tested or is part of a customer delivery batch.
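A minimal sketch makes the two clocks concrete (the four timestamps per unit and their values are hypothetical): lead time runs from arrival to delivery, cycle time only from work start to work end.

```python
from datetime import date

# Hypothetical event log per work unit: arrival of the request, start and
# end of actual work, and delivery to the customer.
units = [
    {"arrived": date(2024, 3, 1), "started": date(2024, 3, 11),
     "finished": date(2024, 3, 15), "delivered": date(2024, 3, 22)},
    {"arrived": date(2024, 3, 5), "started": date(2024, 3, 18),
     "finished": date(2024, 3, 20), "delivered": date(2024, 3, 22)},
]

for i, u in enumerate(units, 1):
    cycle = (u["finished"] - u["started"]).days
    lead = (u["delivered"] - u["arrived"]).days
    print(f"unit {i}: cycle time {cycle} d, lead time {lead} d")
# unit 1: cycle time 4 d, lead time 21 d
# unit 2: cycle time 2 d, lead time 17 d
```

Note how a unit can have a short cycle time and still a long lead time, because it waited before work started and again before delivery.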
Looking at lead time, the team leader takes the perspective of the customer, who will be utterly unimpressed by a team cycle time of 2 days if their lead time is 3 months. Remember we defined throughput as putting customer value in customer hands. One reason to deliver quicker and shorten lead time is financial: earlier delivery means earlier payment. Our financial controller will like that. But for our team leader, there’s even more in it. She’ll get a shorter customer feedback loop, telling her as early as possible if something is missing or sub-optimal in the work units, so her team won’t keep producing more work items with the same flaws. And if rework is required, e.g. bug fixing, it will be a lot easier and faster for the team if their memory is still fresh. The more team success depends on learning and quality of the output, the more our team leader will be interested in shortening lead times.
Measuring cycle and lead time of (some or all) work units gives our team leader data to analyze and pointers on where to look for performance improvements. Using these metrics also explains why she didn’t need to worry about all the other things our knowledge workers do when she defined the work unit. Any team time spent on training, administration, meetings, management reviews, status reports… is a potential source of waiting time, and can therefore creep into cycle time and lead time. She’ll need to analyze to what extent something needs to be done about them (and probably decide that killing a few meetings wouldn’t hurt, and would help improve throughput).
There’s one more metric for our team leader’s arsenal, which is simply tracking the number of work units in the system. For all units the team is working on, we’ll use the term “Work In Progress” (WIP). Other units between customer order and customer delivery we’ll call “Queue”[3]. In principle, the team leader doesn’t want to have units sitting in the queue that aren’t work in progress yet. Every unit in the queue is a monetizable opportunity, but no value-add work is done on it until it moves to WIP. So she may be tempted to have the team start work on every new work unit as soon as the customer order comes in. Imagine she started with 10 units in WIP and another 10 in queue and voilà, now she has 20 units in WIP. “But it won’t work, who am I kidding?” she’ll realize. “If I double the items in WIP, that will just double cycle time and it won’t make a difference.”
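Her back-of-the-envelope reasoning is an instance of Little’s Law, which ties our three metrics together: average WIP = throughput × average cycle time. A minimal worked example (the numbers are hypothetical):

```python
def avg_cycle_time(wip: float, throughput: float) -> float:
    """Little's Law rearranged: cycle time = WIP / throughput."""
    return wip / throughput

# Hypothetical team completing 5 units per week:
print(avg_cycle_time(wip=10, throughput=5))  # 2.0 weeks per unit
print(avg_cycle_time(wip=20, throughput=5))  # 4.0 weeks: doubling WIP doubles cycle time
```

Since throughput is capped by team capacity, pushing more units into WIP can only stretch cycle time.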
Actually, it’s worse than that… Here is what happens to cycle time when WIP, and with it capacity utilization, increases.
As long as team capacity utilization isn’t pushed beyond 60–70%, things are mostly fine. But if the team leader stuffs too many simultaneous work units into WIP, or is unaware of this relationship, cycle time will increase steeply and non-linearly, and the throughput rate will deteriorate. The cause of this relationship is the increasing time consumed by the overhead activity to manage and coordinate the work items. One such overhead penalty is the switching cost (expressed in time, in this case) when knowledge workers need to mentally context-switch between different work items[4]. Another overhead penalty is incurred by keeping track of the work items: if more work items are in progress, management (and financial controllers) will ask for status reports and reviews, which cost time, which keeps items in work in progress longer, which increases the number of items in work in progress…[5]
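The shape of the curve is easy to get a feel for with the classic single-server queueing formula. This assumes an M/M/1 queue, which is our modelling choice, not the article’s; real teams are messier, but the non-linearity is the same:

```python
# Average time in system for an M/M/1 queue: W = (1/mu) / (1 - rho),
# where rho is capacity utilization. A modelling assumption for illustration.
service_time = 1.0  # time to process one unit when nothing is queued

for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    cycle_time = service_time / (1 - rho)
    print(f"utilization {rho:4.0%}: cycle time {cycle_time:5.1f}x the no-queue time")
```

Up to 60–70% utilization the penalty is modest (2–3x); at 90% it is 10x, and at 95% it is 20x. That is the cliff the team leader must keep her team away from.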
There is an analogy that makes this easier to understand. The analogy is traffic. Say we’re interested in getting as many vehicles as we can through a stretch of highway. One way to do that is to increase the speed at which we move these vehicles through the stretch — that would be the equivalent of shortening cycle time. Another way is to put more vehicles on the stretch — that would be the equivalent of adding to work in progress and increasing capacity utilization. But we know that if we keep adding more, at some point it begins to affect the speed at which the cars can travel. The optimal traffic flow results from some combination of higher speed (but not too high) and higher capacity utilization (but not too high). The same is true for optimal throughput.
Once the team leader has observed and experienced this dynamic, she’ll carefully control the amount of work units she allows in WIP. An easy way to accomplish that is to only start work on a new unit when work on an older unit has been completed[6]. She will also have to teach this to her financial controller, who will typically think in terms of keeping people busy and (very mistakenly) assume that people and team capacity utilization of 90% or higher is good and desirable. Fortunately she’ll have the data to demonstrate that this is flawed thinking. Still, while the team leader is right to manage for optimal capacity utilization, she and her financial controller won’t be happy if there are valuable work units waiting in queue while the clock is ticking. That will be the topic of the next article.
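The pull rule itself fits in a few lines. A minimal sketch (the WIP limit, unit names and data structures are hypothetical):

```python
from collections import deque

WIP_LIMIT = 3  # hypothetical: maximum units worked on simultaneously

queue = deque(["unit-4", "unit-5", "unit-6"])  # waiting units, in order
wip = {"unit-1", "unit-2", "unit-3"}           # units currently in progress

def complete(unit: str) -> None:
    """Finish a unit, then pull new work only while a WIP slot is free."""
    wip.discard(unit)
    while queue and len(wip) < WIP_LIMIT:
        wip.add(queue.popleft())

complete("unit-2")
print(sorted(wip))  # ['unit-1', 'unit-3', 'unit-4']: one unit out, one pulled in
```

The point is that nothing ever pushes work into WIP; completion is the only trigger that pulls the next unit from the queue.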
Notes
[1] This discussion is getting a little unfair to modern physical manufacturing, which nowadays can be quite complicated and variable too. Throughout the text, we’re comparing to the (over)simplified mental model of the widget factory prevalent in so much of the professional literature.
[2] The term “Throughput” was popularized by Eliyahu Goldratt in “The Goal” and in the several other books and publications commonly known as the “Theory of Constraints”. It is still highly recommended reading… but be aware that most of the literature discusses physical manufacturing and is not the easiest way to think about applying the ideas and principles to knowledge work. The knowledge work field which has progressed the most is (you guessed it) software engineering, where optimizing throughput is often referred to as “flow”.
[3] Software engineering often uses the term “Backlog” for pending work units that aren’t work in progress yet. Strictly speaking our queue definition also includes items which are completed but not yet delivered. These are ignored in most discussions though, including ours here.
[4] You can easily experience this for yourself by trying to work on twenty different things in one day as opposed to focusing on a small handful.
[5] Another implication of this relationship is that measurements of a work item’s cycle time and lead time are a function of the team’s total capacity utilization at that moment. Team leaders who track these metrics but don’t have a WIP constraint policy in place must therefore be careful when extrapolating conclusions from these metrics to different capacity utilization regimes.
[6] This is the “kanban” method, famously pioneered and developed by Toyota.