Silverchair Insights: Identifying “Toil”

14 September

Silverchair’s most recent company meeting featured a session introducing employees to the idea of "toil." CTO Stuart Leitch picked up the term coined by Google's Site Reliability Engineering (SRE) team while reading Chapter 5, "Eliminating Toil," of their book (freely available online). The concept pushes us to think about the nature of our work, our team's work, and the areas where it is worthwhile to reduce or eliminate toil.

[Image: lawnmowing as toil]

Why should you care? 

How much would you pay for a DVD player in the first year DVD players were introduced? How much would you pay for one with the same features five years later? As time goes on, people are willing to pay less and less for the same features. Competition and advances in technology drive prices down for pre-existing, "aging" features. At the same time, companies are continually adding new features, for which people will pay a premium.

If you're on the supply side of those features and standing still, you’re falling behind. If people expect to pay less for the same features over time, you’d better find ways to produce and supply them for less.

For Silverchair, this ultimately means that we have to keep getting more efficient at the things we already do (our existing features, capabilities, etc.) AND we have to find the space to build new features and capabilities (doing so with a mind toward efficiency). What can get in the way of that? Too much toil. We need to manage toil down to acceptable levels or it will crowd out our ability to competitively advance into the future.

What is toil?
Excerpt from Google's book

“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Not every task deemed toil has all these attributes, but the more closely work matches one or more of the following descriptions, the more likely it is to be toil:

Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.

Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.

Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.

Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.

No enduring value 
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.

O(n) with service growth
If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.”

A few examples of toil:

Should toil be avoided at all costs? Absolutely not!

Not all toil is equal. Some toil is economically avoidable and some is not. If we're going to invest resources toward reducing or eliminating toil within a particular area, there had better be a compelling payback period and return.

For instance, if a toil activity takes you 1 minute a week (52 minutes a year) to perform but 40 hours (2,400 minutes) to automate away, it would take more than 46 years to reach payback and start seeing a return on your investment. Additionally, automation isn't free; it's not "build it and forget it." It needs to be maintained and adjusted whenever the nature of the activity shifts. So it’s important to weigh the cost and return of reducing or eliminating any toil activity to make sure the numbers add up and the toil is economically avoidable; it won't always be worth tackling.
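The payback arithmetic above can be sketched in a few lines of Python. This is only an illustration, not a tool we use: the function name `payback_years` and the `maintenance_hours_per_year` parameter are assumptions made up for the example, and the maintenance figure in particular is a rough stand-in for the "automation isn't free" point.

```python
def payback_years(toil_minutes_per_week, automation_hours,
                  maintenance_hours_per_year=0.0):
    """Years until the time saved equals the time spent automating.

    Hypothetical helper for illustration. Returns None when yearly
    maintenance eats all of the savings, so payback never arrives.
    """
    saved_hours_per_year = toil_minutes_per_week * 52 / 60
    net_hours_per_year = saved_hours_per_year - maintenance_hours_per_year
    if net_hours_per_year <= 0:
        return None
    return automation_hours / net_hours_per_year


# The example from the text: 1 minute a week, 40 hours to automate.
print(round(payback_years(1, 40), 1))   # 46.2

# A heavier task (assumed numbers): 30 minutes a week, same 40 hours.
print(round(payback_years(30, 40), 1))  # 1.5

# The 1-minute task again, with an assumed 1 hour a year of upkeep.
print(payback_years(1, 40, maintenance_hours_per_year=1.0))  # None
```

The same 40 hours of automation work pays back in about a year and a half for the 30-minute task, and never for the 1-minute task once even an hour of yearly upkeep is counted.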

Okay, now what?

Interested in applying this concept? Here are my recommendations:

I see toil as an idea, a piece of vocabulary, to use while we reflect on how we're spending our time and consider where and how we can improve. We’re excited to use it as one of many tools in Silverchair’s ongoing development and evolution.

Jon Meadows, Software Quality and Release Leader  


Source: Beyer, B., Jones, C., Petoff, J., and Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN 9781491929124.
