The point of software is to do something. For example, when a user clicks on the “Add to Cart” button, the user expects the item to be in their shopping cart quickly and then sent on their way to continue shopping or to checkout. Let’s call the completion of these requests the experience unit of work.
More often than not, though, new technology or frameworks are implemented without consideration for truly knowing how long it takes for the unit of work to complete. Time and again, inventors and architects create software without the measurements needed to provide the whole picture; especially in Production environments at scale.
How this comes about is understandable. Some intrepid individual is excited about coding his idea together and sharing it with the world. They are drawn into the details and lose sight of the big picture. Early adopters start actively using the technology at scale to serve customers. Eventually, someone responsible for keeping this stuff up and running pauses and says – wait, how do I know if this is running well enough for everyone we are serving?
An example of this phenomenon was the rapid adoption of NodeJS. Proponents of asynchronous transactions scoffed at the notion of wasting CPU cycles waiting on slow non-cpu activities like reading from a filesystem or waiting on a network connection. Why wait at all? Decouple timing, fire off any commands doing significant work, and register a callback to take next steps if needed.
Asynchronous transactions opened unique possibilities for easily adding complementary tasks to a request. For example, the “Add to Cart” button on day one of service does nothing more than add the selected item to a shopping cart. Along comes the app owner and she wants to see who’s putting which items together. Later, a product manager wants that logged into Slack, and a CEO wants a report sent to his iPhone. In an asynchronous NodeJS app, a dev simply needs to add an independant callback for each of these additions.
Nimble changes like this are possible because the engine/language is designed to have zero waiting for tasks and everything is built without dependencies. In the synchronous world, if one wanted to enhance “Add”, all sorts of interactions need to be considered less you severely impact your user’s experience. You likely need to enhance your framework and supporting classes to accommodate the new flow, and possibly introduce entire new testing facilities. Synchronous dependencies vs. asynchronous freedoms introduced unprecedented flexibility.
But this flexibility came with a cost. The extreme decoupling made measurement impossible. While there were no bottlenecks waiting on slow tasks like file system access, NodeJS applications were jumbles of subroutines without a connection. If you’re trying to measure how many “Add to cart” submissions users made and how long they took, how can you measure them if the action relies on 30 disconnected library calls finishing?
Eventually, the keepers of NodeJS agreed that not being able to measure experience of the millions of users their creation served was a bad thing. They added a mechanism to help stitch together a user request to all the subroutines. Although this did add minimal overhead, the ability to measure outweighed the blindness.
If you find yourself creating a new technology, or even a regular app, ask yourself a simple question – do you know how to measure what each experience unit of work is doing? If your technology scaled to billions of requests, would you know how well, or even if, each unit of work completed?
Focusing on measuring the experience the software delivers is the work that SRE’s need to encourage and instrument. Knowing when and if your software fails any user is the only true way to focus on the fix or automate the mitigation.
We’d like to hear your stories of instrumentation and handling production incidents. For 15-30 minutes of your time, we’ll buy you dinner.