KaizenOps.io is a brand-new take on production incident management that relies heavily on AI. We aim to delight our users, mostly Site Reliability Engineers and DevOps practitioners, by delivering actionable insights during production incidents. When an incident is triggered, we process vast amounts of monitoring data so our users don’t have to hunt for the needle in the haystack themselves while under pressure. Delivering useful incident insights to fast-moving teams and their ever-changing, complex environments is not easy.

Our engineering team consists of myself and our founders, Mark and Nate. The three of us have over 40 years of combined experience in monitoring product development and research. KaizenOps.io is the culmination of many ideas we have honed over the years in the monitoring industry.

Our product strategy is bold and opinionated: we believe this space doesn’t need yet another pretty dashboarding tool; it needs a true intelligence service that interprets the vast amounts of existing data. When an incident occurs, we interpret our users’ data, explain what it means for their customers, and suggest actions to take. All this happens in Slack, never taking them away from their usual day-to-day workflows.

In a series of posts, I will explore the architecture we chose and why we think it is suited to enable a small team of engineers and domain experts to be agile and productive. We will talk about our triumphs, struggles, and lessons learned.

kaizenOps.io Architecture

At its core, kaizenOps.io is a complex event processor. When an incident occurs within a customer’s environment, we get to work. An incident might require us to reach out to some or all of the monitoring services the user has for their applications and environments. We compile the relevant data and apply a series of machine learning algorithms to establish the impact of the incident and discover the cause. We also fingerprint each incident to check whether it has happened before. If it has, we report the remedy so our users can take action quickly.
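
To make the fingerprinting idea concrete, here is a minimal sketch in Python. It is not our production implementation: the field names (`service`, `alert`, `symptoms`), the choice of SHA-256, and the in-memory remedy table are all assumptions made for illustration. The point is only that an incident is reduced to its stable attributes, so that a recurrence maps to the same key even when timestamps or symptom ordering differ.

```python
import hashlib
import json


def fingerprint(incident: dict) -> str:
    """Hash the stable attributes of an incident (hypothetical schema).

    Volatile fields such as timestamps are deliberately ignored, and
    symptoms are sorted so that their order does not change the result.
    """
    stable = {
        "service": incident["service"].lower(),
        "alert": incident["alert"].lower(),
        "symptoms": sorted(s.lower() for s in incident["symptoms"]),
    }
    canonical = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lookup_remedy(incident: dict, known_remedies: dict):
    """Return the remedy recorded for a matching past incident, or None."""
    return known_remedies.get(fingerprint(incident))


# Usage: a past incident and a recurrence that differs only in volatile details.
past = {"service": "checkout", "alert": "HighLatency",
        "symptoms": ["db_pool_exhausted", "5xx_spike"], "timestamp": 1000}
recurrence = {"service": "Checkout", "alert": "highlatency",
              "symptoms": ["5xx_spike", "db_pool_exhausted"], "timestamp": 2000}

remedies = {fingerprint(past): "Increase DB connection pool size"}
remedy = lookup_remedy(recurrence, remedies)
```

An exact hash only catches incidents whose normalized attributes match exactly; fuzzier matching would need a different technique and is outside the scope of this sketch.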

Architecturally, this type of service calls for a very elastic data path that often requires parallel retrieval and processing. We employ AI techniques that are very resource hungry. If our users’ environments are not experiencing a problem, kaizenOps.io will be idling. But once an incident occurs, we must scale rapidly to run hundreds of thousands of computationally heavy tasks in parallel.
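
The fan-out pattern described above can be sketched locally as follows. This is an illustration under stated assumptions, not our actual data path: a thread pool stands in for parallel Lambda invocations, and `analyze_chunk`, the threshold, and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

ERROR_RATE_THRESHOLD = 0.05  # hypothetical threshold for an "anomalous" sample


def analyze_chunk(chunk):
    """Stand-in for one computationally heavy task: count anomalous samples."""
    return sum(1 for value in chunk if value > ERROR_RATE_THRESHOLD)


def fan_out(chunks, max_workers=8):
    """Run one task per chunk in parallel and collect results in input order.

    Locally this uses threads; in a serverless deployment each task would
    instead be a separate function invocation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_chunk, chunks))


# Usage: three independent chunks of monitoring data, processed in parallel.
chunks = [[0.01, 0.20, 0.30], [0.02], []]
anomalies_per_chunk = fan_out(chunks)
total_anomalies = sum(anomalies_per_chunk)
```

Because each task depends only on its own chunk, the number of parallel workers can grow from zero to whatever an incident demands without any coordination between tasks.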

While we are all familiar with the classic elasticity offered by containers and microservices, the level of elasticity we need calls for a radically different architecture.

AWS Serverless Application Model (SAM) and Lambda

AWS Serverless has been a fantastic fit for our workload and engineering team. AWS allows us to:

  • Achieve incredible elasticity without much dedicated effort
  • Have a declarative architecture
  • Pay only for the resources used
    • Cost of idling is $0
  • Not spend time managing operations
    • There is almost nothing to manage

While classic IaaS and PaaS vendors have promised these qualities, AWS SAM takes them to the next level. In my opinion, we have finally reached the right granularity in cloud computing: one that lets engineers be extremely productive by focusing on what needs to execute, without worrying about what will happen to their precious algorithms once they are packaged and deployed. Our architectural visions now deploy as we imagine them, without complex tools like VMs, Docker, and EC2 getting in the way and slowing us down.

AWS SAM displaces all deployment artifacts of the past while providing extreme elasticity. We aspire to use this model to its full potential.

Architecting for Change

What makes large-scale software development projects so expensive and prone to failure? Most engineers rightly call out feature creep and ever-changing, incompatible requirements as the most common causes. While this is true, constant change is a function of shifting market needs. Here at kaizenOps.io we don’t shy away from change; we embrace it. Our users and our domain move so fast that we had to architect our solution for maximal flexibility.

We could not afford to build a monolithic offering that would pigeonhole us into the solutions of today. We needed a way to continually evolve and roll our technologies forward as demands and environments change. In our experience so far, AWS SAM and Lambda provide a fantastic platform that makes this type of architecture and team culture possible. We don’t have a single centralized service, container, or runtime in our stack. Instead, our environment consists of ‘features’, each of which is a group of interoperating but decoupled Lambda functions. These scale independently at every level as needed. This reflects the tried-and-true UNIX development model of building small, composable tools that each do one thing well.
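
As a rough illustration of what decoupled means here, consider two tiny functions written in the shape of Python Lambda handlers, taking an `event` and a `context`. Everything in this sketch is hypothetical: the handler names, the event fields, the 500 ms threshold, and the hard-coded data that stands in for a call to a monitoring service. Neither function imports or calls the other; they agree only on the shape of the event passed between them.

```python
SLOW_REQUEST_MS = 500  # hypothetical cut-off for a "slow" request


def fetch_metrics_handler(event, context=None):
    """Does one thing: gather data for an incident.

    Hard-coded values stand in for a real monitoring API call.
    """
    return {
        "incident_id": event["incident_id"],
        "latencies_ms": [120, 135, 900, 870],
    }


def detect_impact_handler(event, context=None):
    """Does one thing: judge impact from the event it receives.

    It knows nothing about where the latencies came from.
    """
    slow = [v for v in event["latencies_ms"] if v > SLOW_REQUEST_MS]
    return {
        "incident_id": event["incident_id"],
        "impacted": len(slow) >= 2,
        "slow_samples": len(slow),
    }


# Usage: chained by hand here; in a deployed system an event source or
# orchestration layer would pass the output of one function to the next.
metrics = fetch_metrics_handler({"incident_id": "inc-1"})
verdict = detect_impact_handler(metrics)
```

Because the only contract is the event payload, either function can be rewritten, replaced, or scaled on its own without touching the other.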

Functionally Flexible

A side benefit of Lambda programming is that execution units remain highly parallelizable. This is a desirable trait shared with functional languages like our language of choice, Clojure. Functional programming encourages small, stateless execution units that are easy to write, test, and evolve. Further, when applied properly, functions are highly composable. Combine these traits with AWS SAM’s scale-out capabilities, and you don’t need an army of engineers to build incredibly powerful analytics engines.
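
A small sketch of the composability point, written in Python for readability rather than Clojure. The `compose` helper and the three step functions are hypothetical examples, not code from our stack. Each step is a pure function of its input, so the steps can be tested in isolation and chained in any sensible order.

```python
from functools import reduce


def compose(*steps):
    """Return a function that applies the given steps left to right."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def drop_missing(samples):
    """Remove gaps in the data without mutating the input list."""
    return [s for s in samples if s is not None]


def to_seconds(samples_ms):
    """Convert millisecond readings to seconds."""
    return [s / 1000 for s in samples_ms]


def mean(samples):
    """Arithmetic mean of a non-empty list."""
    return sum(samples) / len(samples)


# Usage: build a pipeline out of small stateless pieces.
mean_latency_seconds = compose(drop_missing, to_seconds, mean)
raw = [1000, None, 3000]
result = mean_latency_seconds(raw)  # 2.0
```

Since none of the steps keeps state or mutates its input, the same pipeline can run on many inputs at once, which is what makes this style a natural match for parallel function execution.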

In later posts we will dig deeper into kaizenOps.io architecture and talk about our technical triumphs, failures, and lessons learned in more detail.

All of the work we put into our solution is to help you in times of site crisis. Get help for your SRE team, get kaizenOps.io.
