Editor’s note: This post is the second in an ongoing series in which Kyoko interviews SREs about the role of site reliability engineering and the challenges and important aspects of the profession.
Earlier this month I had a chance to speak with Buffer system engineering lead Steven Cheng. As you may know, Buffer is famous for its radical transparency and its fully remote team. When Steven visits our kaizenOps.io office in San Francisco, we are always inspired by his wit and insights. This time, he shared his thoughts about site reliability engineering as a culture as well as how it expresses itself as an operational role. This blog is a condensed and edited version of our conversation.
What is an SRE?
I think there are two aspects to look at when we talk about SRE: the process and the culture. What it looks like in practice depends heavily on the service-level objectives a company has, so you need to define what SRE means to your engineering team. For example, a startup like ours can move fast and break things. This mentality is deeply held in startup culture, so when something goes wrong, we do a post-mortem and move on. Bigger organizations with strict service-level agreements (SLAs) just can’t move that fast, because their focus is on not breaching those SLAs. Any downtime can have an outsized impact on team velocity and the overall development culture. If an incident uses up all of the downtime allocated for the year, the team is required to slow down to make sure the yearly SLA isn’t breached.
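Editor’s note: To make the point about a yearly downtime allowance concrete, here is a minimal sketch of the arithmetic. The 99.9% target, the incident durations, and all function names are illustrative assumptions, not figures or tooling from Buffer or from the conversation.

```python
# Rough sketch of a yearly downtime allowance ("error budget").
# All names and numbers here are hypothetical, chosen only for illustration.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_target, period_days=365):
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return (1.0 - availability_target) * period_days * MINUTES_PER_DAY


def remaining_budget_minutes(availability_target, incident_minutes, period_days=365):
    """Allowance minus the downtime already spent on incidents."""
    allowance = allowed_downtime_minutes(availability_target, period_days)
    return allowance - sum(incident_minutes)


def should_slow_down(availability_target, incident_minutes, period_days=365):
    """True once incidents have used up the whole allowance for the period."""
    return remaining_budget_minutes(
        availability_target, incident_minutes, period_days
    ) <= 0


if __name__ == "__main__":
    target = 0.999  # assumed 99.9% yearly availability target
    incidents = [300, 240]  # assumed incident durations, in minutes

    print(f"Allowance: {allowed_downtime_minutes(target):.1f} minutes per year")
    print(f"Remaining: {remaining_budget_minutes(target, incidents):.1f} minutes")
    print(f"Slow down: {should_slow_down(target, incidents)}")
```

Under these assumptions, a 99.9% target leaves roughly 525.6 minutes (under nine hours) of downtime per year, so two long incidents are enough to exhaust it. That is why teams bound by strict SLAs tend to trade release speed for caution.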
What are some of the important characteristics or knowledge that have made you thrive as an SRE?
I thrive in reactive work, which is something not many people like. In a reactive environment, you get paged in the middle of the night, when things are on fire, and you have to put the fire out. I like this kind of stress. I enjoy diagnosing the problem under pressure.
What does success mean to you?
Within Buffer, I think the way to assess how good you are as an SRE is not well established – yet. The team feels that I can diagnose problems and provide visibility to everyone inside and outside the team. This work provides a good experience for the team. For me, this interaction has been the measure of success.
I want to thank Steven for his time and willingness to share.
I am looking to interview other site reliability engineers. If you are interested, please let me know.
To follow Steven: