As a user researcher, I have spoken to multiple Site Reliability Engineers (SREs) to learn about their pains, problems, goals, and motivations. In my previous blog post, “SREs, who are you?”, I shared some background on my SRE studies. While interviewing different SREs, I am learning that what SRE means varies from organization to organization, so I thought I’d share some of those perspectives. My hope is that DevOps leaders who are creating SRE best practices in their organizations can take some ideas from these interviews and make effective decisions based on their own organizational needs and processes.
The first SRE I had the opportunity to interview was Todd Palino.
Name: Todd Palino
Number of SREs: 350-400
Todd is a Senior Staff SRE at LinkedIn with over 15 years of experience in the field. LinkedIn, founded in 2002, is one of the most forward-looking tech companies.
Q: What is an SRE?
A: Site Reliability Engineer is very much a West Coast invention. There’s a lot of different ways to describe it, but I’ve described it as a particular discipline of DevOps. But, essentially, it combines roles that many of us in the operations fields were already doing.
It combines the roles of an architect, a tools developer, and an operations person into one role where they’re responsible for all of these things, but with a focus on the automation and on developing the tooling so that you stop doing reactive work and you’re constantly focusing on the proactive work instead.
“SRE is the glue that binds the entire organization at LinkedIn together”
Q: What is the important cultural aspect for SRE to function well?
A: You have to have an organization that has open communication and trust between the teams, because if you can’t trust the other team to be doing their job properly, then you can’t be efficient and you have to constantly second-guess.
If there is no trust, if there’s no openness, then you’re constantly having to watch your back for someone who’s trying to take over what you’re doing or make you look bad for some reason. Those types of environments, they’re not good for any team, but SRE just can’t function in an environment like that.
Q: What is the next goal for you?
A: Right now I’m just trying to improve the state of the art of SRE in general. Some of the tasks that I’m taking on at LinkedIn now have that goal in mind: not only improving how we do SRE at LinkedIn and transforming what we’re doing to make it better, but doing it in a way that we can then open source that work and bring it to the rest of the SRE world in some sort of consumable way. That’s mostly what I’m focused on right now. I’m starting to make a transition away from working on just Apache Kafka and toward working on things that are more SRE in general.
I want to thank Todd for his time and willingness to share.
You can find more about him on Twitter at @bonkoif.
Todd will be speaking on Apache Kafka at Kafka Summit in London on April 23-24.
For more info about “Kafka: The Definitive Guide”