Some of the team went to the August 15, 2018 Site reliability meetup at Twitter. Here are some quick notes we took.¹

Brian Weber from Twitter: “Diving into Production Issues at Scale”

SREs understand that there is a customer on both sides of the service: the internal customer and the external customer. You must keep the interests of both customers balanced.

Editorial: there are some concerns with the concept of internal customers.

To find the problem, you must:

  • Follow the logs
  • Follow the code
  • Follow the next…?
  • Follow the documentation
  • Follow the documentation again
  • Repeat

Tribal knowledge can be a blocker. Not everyone knows everything. SREs *must* write down tribal knowledge. Share and document. Document and share.

Know the difference between your environments. Write them down for your successors. Code can express itself differently in different environments. Test != production.

Niall O’Higgins: “Building SRE from scratch at Coinbase during hypergrowth”

What is an SRE?

  • Firefighting
  • Pagers
  • Ops

No. SRE is the answer to these problems.

Editorial: Consider that a better analogue for SRE is a forest manager, who manages the forest so that it does not burn, or burns only in specified ways.

Started with Google’s SRE book. Key insights for their team:

  • Measure to improve human, organizational and machine systems. Not *just* machine systems.
  • Eliminate toil! Proactively find the toil and figure out how to not do that.
  • Build in org back pressure (error budgets, etc.) – forces a shared goal.
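
The error budget idea can be made concrete with a little arithmetic. The following is a minimal sketch, not something shown in the talk: the function name, the request-based definition of the budget, and the 99.9% target in the usage note are all assumptions chosen for illustration.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    Hypothetical helper: assumes a request-based SLO, where the budget is
    the number of requests allowed to fail in the measurement window.
    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # Requests that may fail before the SLO is violated.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With an assumed 99.9% target and 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. The "back pressure" comes from agreeing in advance what happens when the remaining fraction approaches zero.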

When implementing, consider focusing on the promises we make, including promises between systems, between people and systems, and between people (expanded to teams). Mentioned the book “Thinking in Promises: Designing Systems for Cooperation” by Mark Burgess.

  • SLIs
  • Each service has its own health metrics
  • Be careful of indicator overload and over-instrumentation.
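
As a rough illustration of what a small set of SLIs might look like, here is a minimal sketch that computes two indicators from request records. The `Request` type, the choice to treat 5xx responses as failures, and the latency threshold are assumptions for the example, not details from the talk.

```python
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    """Hypothetical record of one served request."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: Iterable[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests to measure")
    good = sum(1 for r in reqs if r.status < 500)
    return good / len(reqs)


def latency_sli(requests: Iterable[Request], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in reqs if r.latency_ms <= threshold_ms)
    return fast / len(reqs)
```

Keeping each indicator to a single ratio per service is one way to avoid the indicator overload mentioned above.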

Develop Four Golden Signals

  • Work with each team directly
  • Start with initial spec, instrument – even if not perfect. Perfect is almost always the enemy of the good.
  • When promises break: That’s called Incident Response
  • Measure quality of incident response. See: Microsoft Incident Response and shared responsibility for cloud computing
  • Quantitative measures: Time to Detect, Time to Engage, Time to Fix
  • Qualitative measures: Communications: Did we notify the right stakeholders in a timely fashion? In retro: Is the runbook updated?
  • Tooling: Reduce toil
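
The quantitative measures above can be derived from a handful of timestamps per incident. This is a minimal sketch under stated assumptions: the `Incident` type and its field names are hypothetical, and the interval definitions (detect measured from the start of impact, engage from detection, fix from engagement) are one reasonable reading rather than the speaker's definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with four key timestamps."""
    started: datetime   # when the promise was first broken
    detected: datetime  # when monitoring or a person noticed
    engaged: datetime   # when a responder began working on it
    fixed: datetime     # when service was restored

    def __post_init__(self) -> None:
        # Guard against records whose timestamps are out of order.
        if not self.started <= self.detected <= self.engaged <= self.fixed:
            raise ValueError("incident timestamps must be in chronological order")

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_engage(self) -> timedelta:
        return self.engaged - self.detected

    def time_to_fix(self) -> timedelta:
        return self.fixed - self.engaged
```

Tracking these three intervals separately shows whether an improvement is needed in alerting, in paging and escalation, or in the repair itself.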

SRE helps you keep your promises

  • Reliability engineers can act as “sensors” within other teams if they are embedded.
  • Can “Parachute in” and help define
  • Eliminate toil through tools and process
  • If you use an embedded model, make sure you communicate with the “centralized brain” of the SRE team

Mentioned he was still hiring SREs.

Thanks to the Twitter SRE team for organizing another informative meetup.

1. Like live tweeting without the “live”.
