Site Reliability Engineering as a Service

Site reliability engineering (SRE) merges IT (and network) operations and software development practices, with the goal of creating ultra-scalable and highly reliable production systems. Ben Treynor, founder of Google’s site reliability team, comments that SRE is “what happens when a software engineer is tasked with what used to be called operations.”

Containers and cloud platforms blur the lines between applications, networks, and infrastructure, and that alone forces an ownership and operational model that integrates the traditional development and operations teams. Historically, this would have been a recipe for disaster, but SRE removes the debate over what can be launched — and when — by introducing a mathematical formula.

In this new world, the SRE team defines a service level agreement (SLA) for the uptime of a given service. The difference between the SLA and 100% uptime is the maximum allowable downtime for errors and outages, also known as the error budget. The team is then given leeway on how the error budget is managed: If the service is meeting or exceeding its SLA, the team can plan launches or maintenance. If the service isn’t meeting the SLA, all planned activities such as new launches or maintenance are halted to focus resources on reducing downtime. As a result, everyone works together to reduce errors. Meanwhile, the SRE team focuses on continuous improvement of the operating environment, doing everything from writing scripts to creating operating procedures.
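To make the arithmetic concrete, here is a minimal sketch of the error budget calculation described above. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values prescribed by any particular SRE team.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Maximum allowable downtime, in minutes, for an uptime target over a window.

    target_pct is the uptime target as a percentage (for example 99.9).
    window_days is an assumed measurement window; 30 days is only a default.
    """
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be greater than 0 and at most 100")
    window_minutes = window_days * 24 * 60
    # The budget is the gap between the target and 100% uptime.
    return window_minutes * (100 - target_pct) / 100


def launches_allowed(target_pct: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """True while observed downtime is still within the error budget."""
    return downtime_minutes <= error_budget_minutes(target_pct, window_days)


# Hypothetical example: a 99.9% uptime target over a 30-day window.
print(round(error_budget_minutes(99.9), 1))  # 43.2
print(launches_allowed(99.9, 20.0))          # True: budget remains, launches can proceed
print(launches_allowed(99.9, 60.0))          # False: budget spent, focus on reliability
```

In practice the downtime figure would come from monitoring data, and a team might measure failed requests rather than minutes of outage. The decision rule stays the same either way: planned work proceeds only while budget remains.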

This approach is elegant but not necessarily intuitive, and transforming an existing organization into the SRE model requires insight and finesse. Likewise, resource or political constraints may mean that the SRE function is a better fit with a neutral party. Rule4’s team of SRE experts can help transform your organization to better integrate these concepts, and we can even fill the SRE role as a service on an ongoing basis.

Reach out to Rule4 to discuss your SRE needs. We’re here to help.