During the past decade, DevOps has become a common approach across the majority of IT organizations. However, many enterprises primarily focus on improving the “development” part of DevOps. In most large organizations, the operations teams continue to work as disjointed units utilizing traditional Information Technology Infrastructure Library (ITIL) processes for ITOps. Google has chosen a different delivery approach. Called Site Reliability Engineering (SRE), this model and its variants were historically used by tech companies to bring efficiency to their IT Operations. However, as leading “traditional” organizations have begun to drive IT transformation, the SRE Model has been gaining popularity among large organizations as well.
According to Google, “SRE is fundamentally doing work that has historically been done by an operations team, but by using engineers with software expertise and with the belief that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labour. Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics. They split their time between operations/on-call duties and developing systems and software that help increase site reliability and performance.”
Five core constructs of the SRE model lie at the heart of its adoption by IT organizations.
- Proactive assurance of stability: The SRE model calls for site reliability engineers to spend only 50% of their time on incident resolution, with the remaining 50% spent on proactive improvements to system stability and resilience.
- Fact-based decision-making: Instead of subjectively prioritizing issue resolution, the SRE model uses a concept called “Service Level Objective” and real-time measurement of Error Budgets to provide bandwidth for Ops teams to focus on critical system stability issues.
- Balance between new changes and system stability: Most organizations see friction between Dev and Ops, with the Dev team asking to push more changes, and the Ops team restricting changes to ensure stability. The SRE model brings balance to this with concepts like Error Budget measurement.
- Faster incident resolution: The SRE model expects engineers to automate manual tasks, thereby improving operational efficiencies and reducing the time required to detect and resolve incidents.
- Collaborative Dev and Ops functions: The SRE model recommends an integrated Dev and Ops team that utilizes the power of digital collaboration tools.
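To make the Error Budget idea above concrete, here is a minimal sketch of how a budget could be derived from a Service Level Objective. It assumes a simple request-based availability SLO; the class name `ErrorBudget`, its fields, and the sample figures are illustrative assumptions, not something prescribed by the SRE model or by any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float      # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 250 failures, 99.9% SLO.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

A number like this gives Dev and Ops a shared fact to discuss: while budget remains, changes can keep flowing, and when it runs low, stability work takes priority.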
The SRE Model achieves these outcomes largely due to its differences from the traditional IT Operations model.
How Large Enterprises Can Adopt the SRE Model
Although the benefits of the SRE model are clear, large organizations often struggle to implement it. There are multiple reasons for this.
- Framework Availability: It is important to define a framework that’s customized to the organization. Such a framework should include ITSM processes aligned to SRE principles, specific roles and responsibilities, setup of monitoring and measurement tools, and collaboration and automation tools that help the organization embrace new ways of working.
- Skilled Resources: Multi-skilled individuals are critical for SRE model success. It is not easy to attract highly skilled software engineers to work in IT Operations, so this may require special incentive and rewards programs.
- Tools & Technologies: SRE relies on aligning automation with production effectiveness, and the right digital tools help achieve this throughout the SRE implementation journey.
- Process adaptations: Many large organizations have cumbersome processes for aspects such as change management or incident management. To make the SRE model effective, these processes may need to be adapted to rely more on technology gating than on manual governance.
- Cultural Changes: The SRE model brings new concepts that challenge traditional thinking. It will require a good level of evangelization across the organization to bring the cultural change necessary to support the new model.
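As an illustration of what "technology gating" in change management might look like, the sketch below approves or blocks a change automatically based on how much error budget a service has left. The function name `change_allowed`, the Core/Non-Core distinction as a boolean, and the threshold values are all hypothetical choices for this example; a real organization would set its own policy.

```python
def change_allowed(
    remaining_budget_fraction: float,
    is_core_service: bool,
    min_remaining_core: float = 0.25,      # assumed policy: core services keep a 25% reserve
    min_remaining_non_core: float = 0.0,   # assumed policy: non-core may spend the full budget
) -> bool:
    """Return True if a change may proceed without manual approval.

    remaining_budget_fraction is the share of the error budget still unspent
    (1.0 = untouched, 0.0 or below = exhausted).
    """
    threshold = min_remaining_core if is_core_service else min_remaining_non_core
    return remaining_budget_fraction > threshold


if __name__ == "__main__":
    # Illustrative checks for a deployment pipeline step.
    print(change_allowed(0.50, is_core_service=True))    # budget comfortably above reserve
    print(change_allowed(0.10, is_core_service=True))    # below the core reserve
    print(change_allowed(0.10, is_core_service=False))   # non-core still has budget
```

A check like this could run as a pipeline step in place of a manual approval board, with exceptions escalated to people only when the gate blocks a change.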
To overcome these challenges, large organizations can take the following five-step systematic approach.
Discover
This phase focuses on baselining the current state of the organization, including the maturity of the processes, team skills and tool availability.
Define
This second phase involves defining an SRE Operating Model that includes the team structure, roles and responsibilities, and an interaction model with a view of the target state. This phase also focuses on top-priority aspects such as Service Tiering and classifying Core and Non-Core services. During this phase, the process definitions are tuned to the new operating model.
Implement
This phase sees the first stage of the model become reality, with the team starting to work in an SRE construct from an SRE Backlog. The team will be supported by monitoring of Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and will start to focus on the automation of routine tasks.
Adopt
This phase is the most complex part of SRE Model adoption. Here, the model is expanded to Business teams and Development teams, causing them to experience a significant cultural shift and migration from traditional thinking to new ways of working. It is critical to have significant support from senior leadership throughout this stage.
Scale
Once the organization has buy-in from all stakeholders, the model is ready to scale to wider units and services. At this point, teams should have ironed out the finer aspects of their interactions and be equipped to work seamlessly.
Adopting the SRE Model is a continuous journey and requires support from management and all stakeholders. As enterprises move their IT landscape to the cloud, large IT organizations are going through significant transformations that rely on positive customer experiences, the ability to bring capabilities faster to market, and the flexibility to adapt to changing demands. IT operations cannot be left behind in this transformation. Models like SRE, and its underlying principles, are more critical than ever before as organizations transform their Ways of Working to fit the Digital Age.
Acknowledgements and References
- Input/discussions with Venkatesh Adiki, Soumen Chatterjee
- SRE Workbook