Vasu

Vasudevan Gopalan

Software engineering is akin to having children; the labor before birth is painful, and the labor after birth is where we dedicate most of our efforts😊.

Software engineering as a discipline spends more time talking about the first period, but research clearly suggests that 40-90% of the costs are incurred after the birth of the systems. These costs are incurred to keep the platforms reliable.

Why should platforms be reliable? Because the average consumer demands speed, convenience, and reliability from every digital experience. While availability focuses on the platform’s operational quotient, reliability focuses on the platform’s useful quotient.

Site Reliability Engineering is a practice and cultural shift towards creating a robust IT operations process that would instill stability, high performance, and scalability to the production environment.

Reliability is the most fundamental feature of any product; a system is not useful if nobody can use it!

Site Reliability Engineers (SREs) are engineers – applying the principles of computer science and engineering to the design and development of computing systems, generally large, distributed ones. As Ben Treynor Sloss of Google states – SRE is what happens when a software engineer is tasked with what used to be called operations. Automation, Self-healing, Scalability, Resilient – these characteristics become mainstream.

An SRE function is run by IT operational specialists who code. These specialist engineers implement a software-first approach to automate IT operations and preempt failures. They apply cutting-edge software practices to integrated Dev and Ops on a single platform and execute test codes across the continuous environment. They possess advanced software skills, including DNS Configuration, remediating server, network, and infrastructure problems, and fixing application glitches.

The software approach codifies every aspect of IT operations to build resilience within infrastructure and applications. Thus, changes are managed via version control tools and checked for issues leveraging test frameworks, while following the principle of observability.

The Principle of Error Budget

SRE engineers verify the code quality of changes in the application by asking the development team to produce evidence via automated test results. SRE managers can fix Service Level Objectives (SLOs) to gauge the performance of changes in the application. They should set a threshold for permissible minimum application downtime, also known as Error Budget. If the downtime during changes in the application is within the scale of the Error Budget, then SRE teams can approve it. If not, then the changes should be rolled back for improvements to fall within the Error Budget formula.

Error Budgets tend to bring balance between SRE and application development by mitigating risks. An Error Budget is unaffected until the system availability falls within the SLO. The Error Budget can always be adjusted by managing the SLOs or enhancing the IT operability. The ultimate goal remains application reliability and scalability.

DevOps and SRE

We know that DevOps is all about culturally combining development and operations. While DevOps dwells on what needs to be done for this, SRE focuses on how it must be done.

DevOps brings the traditionally separate teams of development and operations under one roof to improve upon collaboration, communication, integration, and software releases. This is accomplished by the focus on end-to-end automation of builds and deployments as well as effectively managing the entire infrastructure as code.

SRE is a discipline that incorporates the various aspects of software development and applies it to issues and tasks in IT operations specifically. The main objective of SRE is to develop a highly reliable and ultra-scalable software application or system. The prime focus is to completely automate (if not all) the tasks to ensure reliability in the systems. The ‘relentless’ pursuit of automation in SRE helps brands eliminate manual work, giving developers more time to innovate and create.

Also, in comparison to DevOps, SRE provides a good set of detailed steps in each part of the framework to reach a particular goal.

While DevOps and SRE sound like they are on opposite sides of the spectrum, both approaches share the same end goals.

  • To make incremental changes fast and efficiently
  • To reduce the number of organization silos
  • To have a flexible, open-minded, and adaptable working culture
  • Use automation wherever possible
  • To monitor performance and improve when necessary

Just to slightly go back in time. In the old school /era of system administrators – Sysadmin was mostly assembling existing software components and deploying them to work together to produce a service. As the system grows in complexity and traffic volume, the need to have a larger sysadmin team comes into force, thereby increasing both direct and indirect (differences with the dev team in terms of culture, background, skill set, goals, etc.) costs to the organization. While the Dev team would want to launch new features etc., the ops team wants to maintain the status quo, to ensure service continuity. Hence, the two teams’ goals are fundamentally in tension.

Toil is mundane, repetitive operational work providing no enduring value, which scales linearly with service growth. Taking humans out of the release process can paradoxically reduce SRE’s toil while increasing system reliability.

SRE – Google’s approach to Service Management

SRE is what happens when we ask a software engineer to design an operations team, the common aptitude being developing software systems to solve complex problems. Motivated by “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks.”

SRE teams generally have 50-60% of regular software engineers, other 40-50% being near software engineers who come with rarer skills like Uni system internals, networking expertise, etc.

SRE teams should focus on engineering, to avoid the fate of linear scaling up of the team. SREs, ensure that service “runs and repairs itself”. SREs typically should spend only 50% on ops work, remaining time on coding for the project itself.

Where to start?

Organizations must identify change agents who would create and promote a culture of maximum system availability. They can champion this change by practicing the principle of observability, where monitoring is a subset. Observability essentially requires engineering teams to be vigilant of common and complex problems hindering the attendance of reliability and scalability in the application. See the principles of observability below:

Tenets of SRE

For a given service, ensuring Availability, Latency, Performance, Efficiency, Change management, Monitoring, Emergency response, Capacity planning, etc.

Google operates a “blame-free postmortem culture”, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Change Management

Data suggest that ~70% of outages are due to changes in a live system. Remove humans, and automate to achieve the following:

  1. Implement progressive rollouts
  2. Quick and accurate detecting of problems
  3. Roll back of changes safely when problems arise

SRE represents significant break from existing industry best practice for managing large, complicated services.

Benefits of SRE

No more toiling, organizations should embrace SRE and make their end-customers happy.

References –

About the Author –

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc. Outside of his professional role, Vasu enjoys playing badminton and is a fitness enthusiast.