GAVS – Global IT Consulting

Site Reliability Engineering

Jun 02, 2021

In this blog post

  • The Principle of Error Budget
  • DevOps and SRE
  • SRE – Google’s approach to Service Management
  • Where to start?
  • Tenets of SRE
  • Change Management
  • Benefits of SRE
  • References –

Software engineering is akin to having children; the labor before birth is painful, and the labor after birth is where we dedicate most of our efforts😊.

Software engineering as a discipline spends more time talking about the first period, but estimates suggest that 40-90% of a system's total costs are incurred after its birth. These costs are incurred to keep the platforms reliable.

Why should platforms be reliable? Because the average consumer demands speed, convenience, and reliability from every digital experience. Availability measures whether the platform is operational; reliability measures whether it is actually useful to the people using it.

Site Reliability Engineering is a practice and a cultural shift towards creating a robust IT operations process that brings stability, high performance, and scalability to the production environment.

Reliability is the most fundamental feature of any product; a system is not useful if nobody can use it!

Site Reliability Engineers (SREs) are engineers who apply the principles of computer science and engineering to the design and development of computing systems, generally large, distributed ones. As Ben Treynor Sloss of Google puts it, SRE is what happens when a software engineer is tasked with what used to be called operations. Automation, self-healing, scalability, and resilience become mainstream characteristics.

An SRE function is run by IT operations specialists who code. These specialist engineers implement a software-first approach to automate IT operations and preempt failures. They apply cutting-edge software practices to integrate Dev and Ops on a single platform and execute test code across the continuous integration and delivery environment. They possess advanced skills, including DNS configuration; remediating server, network, and infrastructure problems; and fixing application glitches.

The software approach codifies every aspect of IT operations to build resilience into infrastructure and applications. Thus, changes are managed via version control tools and checked for issues using test frameworks, while following the principle of observability.

The Principle of Error Budget

SREs verify the code quality of changes in the application by asking the development team to produce evidence via automated test results. SRE managers can set Service Level Objectives (SLOs) to gauge the performance of changes in the application. They should also set a threshold for the maximum permissible application downtime, known as the Error Budget. If the downtime caused by a change stays within the Error Budget, then SRE teams can approve it. If not, then the change should be rolled back and improved until it fits within the Error Budget.

Error Budgets tend to bring balance between SRE and application development by mitigating risks. An Error Budget is not exhausted as long as system availability stays within the SLO. The Error Budget can always be adjusted by revising the SLOs or by enhancing IT operability. The ultimate goal remains application reliability and scalability.
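To make the arithmetic concrete, here is a minimal sketch of how an Error Budget could be derived from an availability SLO and used to gate a change. The function names, the 30-day window, and the sample figures are assumptions chosen for illustration; they do not come from any particular tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed in the window if availability must stay at or above `slo`."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def change_fits_budget(
    slo: float,
    downtime_so_far_min: float,
    expected_change_downtime_min: float,
    window_days: int = 30,
) -> bool:
    """True if the change can ship without exceeding the remaining budget."""
    remaining = error_budget_minutes(slo, window_days) - downtime_so_far_min
    return expected_change_downtime_min <= remaining


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
    print(f"budget: {error_budget_minutes(0.999):.1f} min")
    # 30 minutes already spent: a 10-minute change fits, a 20-minute one does not.
    print(change_fits_budget(0.999, 30, 10))
    print(change_fits_budget(0.999, 30, 20))
```

Tightening the SLO (say, from 99.9% to 99.99%) shrinks the budget tenfold, which is why the SLO and the pace of releases have to be negotiated together.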

DevOps and SRE

We know that DevOps is all about culturally combining development and operations. While DevOps dwells on what needs to be done for this, SRE focuses on how it must be done.

DevOps brings the traditionally separate teams of development and operations under one roof to improve collaboration, communication, integration, and software releases. This is accomplished by focusing on end-to-end automation of builds and deployments and by managing the entire infrastructure as code.

SRE is a discipline that takes various aspects of software development and applies them specifically to issues and tasks in IT operations. The main objective of SRE is to develop a highly reliable and ultra-scalable software application or system. The prime focus is to automate most, if not all, tasks to ensure reliability in the systems. The ‘relentless’ pursuit of automation in SRE helps organizations eliminate manual work, giving developers more time to innovate and create.

Also, in comparison to DevOps, SRE provides a good set of detailed steps in each part of the framework to reach a particular goal.

While DevOps and SRE sound like they are on opposite sides of the spectrum, both approaches share the same end goals.

  • To make incremental changes fast and efficiently
  • To reduce the number of organization silos
  • To have a flexible, open-minded, and adaptable working culture
  • To use automation wherever possible
  • To monitor performance and improve when necessary

To go back in time slightly: in the old-school era of system administration, the sysadmin’s job was mostly to assemble existing software components and deploy them to work together to produce a service. As a system grows in complexity and traffic volume, it needs a larger sysadmin team, which increases both the direct and the indirect costs to the organization (the indirect costs come from differences with the dev team in culture, background, skill set, goals, etc.). While the dev team wants to launch new features, the ops team wants to maintain the status quo to ensure service continuity. Hence, the two teams’ goals are fundamentally in tension.

Toil is mundane, repetitive operational work that provides no enduring value and scales linearly with service growth. Taking humans out of the release process can, paradoxically, reduce SREs’ toil while increasing system reliability.

SRE – Google’s approach to Service Management

SRE is what happens when we ask a software engineer to design an operations team; the common aptitude is developing software systems to solve complex problems. The motivation is: “As a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks.”

SRE teams are generally made up of 50-60% regular software engineers; the other 40-50% are engineers who come very close to the software engineering qualifications and bring rarer skills such as UNIX system internals and networking expertise.

SRE teams should focus on engineering to avoid having to scale the team linearly with the service. SREs ensure that the service “runs and repairs itself”. SREs should typically spend no more than 50% of their time on ops work, and the remaining time on coding for the project itself.
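The 50% guideline is easy to check once time is tracked. The sketch below is one possible way to do it, assuming hours are logged in two buckets (ops work and engineering work); the names `OPS_CAP`, `ops_fraction`, and `over_ops_cap` are hypothetical.

```python
OPS_CAP = 0.5  # ceiling on the share of time spent on ops work


def ops_fraction(ops_hours: float, engineering_hours: float) -> float:
    """Share of logged time that went to ops work."""
    total = ops_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def over_ops_cap(ops_hours: float, engineering_hours: float, cap: float = OPS_CAP) -> bool:
    """True when ops work exceeds the cap and should be redirected or automated."""
    return ops_fraction(ops_hours, engineering_hours) > cap


if __name__ == "__main__":
    # 26 ops hours out of a 40-hour week is 65%, which is over the cap.
    print(over_ops_cap(26, 14))
    # 18 ops hours out of 40 is 45%, which is within the cap.
    print(over_ops_cap(18, 22))
```

A team that stays over the cap for several periods in a row is a signal that toil is growing faster than automation is removing it.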

Where to start?

Organizations must identify change agents who will create and promote a culture of maximum system availability. They can champion this change by practicing the principle of observability, of which monitoring is a subset. Observability essentially requires engineering teams to stay vigilant for the common and complex problems that hinder the attainment of reliability and scalability in the application. See the principles of observability below:

[Figure: principles of observability (image not preserved)]

Tenets of SRE

For a given service, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Google operates a “blame-free postmortem culture”, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Change Management

Data suggest that ~70% of outages are due to changes in a live system. Remove humans from the loop and use automation to achieve the following:

  1. Implement progressive rollouts
  2. Detect problems quickly and accurately
  3. Roll back changes safely when problems arise
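The three practices above can be sketched as a single control loop. This is a simplified illustration, not a production deployment tool: `deploy`, `error_rate`, and `rollback` are hypothetical callables that a real pipeline would supply, and the stage percentages and the 1% error threshold are assumed values.

```python
from typing import Callable, Sequence


def progressive_rollout(
    deploy: Callable[[int], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    stages: Sequence[int] = (1, 10, 50, 100),
    max_error_rate: float = 0.01,
) -> bool:
    """Roll a change out stage by stage; roll back at the first unhealthy stage."""
    for percent in stages:
        deploy(percent)                    # 1. widen exposure gradually
        if error_rate() > max_error_rate:  # 2. check health right after each step
            rollback()                     # 3. revert before more users are affected
            return False
    return True


if __name__ == "__main__":
    # Simulated service: errors spike once half of the traffic sees the change.
    state = {"percent": 0}
    log = []

    def deploy(percent: int) -> None:
        state["percent"] = percent
        log.append(f"deploy {percent}%")

    def error_rate() -> float:
        return 0.05 if state["percent"] >= 50 else 0.001

    def rollback() -> None:
        state["percent"] = 0
        log.append("rollback")

    print(progressive_rollout(deploy, error_rate, rollback), log)
```

In the simulated run the rollout stops at the 50% stage and rolls back, so the change never reaches the full user base. A real implementation would also wait for metrics to settle between stages.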

SRE represents a significant break from existing industry best practices for managing large, complicated services.

Benefits of SRE

[Figure: benefits of SRE (image not preserved)]

No more toiling: organizations should embrace SRE and make their end customers happy.

References –

  • Site Reliability Engineering: How Google Runs Production Systems, O’Reilly
  • https://www.cncf.io/blog/2020/07/17/site-reliability-engineering-sre-101-with-devops-vs-sre/
  • https://www.overops.com/blog/devops-vs-sre-whats-the-difference-between-them-and-which-one-are-you/
  • https://www.cmswire.com/information-management/weighing-the-differences-between-devops-and-sre-which-one-is-right-for-you/

Author

Vasudevan Gopalan

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc. Outside of his professional role, Vasu enjoys playing badminton and is a fitness enthusiast.



Copyright © 2022, GAVS Technologies.
