Site Reliability Engineering

Vasudevan Gopalan

Software engineering is akin to having children; the labor before birth is painful, and the labor after birth is where we dedicate most of our efforts😊.

Software engineering as a discipline spends more time talking about the first phase, but research clearly suggests that 40-90% of the total cost is incurred after the birth of a system. These costs go into keeping the platform reliable.

Why should platforms be reliable? Because the average consumer demands speed, convenience, and reliability from every digital experience. While availability focuses on the platform’s operational quotient, reliability focuses on the platform’s usefulness quotient.

Site Reliability Engineering is a practice and cultural shift towards creating a robust IT operations process that would instill stability, high performance, and scalability to the production environment.

Reliability is the most fundamental feature of any product; a system is not useful if nobody can use it!

Site Reliability Engineers (SREs) are engineers – applying the principles of computer science and engineering to the design and development of computing systems, generally large, distributed ones. As Ben Treynor Sloss of Google states – SRE is what happens when a software engineer is tasked with what used to be called operations. Automation, self-healing, scalability, resilience – these characteristics become mainstream.

An SRE function is run by IT operations specialists who code. These specialist engineers implement a software-first approach to automate IT operations and preempt failures. They apply cutting-edge software practices to integrate Dev and Ops on a single platform and execute test code across the continuous integration/delivery environment. They possess advanced software skills, including DNS configuration, remediation of server, network, and infrastructure problems, and fixing application glitches.

The software approach codifies every aspect of IT operations to build resilience within infrastructure and applications. Thus, changes are managed via version control tools and checked for issues leveraging test frameworks, while following the principle of observability.
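As a minimal illustration of this “operations as code” idea, the Python sketch below keeps a hypothetical deployment configuration in code and gates it with an automated check before it is promoted; the service name, fields, and thresholds are illustrative assumptions.

# A minimal sketch of "operations as code": a deployment configuration kept in
# version control and validated by an automated test before rollout.
# The config values and thresholds here are illustrative assumptions.

deployment_config = {
    "service": "payments-api",        # hypothetical service name
    "replicas": 3,
    "max_cpu_millicores": 500,
    "health_check_path": "/healthz",
}

def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in the config (empty list = OK)."""
    problems = []
    if config.get("replicas", 0) < 2:
        problems.append("replicas must be >= 2 for redundancy")
    if not config.get("health_check_path", "").startswith("/"):
        problems.append("health_check_path must be an absolute path")
    if config.get("max_cpu_millicores", 0) <= 0:
        problems.append("max_cpu_millicores must be positive")
    return problems

if __name__ == "__main__":
    issues = validate_config(deployment_config)
    assert not issues, f"config rejected: {issues}"
    print("config passed validation; safe to promote via the pipeline")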

The Principle of Error Budget

SRE engineers verify the code quality of changes to the application by asking the development team to produce evidence via automated test results. SRE managers can set Service Level Objectives (SLOs) to gauge the performance of changes to the application. They should also set a threshold for the permissible amount of application downtime, known as the Error Budget. If the downtime caused by changes to the application stays within the Error Budget, the SRE team can approve them. If not, the changes should be rolled back and improved until they fall within the Error Budget.

Error Budgets tend to bring balance between SRE and application development by mitigating risks. An Error Budget remains unaffected as long as system availability stays within the SLO. The Error Budget can always be adjusted by managing the SLOs or enhancing IT operability. The ultimate goal remains application reliability and scalability.
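As a simple illustration of the arithmetic, the Python sketch below computes the error budget implied by an availability SLO over a 30-day window; the SLO target, window, and downtime figures are assumed values, not prescriptions.

# A minimal sketch of error-budget arithmetic for an availability SLO.
# The SLO target and window below are illustrative assumptions.

SLO_TARGET = 0.999              # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60   # 30-day rolling window

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES   # ~43.2 minutes

def budget_remaining(downtime_minutes: float) -> float:
    """Minutes of allowable downtime left in the current window."""
    return error_budget_minutes - downtime_minutes

# Example: 20 minutes of downtime so far leaves ~23.2 minutes of budget, so a
# risky change could still be approved; a negative value would mean the budget
# is exhausted and further changes should be held back.
print(f"budget: {error_budget_minutes:.1f} min, remaining: {budget_remaining(20):.1f} min")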

DevOps and SRE

We know that DevOps is all about culturally combining development and operations. While DevOps dwells on what needs to be done for this, SRE focuses on how it must be done.

DevOps brings the traditionally separate teams of development and operations under one roof to improve upon collaboration, communication, integration, and software releases. This is accomplished by the focus on end-to-end automation of builds and deployments as well as effectively managing the entire infrastructure as code.

SRE is a discipline that takes the various aspects of software development and applies them to issues and tasks in IT operations. The main objective of SRE is to develop highly reliable and ultra-scalable software applications and systems. The prime focus is to automate most (if not all) of the tasks to ensure reliability in the systems. The ‘relentless’ pursuit of automation in SRE helps brands eliminate manual work, giving developers more time to innovate and create.

Also, in comparison to DevOps, SRE provides a good set of detailed steps in each part of the framework to reach a particular goal.

While DevOps and SRE sound like they are on opposite sides of the spectrum, both approaches share the same end goals.

  • To make incremental changes quickly and efficiently
  • To reduce the number of organizational silos
  • To have a flexible, open-minded, and adaptable working culture
  • To use automation wherever possible
  • To monitor performance and improve where necessary

To go back in time a little: in the old-school era of system administrators, the sysadmin mostly assembled existing software components and deployed them to work together to produce a service. As a system grows in complexity and traffic volume, a larger sysadmin team is needed, increasing both the direct and indirect costs (differences with the dev team in culture, background, skill set, goals, etc.) to the organization. While the dev team wants to launch new features, the ops team wants to maintain the status quo to ensure service continuity. Hence, the two teams’ goals are fundamentally in tension.

Toil is mundane, repetitive operational work providing no enduring value, which scales linearly with service growth. Taking humans out of the release process can paradoxically reduce SRE’s toil while increasing system reliability.

SRE – Google’s approach to Service Management

SRE is what happens when we ask a software engineer to design an operations team, the common aptitude being the development of software systems to solve complex problems, motivated by the mindset: “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks.”

SRE teams generally have 50-60% regular software engineers, with the other 40-50% being near-software-engineers who bring rarer skills such as UNIX system internals and networking expertise.

SRE teams should focus on engineering to avoid the fate of scaling the team linearly with the service. SREs ensure that the service “runs and repairs itself”. SREs should typically spend no more than 50% of their time on ops work, with the remaining time spent coding for the project itself.

Where to start?

Organizations must identify change agents who will create and promote a culture of maximum system availability. They can champion this change by practicing the principle of observability, of which monitoring is a subset. Observability essentially requires engineering teams to be vigilant about the common and complex problems hindering the attainment of reliability and scalability in the application.

Tenets of SRE

For a given service, SRE takes responsibility for availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning, etc.

Google operates a “blame-free postmortem culture”, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Change Management

Data suggests that ~70% of outages are due to changes in a live system. Removing humans from the loop and automating helps achieve the following (a minimal sketch follows the list):

  1. Implement progressive rollouts
  2. Detect problems quickly and accurately
  3. Roll back changes safely when problems arise
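The Python sketch below illustrates one way such automation might look: a canary check that compares the canary’s error rate against the stable baseline and decides, without human intervention, whether to promote or roll back. The metric values, threshold, and function names are illustrative assumptions.

# A minimal sketch of an automated canary check: compare the canary's error
# rate against the stable baseline and roll back without human intervention.
# The threshold and function names are illustrative assumptions.

ROLLBACK_THRESHOLD = 2.0   # canary may be at most 2x the baseline error rate

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def canary_decision(canary_errors, canary_requests, base_errors, base_requests) -> str:
    canary = error_rate(canary_errors, canary_requests)
    baseline = error_rate(base_errors, base_requests)
    # Treat a near-zero baseline as a small floor to avoid amplifying noise.
    if canary > max(baseline, 0.001) * ROLLBACK_THRESHOLD:
        return "rollback"
    return "promote"

# Example: canary at 1.5% errors vs. a 0.4% baseline -> automated rollback.
print(canary_decision(15, 1000, 4, 1000))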

SRE represents a significant break from existing industry best practice for managing large, complicated services.

Benefits of SRE

No more toil; organizations should embrace SRE and make their end customers happy.

About the Author –

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc. Outside of his professional role, Vasu enjoys playing badminton and is a fitness enthusiast.

API Security

Logaiswar S

“An unsecured API is literally an ‘all you can eat buffet’ for hackers.”

What is API security?

API security is the protection of the network-exposed APIs that an organization both owns and uses. APIs are becoming the preferred method for developing new-age applications, and they are one of the most common ways for microservices, containers, systems, and apps to interact. APIs are typically developed using REST or SOAP. However, the true strength of API security depends on how they are implemented.

REST API Security vs. SOAP API Security

REST APIs use HTTP and support Transport Layer Security (TLS) encryption. TLS is a standard that keeps the connection private and ensures that the data transferred between the two systems (client and server) is encrypted. REST APIs are faster than SOAP because of their stateless nature; a REST API doesn’t need to store or repackage data.

SOAP APIs use built-in protocols known as Web Services Security (WS Security). These protocols are defined using a rule set guided by confidentiality and authentication. SOAP APIs have been around longer than REST APIs and are often considered more secure, as they use WS Security for transmission along with SSL/TLS.
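As a small illustration of TLS in practice, the Python sketch below calls a hypothetical REST endpoint over HTTPS; modern Python verifies the server certificate by default, and the URL is a placeholder assumption.

# A minimal sketch of calling a REST API over TLS. Modern Python verifies the
# server certificate by default, so the connection is private and tamper-evident.
# The URL below is a placeholder assumption.

import json
import urllib.request

url = "https://api.example.com/v1/orders/123"   # hypothetical endpoint

# urlopen() performs the TLS handshake and rejects invalid certificates;
# never disable verification in production code.
with urllib.request.urlopen(url, timeout=5) as response:
    payload = json.loads(response.read().decode("utf-8"))
    print(payload)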

Why is API security important?

Organizations use APIs to connect services and transfer data. Most major data breaches involving APIs stem from broken, exposed, or hacked APIs. How API security is applied depends on the kind of data being transferred.

Security testing of APIs is currently a challenge for 35% of organizations, which need better capabilities than current DAST and SAST technologies offer to automatically discover APIs and test them. Organizations are moving from monolithic web applications to modern applications, such as those that make heavy use of client-side JavaScript or that use a microservices architecture.

How does API security work?

API security depends on authentication and authorization. Authentication is the first step; it verifies the identity of the client application calling the API. Authorization is the subsequent step, which determines what data and actions an authenticated application can access while interacting with the API.
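The framework-free Python sketch below separates the two steps; the token table, scope names, and client identities are illustrative assumptions, not any specific product’s API.

# A minimal, framework-free sketch separating authentication from authorization.
# The token table and scope names are illustrative assumptions.

# Step 1 (authentication): map a presented token to a known client identity.
API_TOKENS = {
    "token-abc123": {"client": "mobile-app", "scopes": {"orders:read"}},
    "token-def456": {"client": "back-office", "scopes": {"orders:read", "orders:write"}},
}

def authenticate(token: str) -> dict | None:
    """Return the client record if the token is recognised, else None."""
    return API_TOKENS.get(token)

# Step 2 (authorization): decide what the authenticated client may do.
def authorize(client: dict, required_scope: str) -> bool:
    return required_scope in client["scopes"]

def handle_request(token: str, required_scope: str) -> str:
    client = authenticate(token)
    if client is None:
        return "401 Unauthorized"        # identity could not be established
    if not authorize(client, required_scope):
        return "403 Forbidden"           # identity known, action not permitted
    return f"200 OK for {client['client']}"

print(handle_request("token-abc123", "orders:write"))   # -> 403 Forbidden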

APIs should be developed with protective features to reduce the system’s vulnerability to malicious attacks during API calls.

The developer is responsible for ensuring that the API successfully validates all the input collected from the user during API calls. Prepared statements with bind variables are one of the most effective ways to protect an API from SQL injection. XSS can be handled by sanitizing the user input received in the API call; sanitizing the inputs helps ensure that potential XSS vulnerabilities are minimized.
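The standard-library Python sketch below illustrates both controls: a prepared statement with a bind variable against SQL injection, and escaping of user-supplied text against XSS. The table, column names, and sample inputs are illustrative assumptions.

# A minimal sketch of the two controls mentioned above, using only the standard
# library: a prepared statement with a bind variable against SQL injection, and
# escaping of user-supplied text against XSS. Table and inputs are assumptions.

import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

user_input = "alice' OR '1'='1"   # a typical injection attempt

# The bind variable (?) keeps the input as data, never as SQL text.
rows = conn.execute("SELECT id, name FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] -- the injection attempt matches nothing

# Escaping before rendering prevents the input from executing as script.
comment = "<script>alert('xss')</script>"
print(html.escape(comment))   # &lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;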

Best Practice for Secure API

Some basic security practices and well-established security controls for APIs that are shared publicly are as follows:

  • Prioritize security: Unsecured APIs expose the organization to potential loss, so make security a priority and build APIs securely as they are being developed.
  • Encrypt traffic using TLS: Some organizations may choose not to encrypt API payload data that is considered non-sensitive, but for organizations whose APIs exchange sensitive data, TLS encryption is essential.
  • Validate input: Never pass input from an API through to the endpoint without validating it first.
  • Use a WAF: Ensure that the web application firewall can understand API payloads.
  • Use tokens: Establish trusted identities and then control access to services and resources by using tokens (see the sketch after this list).
  • Use an API gateway: API gateways act as the main point of enforcement for API traffic. A good gateway allows you to authenticate traffic as well as control and analyze how your APIs are used.
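As a minimal sketch of the token practice referenced above, the Python snippet below issues and verifies an HMAC-signed token of the kind a gateway might check at the edge; the secret key and client identifier are illustrative assumptions.

# A minimal sketch of the "use tokens" practice: issue an HMAC-signed token
# for a trusted identity and verify it at the edge (e.g., at an API gateway).
# The secret key and client id are illustrative assumptions.

import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"    # placeholder; keep real secrets in a vault

def issue_token(client_id: str) -> str:
    signature = hmac.new(SECRET_KEY, client_id.encode(), hashlib.sha256).hexdigest()
    return f"{client_id}.{signature}"

def verify_token(token: str) -> bool:
    client_id, _, signature = token.partition(".")
    expected = hmac.new(SECRET_KEY, client_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)

token = issue_token("partner-42")
print(verify_token(token))                    # True  -- accepted
print(verify_token("partner-42.forged-sig"))  # False -- rejected at the gateway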

Modern API Data breach

USPS Corporate Database Exposure

The weakness allowed an attacker to query the USPS website and scrape a database of over 60 million corporate users, including email addresses, phone numbers, account numbers, etc.

Exploitation

The issue was authentication-related, allowing unauthorized access to an API service called ‘Informed Visibility’, which was designed to deliver real-time tracking data for large-scale shipping operations.

This tracking system was tied into the web API in a way that let users change the search parameters and view, and in some cases even modify, the information of other users. Since there was no robust anti-scraping system in place, this mass exposure was compounded by the automated and unfettered access available.

Lessons Learned

Providers giving extreme power to a specific service or function without securing every permutation of its interaction flow can lead to such exploits. To mitigate API-related risks, coding should be done with the assumption that the APIs might be abused by both internal and external forces.

About the Author –

Logaiswar is a security enthusiast with core interest in Application & cloud security. He is part of the SOC DevSecOps vertical at GAVS supporting critical customer engagements.